
How to Extract Links, URLs & Hyperlinks from PDF Files

Published By Nilesh Kumar
Approved By Debasish Pramanik
Published On May 20th, 2024
Reading Time 9 Minutes
Category PDF Tips

A PDF file contains various media components like images, text, links, and videos. It makes the PDF file a versatile file format. Sometimes, users want to extract specific components from the PDF file. In this article, we will discuss different methods to extract links from PDF files. Whether you call them URLs, hyperlinks, or web links, you can export them in the form of a separate file.


Reasons to Extract Links From PDF Files

There are various reasons why extracting links from a PDF file can be helpful.

  • Extracting links gives quick access to the referenced documents and websites without hunting through the PDF for URLs, and lets users visit the linked sites directly.
  • For additional research or citation management, scholars or students can create a list of all external references or sources cited in the PDF.
  • Extracting links from documents makes it easier to find and update broken or out-of-date links.
  • Developers may need to extract links in order to integrate content into other systems or apps using APIs or custom scripts.
  • Businesses can keep an eye on the links inside PDFs to make sure they’re adhering to any regulations or business policies.
  • Links can be extracted and scanned by security teams to make sure they don’t point to harmful websites.

You can improve the usability, accessibility, and manageability of the data in PDF documents by extracting links.

Method 1. Extract Links from PDF Files using Python

There is no official, built-in method to extract hyperlinks from PDF files. Therefore, users often rely on scripting languages such as Python or JavaScript to export URLs from PDF documents. Non-programmers may find these methods difficult, so we have simplified the process as much as possible. For this process, you must have Python installed on your device, along with the PyPDF2 library (for example, via pip install PyPDF2).

Here is the process to extract links from PDF files:

  1. Use Python with libraries like PyPDF2 or pdfplumber to extract text from the PDF.
  2. Then, use regular expressions to search for URLs within the extracted text. The `re` library in Python is useful for this.

The Python code is given below:

# Requires PyPDF2 2.0 or later (the older PdfFileReader API was removed in 3.0)
import re
import PyPDF2

pdf_path = "your_pdf_file.pdf"

# 1. Open the PDF file
with open(pdf_path, "rb") as f:
    pdf_file = PyPDF2.PdfReader(f)

    # 2. Extract text from every page of the PDF
    pdf_text = ""
    for page in pdf_file.pages:
        pdf_text += page.extract_text() or ""

# 3. Use a regular expression to find URLs in the extracted text
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', pdf_text)

# 4. Print the URLs
for url in urls:
    print(url)
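A regular expression only finds URLs that are printed in the page text. Links attached to words, images, or buttons are usually stored as link annotations instead. The following is a rough sketch of how those could be read, assuming PyPDF2 2.0 or later. The function names uris_from_pages and extract_link_uris are made up for this example and are not part of the library.

```python
def _resolve(obj):
    """Follow an indirect PDF reference if there is one; otherwise return obj unchanged."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def uris_from_pages(pages):
    """Collect the /URI targets of link annotations from dict-like page objects."""
    uris = []
    for page in pages:
        annotations = _resolve(page.get("/Annots")) or []
        for ref in annotations:
            annotation = _resolve(ref)
            if _resolve(annotation.get("/Subtype")) != "/Link":
                continue  # skip comments, form fields, and other annotation types
            action = _resolve(annotation.get("/A")) or {}
            uri = _resolve(action.get("/URI"))
            if uri is None:
                continue  # internal links (page jumps) carry no /URI entry
            if isinstance(uri, bytes):
                uri = uri.decode("utf-8", "replace")
            uris.append(str(uri))
    return uris


def extract_link_uris(pdf_path):
    """Open a PDF with PyPDF2 and return the URIs of its link annotations."""
    # Imported here so that uris_from_pages can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as handle:
        return uris_from_pages(PdfReader(handle).pages)
```

Calling extract_link_uris("your_pdf_file.pdf") should return a list of strings that you can print or save. Combining this list with the URLs found by the regular expression gives broader coverage than either approach alone.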

Method 2. Extract Links from PDF Files using Adobe Acrobat Reader DC

Using Adobe Acrobat Reader DC you can manually extract hyperlinks from PDF files. Here is the detailed step-by-step guide:

  • Install Adobe Acrobat Reader DC if it isn’t already on your computer. It is available as a free download from the Adobe website.
  • Launch Adobe Acrobat Reader DC.
  • Choose “File” from the top menu bar.
  • From the dropdown menu, choose “Open”.
  • Locate and select your PDF file, then click “Open”.
  • Scroll through the document to find the pages that contain the hyperlinks you wish to extract.
  • Move your mouse pointer over the text or image that contains the embedded link. The pointer changes to a pointing hand to show that the area is clickable.
  • Right-click the link.
  • Select “Copy Link Address” from the context menu that appears (labeled “Copy Link Location” in some versions). This copies the hyperlink’s URL to your clipboard.
  • Open a new file in a text editor.
  • Press Ctrl+V (on a Windows computer) to paste the copied URL.
  • Repeat the previous steps for each link in the document.

Because Adobe Acrobat Reader DC does not offer an automated way to extract all links at once, you have to copy each link manually by following the above instructions. For documents with a small number of links, this technique works well and gives you the exact URLs that you need.

Method 3. Extract Links from PDF Files using the Command Line Tools

This section explains how to use the command-line programs pdfgrep, pdftohtml with grep, and pdftotext with grep to extract hyperlinks from PDF files on a Windows system.

Sub Method 1. Using pdfgrep

  • pdfgrep is not natively available on Windows, but you can use it through the Windows Subsystem for Linux (WSL) or Cygwin.

Using WSL

  • Set up WSL: as an administrator, launch PowerShell and run:

wsl --install

This installs WSL with Ubuntu as the default distribution.

In WSL, install pdfgrep:

  • Launch Ubuntu from your Start menu to access the WSL terminal.
  • Update your package lists, then install pdfgrep:

sudo apt-get update

sudo apt-get install pdfgrep

  • In the WSL terminal, navigate to the directory containing your PDF file. For example:

cd /mnt/c/Users/YourUsername/Downloads

  • To find URLs in your PDF file, run the following command:

pdfgrep -o 'http[s]?://[^ ]+' yourfile.pdf

Using Cygwin to extract links from PDF files

  • From cygwin.com, download and launch the Cygwin installer.
  • Make sure the pdfgrep package is selected in the Text category during setup.
  • Open the Cygwin terminal.
  • Go to the folder where your PDF file is stored. For instance:

cd /cygdrive/c/Users/YourUsername/Downloads

  • Run the command to search for URLs:

pdfgrep -o 'http[s]?://[^ ]+' yourfile.pdf

Sub Method 2. Using pdftohtml and grep to extract hyperlinks from PDF files

  • Windows users can use the pdftohtml tool from the Poppler utilities together with grep from GnuWin32.
  • Install Poppler on Windows:
  • Download the most recent Poppler binaries for Windows, which are published in a GitHub repository.
  • Extract the zip file to a directory, such as C:\poppler.
  • Install grep from GnuWin32:
  • Open the GnuWin32 project page, download grep, and install it.
  • Add Poppler and GnuWin32 to the PATH variable:
  • Right-click “This PC” or “My Computer” on the desktop or in File Explorer and choose “Properties”.
  • Click “Advanced System Settings”, then choose “Environment Variables”.
  • Under the “System variables” section, find the Path variable, select it, and then click “Edit”.
  • Add the bin directories of Poppler and GnuWin32, for example C:\poppler\bin and C:\Program Files (x86)\GnuWin32\bin.

  • Convert the PDF to HTML:
  • Open the Command Prompt.
  • Navigate to the directory that contains your PDF:

cd C:\Users\YourUsername\Downloads

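  • Run pdftohtml with the -noframes option so that the page content, including the links, is written to a single HTML file rather than split across frame files:

pdftohtml -noframes yourfile.pdf output.html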

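  • Use grep to find URLs in the HTML file. The Command Prompt does not treat single quotes as quoting characters, so wrap the pattern in double quotes:

grep -o "http[s]*://[^\"]*" output.html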
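If you would rather not depend on a grep pattern, the same HTML file can be read with Python's standard html.parser module. This is only a sketch, assuming that output.html holds the converted page content. The names HrefCollector, links_from_html, and links_from_html_file are invented for this example.

```python
from html.parser import HTMLParser
from pathlib import Path


class HrefCollector(HTMLParser):
    """Collect http(s) href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep web links only; anchors such as "#page2" are skipped.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def links_from_html(html_text):
    """Return the web links found in a string of HTML, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


def links_from_html_file(path):
    """Read an HTML file (for example output.html) and return its web links."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return links_from_html(text)
```

One advantage over grep is that the parser decodes HTML entities, so a link written as &amp; in the file comes back with a plain & in the URL.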

Sub Method 3. Using pdftotext and grep to extract links from PDF files

  • To download and install Poppler for Windows, use the identical procedures as in Sub method 2.
  • To download and install grep from GNUWin32, use the same procedures as in Sub method 2.

Convert a PDF to Text:

  • Open the Command Prompt.
  • Navigate to the directory that contains your PDF:

cd C:\Users\YourUsername\Downloads

  • Run pdftotext:

pdftotext yourfile.pdf output.txt

Retrieve Links

  • To locate URLs in the text file, use grep. In the Command Prompt, wrap the pattern in double quotes:

grep -o "http[s]*://[^ ]*" output.txt
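URLs pulled out of plain text often carry trailing punctuation (a full stop or closing bracket from the sentence around them) and can appear more than once. As a small sketch of one way to tidy them, the Python below removes trailing punctuation, drops duplicates, and saves the result to a separate file. The function names and the links.txt file name are assumptions for illustration only.

```python
import re
from pathlib import Path

# A deliberately simple pattern: a URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

# Characters that usually belong to the surrounding sentence, not the URL.
# Trade-off: a URL that really ends in ")" will lose that character.
TRAILING_PUNCTUATION = ".,;:!?)]}'\""


def clean_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    result = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(TRAILING_PUNCTUATION)
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def save_urls(text_path, out_path):
    """Read a text file, extract its URLs, and write them one per line to out_path."""
    text = Path(text_path).read_text(encoding="utf-8", errors="replace")
    urls = clean_urls(text)
    Path(out_path).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return urls
```

For example, save_urls("output.txt", "links.txt") would read the file produced by pdftotext and write the cleaned list to links.txt.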

Limitations of the Manual Solution to Extract URLs from PDF Files

The manual solution for extracting links from PDF files using Adobe Acrobat Reader DC has several limitations:

  • Copying links manually takes a long time, especially for documents with many links.
  • Human error is more likely, such as missing a link by accident, copying the wrong portion of the text, or failing to capture the full URL.
  • This approach is unsuitable for large documents with many pages and links.
  • Adobe Acrobat Reader DC has no option to extract all links at once, so you have to copy each link separately.
  • There is no built-in function to filter or search specifically for links, and it can be difficult to spot links visually on dense or complex pages.
  • Links embedded in photos may not be recognized and copied by Adobe Acrobat Reader DC if the PDF has been scanned or contains images.

Despite being simple to use and not requiring any other software, the manual technique of extracting hyperlinks from PDF files using Adobe Acrobat Reader DC is not appropriate for tasks that call for accuracy, efficiency, or handling large numbers of links.

Modern Tip to Extract Hyperlinks from PDF Files

If the manual solution does not meet your needs, you can use a dedicated tool, PDF Extractor Software, to extract links from PDF files. This tool is easy to use and requires no programming knowledge. It can export your URLs in PDF, DOC, or DOCX file format. You can even fetch links from specific PDF pages, because it provides multiple page options.

Conclusion

There are four methods to extract links from PDF files. Users can choose the method that suits them best. But do remember that the effectiveness of these methods can vary depending on the complexity of the PDF file and how the links are embedded in the document. Links can also be hidden behind buttons or icons and may require more advanced techniques to extract them.
