Python search pdf for text

Question

Problem
I'm trying to determine what type a document is (e.g. pleading, correspondence, subpoena, etc.) by searching through its text, preferably using Python. All the PDFs are searchable, but I haven't found a way to parse them with Python and run a search script over them (short of converting each one to a text file first, which could be resource-intensive for n documents).

What I've done so far
I've looked into pypdf, pdfminer, the Adobe PDF documentation, and any questions here I could find (though none seemed to solve this issue directly). PDFminer seems to have the most potential, but after reading through the documentation I'm not even sure where to begin.

Is there a simple, effective method for reading PDF text, either by page, line, or the entire document? Or any other workarounds?

asked Jun 13, 2013 at 23:07

This is called PDF mining, and is very hard because:

PDF is a document format designed to be printed, not to be parsed. Inside a PDF document, text is in no particular order (unless order matters for printing), and most of the time the original text structure is lost: letters may not be grouped into words, words may not be grouped into sentences, and the order in which they are placed on the page is often random.
Many programs generate PDFs, and many of them are defective.

Tools like PDFminer use heuristics to regroup letters and words based on their position on the page. I agree that the interface is pretty low-level, but it makes more sense once you know what problem it is trying to solve: in the end, what matters is choosing how close to its neighbors a letter, word, or line has to be in order to be considered part of the same paragraph.
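
To make the "how close is close enough" idea concrete, here is a minimal sketch of gap-based grouping. It is not PDFminer's actual code: the glyph tuples, the coordinates, and the max_gap parameter are all made up for illustration, and a real tool also has to deal with lines, fonts, and rotation.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x_start, x_end) in arbitrary page units.
Glyph = Tuple[str, float, float]


def group_into_words(glyphs: List[Glyph], max_gap: float = 1.0) -> List[str]:
    """Join the glyphs of one text line into words.

    A new word starts whenever the horizontal gap between two
    neighboring glyphs is larger than max_gap.
    """
    words: List[str] = []
    current = ""
    previous_end = None
    # A PDF may store glyphs in any order, so sort them by position first.
    for char, x_start, x_end in sorted(glyphs, key=lambda g: g[1]):
        if previous_end is not None and x_start - previous_end > max_gap:
            words.append(current)
            current = ""
        current += char
        previous_end = x_end
    if current:
        words.append(current)
    return words


# The glyphs of "Hi there", deliberately stored out of order.
line = [
    ("t", 6.0, 7.0), ("H", 0.0, 2.0), ("e", 10.2, 11.0), ("i", 2.2, 3.0),
    ("h", 7.1, 8.0), ("r", 9.1, 10.0), ("e", 8.2, 9.0),
]

print(group_into_words(line))               # threshold smaller than the word gap
print(group_into_words(line, max_gap=5.0))  # threshold larger than the word gap
```

With the default threshold the two words come apart, while a threshold that is too generous glues them together, which is exactly the tuning problem described above.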

An expensive alternative (in terms of time and computing power) is to generate an image of each page and feed it to OCR. It may be worth a try if you have a very good OCR engine.

So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files. If your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gamble.

I would really like to be proven wrong.

[update]

The answer has not changed but recently I was involved with two projects: one of them is using computer vision in order to extract data from scanned hospital forms. The other extracts data from court records. What I learned is:

Computer vision is within reach of mere mortals in 2018. If you have a good sample of already classified documents, you can use OpenCV or scikit-image to extract features and train a machine learning classifier to determine what type a document is.
If the PDF you are analyzing is "searchable", you can get very far by extracting all the text with a tool like pdftotext and feeding it to a Bayesian filter (the same kind of algorithm used to classify spam).

So there is no reliable and effective method for extracting text from PDF files but you may not need one in order to solve the problem at hand (document type classification).
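
As a rough illustration of the Bayesian-filter idea, here is a small naive Bayes classifier written with only the standard library. It is a sketch under assumptions, not the code from either project: the training sentences and labels are invented, the text is assumed to have been extracted already (for example with pdftotext), and a real system would need far more training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of training documents
        self.vocabulary = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus the sum of log likelihoods, to avoid underflow.
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(self.word_counts[label].values())
            for word in words:
                # Add-one (Laplace) smoothing so unseen words do not zero out a label.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (label_total + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data; real documents would be much longer.
classifier = NaiveBayesClassifier()
classifier.train("You are hereby commanded to appear and produce the documents listed below", "subpoena")
classifier.train("Subpoena to testify at a deposition; you are commanded to appear", "subpoena")
classifier.train("Dear counsel, thank you for your letter. Sincerely yours", "correspondence")
classifier.train("Dear Ms. Smith, please find enclosed the letter. Best regards", "correspondence")
classifier.train("Plaintiff alleges the following complaint against defendant and prays for relief", "pleading")
classifier.train("Defendant answers the complaint and denies each allegation of plaintiff", "pleading")

print(classifier.classify("You are commanded to appear before the court"))
```

The point is that the classifier only needs a bag of words, so imperfect extraction (scrambled word order, odd line breaks) matters much less than it would for structured data extraction.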

answered Jun 14, 2013 at 0:52

Paulo Scardine

I am a complete beginner, but this script works for me:

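# import packages
import re

import PyPDF2

# open the pdf file
# (PdfReader, pages and extract_text() replace the PdfFileReader,
# getNumPages()/getPage() and extractText() names that PyPDF2 3.0 removed)
reader = PyPDF2.PdfReader("test.pdf")

# define keyterms; re.search() treats this as a regular expression
search_term = "Social"

# extract text and do the search (page numbers printed here are 0-based)
for i, page in enumerate(reader.pages):
    print("this is page " + str(i))
    text = page.extract_text()
    # print(text)
    res_search = re.search(search_term, text)
    # prints a Match object, or None when the page has no match
    print(res_search)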

answered Jun 10, 2018 at 3:46

Emma Yu

I've written extensive systems for the company I work for to convert PDFs into data for processing (invoices, settlements, scanned tickets, etc.), and @Paulo Scardine is correct: there is no completely reliable and easy way to do this. That said, the fastest, most reliable, and least resource-intensive way is to use pdftotext, part of the xpdf set of tools. This tool quickly converts searchable PDFs to a text file, which you can read and parse with Python. Hint: use the -layout argument. And by the way, not all PDFs are searchable, only those that contain text. Some PDFs contain only images with no text at all.
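
For anyone who wants to drive pdftotext from Python, something along these lines should work. This is a sketch, assuming the pdftotext binary is installed and on your PATH; the function names are my own. Passing "-" as the output file makes pdftotext write to standard output, so no intermediate text file is needed.

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=True):
    """Build the argument list for a pdftotext call."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # keep the physical layout of the page
    command += [pdf_path, "-"]     # "-" sends the text to stdout instead of a file
    return command


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A call such as pdf_to_text("brief.pdf") (a hypothetical file name) then returns the whole document as one string, and pdftotext separates pages with form feed characters, so text.split("\f") gives you the text page by page.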

answered Jun 14, 2013 at 1:07

MikeHunter

I recently started using ScraperWiki to do what you described.

Here's an example of using ScraperWiki to extract PDF data.

The scraperwiki.pdftoxml() function returns an XML structure.

You can then use BeautifulSoup to parse that into a navigable tree.

Here's my code:

import urllib.request  # Python 3 replacement for urllib2

import scraperwiki
from bs4 import BeautifulSoup

def send_Request(url):
    # Get content, regardless of whether it is an HTML, XML or PDF file
    pageContent = urllib.request.urlopen(url)
    return pageContent

def process_PDF(fileLocation):
    # Use this to get the PDF and convert it to XML
    pdfToProcess = send_Request(fileLocation)
    pdfToObject = scraperwiki.pdftoxml(pdfToProcess.read())
    return pdfToObject

def parse_HTML_tree(contentToParse):
    # Returns a navigable tree, which you can iterate through
    # (naming the parser explicitly avoids a BeautifulSoup warning)
    soup = BeautifulSoup(contentToParse, "html.parser")
    return soup

pdf = process_PDF('http://greenteapress.com/thinkstats/thinkstats.pdf')
pdfToSoup = parse_HTML_tree(pdf)
soupToArray = pdfToSoup.findAll('text')
for line in soupToArray:
    print(line)

This code is going to print a whole, big ugly pile of tags. Each page is separated with a </page>, if that's any consolation.

If you want the content inside the tags, which might include headings wrapped in <b> for example, use line.contents

If you only want each line of text, not including tags, use line.getText()

It's messy, and painful, but this will work for searchable PDF docs. So far I've found this to be accurate, but painful.

answered Nov 14, 2015 at 7:38

JasTonAChair

Here is the solution I found convenient for this problem. The text variable holds the text extracted from the PDF, which you can then search. I have also kept the idea of splitting the text into keywords, which I found at https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f, where I took this solution from. Setting up nltk was not very straightforward, but it might be useful for further purposes:

import PyPDF2
import textract

# nltk needs its tokenizer data and stopwords corpus downloaded
# with nltk.download() before these can be used
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def searchInPDF(filename, key):
    occurrences = 0
    text = ""
    # PdfReader, pages and extract_text() replace the PdfFileReader,
    # getPage() and extractText() names that PyPDF2 3.0 removed
    with open(filename, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfReader(pdfFileObj)
        for pageObj in pdfReader.pages:
            text += pageObj.extract_text()
    if text == "":
        # No text layer found: fall back to OCR.
        # textract returns bytes, so decode before tokenizing.
        text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
    tokens = word_tokenize(text)
    punctuation = ['(', ')', ';', ':', '[', ']', ',']
    stop_words = stopwords.words('english')
    keywords = [word for word in tokens if word not in stop_words and word not in punctuation]
    for k in keywords:
        if key == k:
            occurrences += 1
    return occurrences

pdf_filename = '/home/florin/Downloads/python.pdf'
search_for = 'string'
print(searchInPDF(pdf_filename, search_for))

answered Dec 1, 2017 at 12:12

I agree with @Paulo: PDF data mining is a huge pain. But you might have success with pdftotext, which is part of the Xpdf suite, freely available here:

http://www.foolabs.com/xpdf/download.html

This should be sufficient for your purpose if you are just looking for single keywords.

pdftotext is a command line utility, but very straightforward to use. It will give you text files, which you may find easier to work with.

answered Jun 14, 2013 at 1:02

qwwqwwq

If you are working in a shell such as bash, there is a nice tool called pdfgrep. Since it is in the apt repository, you can install it with:

sudo apt install pdfgrep

It has served my needs well.

answered Jul 13, 2020 at 12:31

Appaji Chintimi

Trying to pick through PDFs for keywords is not an easy thing to do. I tried to use the pdfminer library with very limited success. It’s basically because PDFs are pandemonium incarnate when it comes to structure. Everything in a PDF can stand on its own or be a part of a horizontal or vertical section, backwards or forwards. Pdfminer was having issues translating one page, not recognizing the font, so I tried another direction — optical character recognition of the document. That worked out almost perfectly.

Wand converts all the separate pages in the PDF into image blobs, then you run OCR over the image blobs. What I have as a BytesIO object is the content of the PDF file from the web request. BytesIO is a streaming object that simulates a file load as if the object was coming off of disk, which wand requires as the file parameter. This allows you to just take the data in memory instead of having to save the file to disk first and then load it.
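
If BytesIO is new to you, this tiny sketch shows the file-like behavior described above. The byte string is a made-up stand-in for downloaded content, not a valid PDF.

```python
from io import BytesIO

# Placeholder bytes standing in for the body of a web response.
pdf_bytes = b"%PDF-1.4 stand-in for real downloaded content"

buffer = BytesIO(pdf_bytes)

header = buffer.read(5)  # read the first five bytes, as with a file opened in 'rb' mode
buffer.seek(0)           # rewind to the start, again just like a file
whole = buffer.read()    # read everything from the current position

print(header)
print(len(whole) == len(pdf_bytes))
```

Anything that accepts an open binary file can usually take such a buffer instead, which is why nothing has to be written to disk first.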

Here’s a very basic code block that should be able to get you going. I can envision various functions that would loop through different URL / files, different keyword searches for each file, and different actions to take, possibly even per keyword and file.

# http://docs.wand-py.org/en/0.5.9/
# http://www.imagemagick.org/script/formats.php
# brew install freetype imagemagick
# brew install tesseract
# pip3 install pillow
# pip3 install wand
# pip3 install pyocr
# pip3 install requests
import pyocr
import pyocr.builders
import requests
from io import BytesIO
from PIL import Image as PI
from wand.image import Image

if __name__ == '__main__':
    pdf_url = 'https://www.vbgov.com/government/departments/city-clerk/city-council/Documents/CurrentBriefAgenda.pdf'
    req = requests.get(pdf_url)
    # Content-Type may carry extra parameters, and Last-Modified is optional
    content_type = req.headers.get('Content-Type', '')
    modified_date = req.headers.get('Last-Modified')
    content_buffer = BytesIO(req.content)
    search_text = 'tourism investment program'

    if content_type.startswith('application/pdf'):
        tool = pyocr.get_available_tools()[0]
        # list.index() raises ValueError when 'eng' is missing, so test membership instead
        lang = 'eng' if 'eng' in tool.get_available_languages() else None
        image_pdf = Image(file=content_buffer, format='pdf', resolution=600)
        image_jpeg = image_pdf.convert('jpeg')

        # run OCR over each page image and look for the search text
        for img in image_jpeg.sequence:
            img_page = Image(image=img)
            txt = tool.image_to_string(
                PI.open(BytesIO(img_page.make_blob('jpeg'))),
                lang=lang,
                builder=pyocr.builders.TextBuilder()
            )
            if search_text in txt.lower():
                print('Alert! {} {} {}'.format(search_text, txt.lower().find(search_text),
                                               modified_date))

    req.close()

answered May 10, 2020 at 15:10

This answer follows @Emma Yu's.

Use this if you want to print every match of a string pattern on each page (note that Emma's code prints at most one match per page):

import re

import PyPDF2

pattern = input("Enter string pattern to search: ")
fileName = input("Enter file path and name: ")

# PdfReader, pages and extract_text() replace the PdfFileReader,
# getNumPages()/getPage() and extractText() names that PyPDF2 3.0 removed
reader = PyPDF2.PdfReader(fileName)

for i, pageObj in enumerate(reader.pages):
    text = pageObj.extract_text()

    # finditer() yields every match on the page, not just the first one
    for match in re.finditer(pattern, text):
        print(f'Page no: {i} | Match: {match}')

answered Nov 29, 2020 at 13:53

A version using PyMuPDF. I find it to be more robust than PyPDF2.

import fitz  # PyMuPDF
import re

# load document
filename = "test.pdf"  # placeholder path; the original snippet left filename undefined
doc = fitz.open(filename)

# define keyterms
String = "hours"

# get text, search for string and print count on page.
for page in doc:
    text = page.get_text()  # named getText() in older PyMuPDF releases
    print(f'count on page {page.number + 1} is: {len(re.findall(String, text))}')

answered Nov 12, 2021 at 18:45

Cam

Example with pdfminer.six:

from pdfminer import high_level

with open('file.pdf', 'rb') as f:
    text = high_level.extract_text(f)

print(text)

Compared to PyPDF2, it can work with Cyrillic.

answered Dec 28, 2021 at 0:05

Rugnar

How do you search for text in a PDF using Python?

To get started using PyPDF2 with Python, first install it using pip:

pip3 install PyPDF2 ...
reader = PyPDF2.PdfFileReader(file) ...
page = reader.getPage(PAGE_NUMBER) ...
page_content = page.extractText() ...
print(page_content) ...
if search_term in page_content: ...
for page_number in range(0, reader. ...
page = reader.getPage(page_number)

Can you search a PDF for text?

When a PDF is opened in the Acrobat Reader (not in a browser), the search window pane may or may not be displayed. To display the search/find window pane, use "Ctrl+F".

How do I extract specific text from a PDF in Python?

Step 1: Import all libraries. Step 2: Convert the PDF file to txt format and read the data. Step 3: Use the ".findall()" function of regular expressions to extract keywords.
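
A minimal sketch of step 3, applied to the document-type problem from the question. It assumes the text has already been extracted in step 2, and the keyword lists are invented examples rather than a vetted vocabulary.

```python
import re

# Hypothetical keyword lists; tune these against your own documents.
DOCUMENT_KEYWORDS = {
    "subpoena": ["subpoena", "commanded to appear"],
    "pleading": ["plaintiff", "defendant", "complaint"],
    "correspondence": ["dear", "sincerely", "regards"],
}


def count_keywords(text, keywords):
    """Count whole-word, case-insensitive occurrences of any keyword."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def guess_document_type(text):
    """Return the type whose keywords occur most often, or None if none occur."""
    scores = {
        doc_type: count_keywords(text, keywords)
        for doc_type, keywords in DOCUMENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_document_type("The plaintiff filed a complaint against the defendant."))
```

re.escape() keeps any punctuation in a keyword from being read as regex syntax, and the \b anchors stop "dear" from matching inside a longer word such as "endearing".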

How do I search for text in a PDF image?

Once you use the Recognize Text tool to convert your scanned image into a usable PDF file, you can select and search through the text in that file, making it easy to find, modify, and reuse the information from your old paper documents. Select the Find text tool and enter text to search in the Find field.
