Read pdf as bytes python

\$\begingroup\$

I'm fairly new to Python and want to speed up this function, which takes a very long time, especially when the input file is several megabytes in size. I also couldn't figure out how to use Cython in the loop. I use this function together with others to compare files byte by byte. Any recommendations?

# this function returns a file bytes in a list
filename1 = 'doc1.pdf'
def byte_target(filename1):
    f = open(filename1, "rb")
    try:
        b = f.read(1)
        tlist = []
        while True:
            # get file bytes
            t = ' '.join(format(ord(x), 'b') for x in b)
            b = f.read(1)
            if not b:
                break
            #add this byte to the list
            tlist.append(t)

            #print b        

    finally:
        f.close()
    return tlist

200_success

asked Jun 4, 2015 at 18:34

\$\endgroup\$

\$\begingroup\$

It's not surprising that this is too slow: you're reading data byte-by-byte. For faster performance you would need to read larger buffers at a time.
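As a rough sketch of what reading larger buffers could look like in Python 3 (the names read_in_chunks and byte_strings and the 64 KiB chunk size are illustrative assumptions, not part of the original code):

```python
def read_in_chunks(path, chunk_size=64 * 1024):
    """Yield successive blocks of up to chunk_size bytes from the file at path."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            yield chunk


def byte_strings(path):
    """Return the binary representation of every byte in the file as a list of strings."""
    # Iterating over a bytes object in Python 3 yields integers, so ord() is not needed.
    return [format(byte, "b") for chunk in read_in_chunks(path) for byte in chunk]
```

Unlike the loop in the question, which breaks before appending the last value it formatted, this sketch also keeps the final byte of the file.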

If you want to compare files by content, use the filecmp module from the standard library.
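A minimal sketch of such a comparison, assuming two file paths are already known (the helper name same_content is made up for illustration):

```python
import filecmp


def same_content(path_a, path_b):
    """Return True if both files hold exactly the same bytes."""
    # shallow=False makes filecmp compare the file contents instead of
    # relying only on os.stat() signatures (type, size, modification time).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

With the default shallow=True, files whose os.stat() signatures match are reported as equal without their contents being read, so shallow=False is the safer choice for a byte-by-byte check.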

There are also some glaring problems with this code. For example, instead of opening a file, doing something in a try block and closing the file handle manually, you should use the recommended with statement, which closes the file for you:

    with open(filename1, "rb") as f:
        b = f.read(1)
        # ...

Finally, the function name and all variable names are very poor, and don't help the readers understand their purpose and what you're trying to do.

answered Jun 4, 2015 at 18:41

janos

\$\endgroup\$

I am trying to create a pdf puller from the Australian Stock Exchange website which will allow me to search through all the 'Announcements' made by companies and search for key words in the pdfs of those announcements.

So far I am using requests and PyPDF2 to get the PDF file, write it to my drive and then read it. However, I want to be able to skip the step of writing the PDF file to my drive and reading it, and going straight from getting the PDF file to converting it to a string. What I have so far is:

import requests, PyPDF2

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)
my_raw_data = response.content

with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)


open_pdf_file = open("my_pdf.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
num_pages = read_pdf.getNumPages()

ann_text = []
for page_num in range(num_pages):
    if read_pdf.isEncrypted:
        read_pdf.decrypt("")
        print(read_pdf.getPage(page_num).extractText())
        page_text = read_pdf.getPage(page_num).extractText().split()
        ann_text.append(page_text)

    else:
        print(read_pdf.getPage(page_num).extractText())
print(ann_text)

This prints a list of strings in the PDF file from the url provided.

Just wondering if I can convert the my_raw_data variable to a readable string?

Thanks so much in advance!
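One possible way to skip the temporary file, sketched with the standard library only: wrap the downloaded bytes in io.BytesIO, which gives a file-like object that could be handed to PyPDF2.PdfFileReader in place of open_pdf_file. The helper name bytes_to_stream and the sample bytes are illustrative assumptions; in the script above the input would be my_raw_data.

```python
import io


def bytes_to_stream(raw_bytes):
    """Wrap raw bytes in an in-memory, seekable binary stream."""
    return io.BytesIO(raw_bytes)


# Stand-in for response.content (assumption); PDF files begin with a "%PDF-" header.
sample = b"%PDF-1.4\n% placeholder bytes, not a complete PDF\n"
stream = bytes_to_stream(sample)

header = stream.read(5)  # reads like a file opened with "rb"
stream.seek(0)           # rewind so a PDF reader would start from the beginning
```

With PyPDF2 installed, PyPDF2.PdfFileReader(bytes_to_stream(my_raw_data)) would then stand in for the write-and-reopen step. The raw bytes are not readable text by themselves, so the extractText() calls are still needed.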

One of the most popular open-source OCR engines is Google’s Tesseract. It takes images as input and returns machine-encoded text. While going through Tesseract’s documentation, I found that it only accepts images as input, so I needed a way to convert my PDF files to images. While searching, I came across four Python libraries that can convert PDFs to images. This made me think: why not write an article about these libraries with installation and a code walkthrough? So here it is.

1. Pdf2image

Pdf2image is a Python module that wraps pdftoppm and pdftocairo to convert a PDF to PIL Image objects. It supports two methods for converting PDFs to images. The first is convert_from_path, which takes the path of the PDF file as input. The second is convert_from_bytes, which accepts bytes as input. The latter suits production code because the PDF can be read as bytes directly from cloud storage, which avoids having to download the file to your system.

pip install pdf2image

Prerequisites:

  1. Windows — To install pdf2image in Windows we require the poppler binary file for windows. After downloading the poppler file we need to provide the path of the bin folder.
  2. Linux — To install pdf2image in Linux we can use the conda forge command to install poppler.

conda install -c conda-forge poppler
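A hedged sketch of the convert_from_bytes route described above. The wrapper name pdf_bytes_to_images and its convert parameter are assumptions added for illustration; the parameter lets the function be tried with a stand-in converter on machines where pdf2image or poppler is not installed.

```python
def pdf_bytes_to_images(pdf_bytes, convert=None, dpi=200):
    """Render each page of an in-memory PDF and return the resulting images."""
    if convert is None:
        # Imported lazily so this module still loads where pdf2image is missing.
        from pdf2image import convert_from_bytes as convert
    return convert(pdf_bytes, dpi=dpi)


# Example use (requires pdf2image and poppler; "example.pdf" is a hypothetical file):
# with open("example.pdf", "rb") as f:
#     pages = pdf_bytes_to_images(f.read())
# pages[0].save("page_1.png")
```

The bytes could just as well come from a cloud storage client or an HTTP response instead of a local file.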

2. Pypdfium2

pypdfium2 is a Python 3 binding to PDFium, the liberal-licensed PDF rendering library authored by Foxit and maintained by Google.

Installation of pypdfium2 is straightforward and doesn’t require any dependencies.

pip3 install --no-build-isolation -U pypdfium2

3. PyMuPDF or Fitz

PyMuPDF is a Python binding for MuPDF, “a lightweight PDF and XPS viewer”. A PDF file can be converted into a number of image formats using PyMuPDF. The rendered image can be scaled up or down with a Matrix, and its zoom values can be adjusted to reach the desired size.

pip install PyMuPDF==1.16.14

4. Pdf2jpg

Pdf2jpg is a python library which can be used to convert PDF to images. We need to provide the input and output paths for pdf and images respectively.

pip install pdf2jpg

How do I read a PDF byte in Python?

“python convert pdf to bytes” Code Answer:
with open('new.pdf', 'rb') as file:
    pdf_bytes = file.read()  # entire PDF as a bytes object

How do I convert a PDF to bytes?

Follow these steps to convert a byte array to a PDF file:
Load the input file.
Initialize a byte array.
Load the input image into the byte array.
Initialize an instance of the Document class.
Add the image to a PDF page.
Save the output PDF file.

Can Python read a PDF?

You can work with a preexisting PDF in Python by using the PyPDF2 package. PyPDF2 is a pure-Python package that you can use for many different types of PDF operations. By the end of this article, you'll know how to do the following: Extract document information from a PDF in Python.

How do I convert a PDF to text in Python?

Steps to convert PDF to TXT in Python:
Open a new Word document.
Type in some content of your choice.
Go to File > Print > Save.
Save the PDF file in the same location as your Python script.
Your .pdf file is now created and saved; you will later convert it into a ..