\$\begingroup\$
I'm quite new to Python and I want to speed up this function, since it takes a very long time, especially when the input file is several megabytes in size. Also, I couldn't figure out how to use Cython in the for loop. I'm using this function together with other functions to compare files byte by byte. Any recommendations?
# this function returns a file bytes in a list
filename1 = 'doc1.pdf'
def byte_target(filename1):
    f = open(filename1, "rb")
    try:
        b = f.read(1)
        tlist = []
        while True:
            # get file bytes
            t = ' '.join(format(ord(x), 'b') for x in b)
            b = f.read(1)
            if not b:
                break
            # add this byte to the list
            tlist.append(t)
            # print b
    finally:
        f.close()
    return tlist
200_success
asked Jun 4, 2015 at 18:34
\$\endgroup\$
\$\begingroup\$
It's not surprising that this is too slow: you're reading data byte-by-byte. For faster performance you would need to read larger buffers at a time.
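As a rough sketch of what reading larger buffers could look like, the function below pulls the file in 64 KiB chunks and formats each byte afterwards. The name read_bytes_as_bits and the chunk_size default are illustrative choices, not anything from the question, and it assumes the goal is still one binary string per byte.

```python
def read_bytes_as_bits(path, chunk_size=64 * 1024):
    """Return one binary string per byte, reading the file in large chunks."""
    bits = []
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # bytearray yields integers on both Python 2 and Python 3,
            # so no ord() call is needed for each byte.
            bits.extend(format(value, "b") for value in bytearray(chunk))
    return bits
```

Every byte of the file ends up in the list, including the last one, and the number of read calls drops from one per byte to one per chunk.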
If you want to compare files by content, use the filecmp module from the standard library.
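A minimal sketch of that approach, assuming you only need a yes/no answer for two paths. The wrapper name same_content is made up for illustration; filecmp.cmp and its shallow parameter are part of the standard library.

```python
import filecmp


def same_content(path_a, path_b):
    """Return True if both files hold exactly the same bytes."""
    # shallow=False forces a comparison of the contents instead of
    # trusting matching os.stat() signatures (type, size, mtime).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

For example, same_content('doc1.pdf', 'doc2.pdf') would compare the two files without building a list of bytes in memory (doc2.pdf is a placeholder name).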
There are also some glaring problems with this code. For example, instead of opening a file, doing something in a try block and closing the file handle manually, you should use the recommended with statement, which closes the file for you even if an exception is raised:
with open(filename1, "rb") as f:
    b = f.read(1)
    # ...
Finally, the function name and all variable names are very poor, and don't help the readers understand their purpose and what you're trying to do.
answered Jun 4, 2015 at 18:41
janos
\$\endgroup\$
I am trying to create a PDF puller for the Australian Stock Exchange website which will allow me to go through all the 'Announcements' made by companies and search for keywords in the PDFs of those announcements.
So far I am using requests and PyPDF2 to get the PDF file, write it to my drive and then read it. However, I want to skip the step of writing the PDF file to my drive and reading it back, and go straight from getting the PDF file to converting it to a string. What I have so far is:
import requests, PyPDF2
url = '//www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)
my_raw_data = response.content
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)
open_pdf_file = open("my_pdf.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
num_pages = read_pdf.getNumPages()
ann_text = []
for page_num in range(num_pages):
    if read_pdf.isEncrypted:
        read_pdf.decrypt("")
        print(read_pdf.getPage(page_num).extractText())
        page_text = read_pdf.getPage(page_num).extractText().split()
        ann_text.append(page_text)
    else:
        print(read_pdf.getPage(page_num).extractText())
print(ann_text)
This prints a list of strings in the PDF file from the url provided.
Just wondering if I can convert the my_raw_data variable to a readable string?
Thanks so much in advance!
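One possible way to skip the temporary file, sketched under the assumption that the same pre-3.0 PyPDF2 API as above is installed: wrap the downloaded bytes in io.BytesIO, which behaves like a file opened in 'rb' mode, and hand that to PdfFileReader. The raw bytes themselves are a binary PDF container, so decoding them directly would not produce readable text. The function names pdf_bytes_to_stream and extract_words are made up for illustration.

```python
import io


def pdf_bytes_to_stream(raw_bytes):
    """Wrap downloaded bytes in a seekable, in-memory file-like object."""
    return io.BytesIO(raw_bytes)


def extract_words(raw_bytes):
    """Return one list of words per page without writing the PDF to disk."""
    import PyPDF2  # third-party; assumes the pre-3.0 API used in the question

    reader = PyPDF2.PdfFileReader(pdf_bytes_to_stream(raw_bytes))
    if reader.isEncrypted:
        reader.decrypt("")  # try the empty password once, before reading pages
    return [
        reader.getPage(page_num).extractText().split()
        for page_num in range(reader.getNumPages())
    ]
```

With the variables from the question, this would be called as extract_words(my_raw_data).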
One of the most popular open-source OCR engines is Google's Tesseract. It takes images as input and returns machine-encoded text. While going through Tesseract's documentation, I found that it only accepts images as input, so I needed a way to convert my PDF files to images. While searching, I came across 4 Python libraries that can convert PDFs to images. This made me think: why not write an article about these libraries, with installation steps and a code walkthrough? So here it is.
1. Pdf2image
pdf2image is a Python module that wraps pdftoppm and pdftocairo to convert a PDF to PIL Image objects. It supports 2 methods for converting a PDF to images. The first is convert_from_path, which takes the path of the PDF file as input. The second is convert_from_bytes, which accepts bytes as input. The latter suits production code, because the PDF can be read directly as bytes from cloud storage, so it never has to be downloaded to the local file system.
pip install pdf2image
Prerequisites:
- Windows: pdf2image requires the poppler binaries for Windows. After downloading poppler, we need to provide the path of its bin folder, either by adding it to PATH or by passing it as the poppler_path argument.
- Linux: poppler can be installed with conda from the conda-forge channel.
conda install -c conda-forge poppler
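The sketch below shows how convert_from_bytes might be used once poppler is available. The helper names (page_filename, pdf_bytes_to_images, save_pages) and the output naming scheme are assumptions made for this example. convert_from_bytes, with its dpi and poppler_path parameters, comes from pdf2image.

```python
def page_filename(stem, page_number, extension="png"):
    """Build a zero-padded file name such as 'report_page_003.png'."""
    return "{}_page_{:03d}.{}".format(stem, page_number, extension)


def pdf_bytes_to_images(pdf_bytes, dpi=200, poppler_path=None):
    """Convert in-memory PDF bytes to a list of PIL images."""
    from pdf2image import convert_from_bytes  # third-party; needs poppler

    # poppler_path is only needed when the poppler bin folder is not on PATH.
    return convert_from_bytes(pdf_bytes, dpi=dpi, poppler_path=poppler_path)


def save_pages(images, stem):
    """Save each image as a PNG and return the file names that were written."""
    names = []
    for number, image in enumerate(images, start=1):
        name = page_filename(stem, number)
        image.save(name, "PNG")
        names.append(name)
    return names
```

convert_from_path takes a file path in place of the bytes and accepts the same two keyword arguments.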
2. Pypdfium2
pypdfium2 is a Python 3 binding to PDFium, the liberal-licensed PDF rendering library authored by Foxit and maintained by Google.
Installation of pypdfium2 is straightforward and doesn’t require any dependencies.
pip3 install --no-build-isolation -U pypdfium2
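A rough sketch of rendering with pypdfium2, assuming a recent release in which page.render() returns a bitmap with a to_pil() method (older releases used different method names). The helpers dpi_to_scale and render_pdf are hypothetical names for this example, and to_pil() needs Pillow to be installed as well.

```python
POINTS_PER_INCH = 72.0  # one PDF canvas unit is usually 1/72 inch


def dpi_to_scale(dpi):
    """Convert a target DPI into the scale factor expected by render()."""
    return dpi / POINTS_PER_INCH


def render_pdf(path, dpi=144):
    """Render every page of a PDF to a PIL image at roughly the given DPI."""
    import pypdfium2 as pdfium  # third-party

    pdf = pdfium.PdfDocument(path)
    images = []
    for index in range(len(pdf)):
        page = pdf[index]
        bitmap = page.render(scale=dpi_to_scale(dpi))
        images.append(bitmap.to_pil())  # requires Pillow
    return images
```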
3. PyMuPDF or Fitz
PyMuPDF is a Python binding for MuPDF — “a lightweight PDF and XPS viewer”. A PDF file can be converted into a number of image formats using PyMuPDF. The created image can be enlarged or diminished based on the Matrix function. The value of zoom can be configured to achieve the expected size.
pip install PyMuPDF==1.16.14
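The zoom idea can be sketched as follows. The pinned 1.16.14 release uses camelCase names (getPixmap, writePNG), while newer releases renamed them to get_pixmap and save, so this sketch looks up whichever is present. The function names scaled_size and render_page are assumptions made for illustration.

```python
def scaled_size(width, height, zoom):
    """Approximate pixel size of a page rendered with the same zoom on both axes."""
    return int(round(width * zoom)), int(round(height * zoom))


def render_page(pdf_path, page_index, out_path, zoom=2.0):
    """Render one page to a PNG file and return the image size in pixels."""
    import fitz  # third-party; PyMuPDF's import name

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        matrix = fitz.Matrix(zoom, zoom)  # zoom > 1 enlarges, zoom < 1 shrinks
        # Newer releases expose get_pixmap/save; 1.16.14 has getPixmap/writePNG.
        render = getattr(page, "get_pixmap", None) or page.getPixmap
        pixmap = render(matrix=matrix)
        save = getattr(pixmap, "save", None) or pixmap.writePNG
        save(out_path)
        return pixmap.width, pixmap.height
    finally:
        doc.close()
```

A US Letter page of 612 x 792 points rendered with zoom=2.0 should come out at about 1224 x 1584 pixels.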
4. Pdf2jpg
Pdf2jpg is a python library which can be used to convert PDF to images. We need to provide the input and output paths for pdf and images respectively.
pip install pdf2jpg
Want to Connect?
Linked In — Prithivee Ramalingam | LinkedIn