Extracting text from PDFs using Python and pdftotext

The answer was reasonably simple but it was very gruelling to obtain ;-). Firstly, the false leads:

1) Prescript proved to be an out-of-date, unsupported waste of time.

2) Ghostscript has never had much emphasis on user-friendliness or documentation. I was hoping to use its pdf2ascii functionality. I can’t remember precisely what happened, but I think it only generated error messages for me.

3) pyPdf looks promising (the text extraction functionality is still quite recent), but it didn’t return the text in the correct order. I should probably revisit it later:

import pyPdf  # http://pybrary.net/pyPdf/


def getPDFContent(path):
    """Return the text of every page in the PDF, with a newline after each page."""
    content = ""
    # Open in binary mode; the with block closes the file afterwards
    with open(path, "rb") as pdf_file:
        # Load PDF into pyPdf
        pdf = pyPdf.PdfFileReader(pdf_file)
        # Iterate over the pages
        for i in range(pdf.getNumPages()):
            # Extract text from the page and add it to content
            content += pdf.getPage(i).extractText() + "\n"
    # Optionally collapse whitespace
    #content = " ".join(content.replace("\xa0", " ").strip().split())
    return content

print(getPDFContent("pdfs/test.pdf"))

---

But I repeat – watch this option for the future. The developer is right onto it, as can be seen from the comment for the extractText method from the pdf.py module:

# Locate all text drawing commands, in the order they are provided in the
# content stream, and extract the text. This works well for some PDF
# files, but poorly for others, depending on the generator used. This will
# be refined in the future. Do not rely on the order of text coming out of
# this function, as it will change if this function is made more
# sophisticated.
#
# Stability: Added in v1.7, will exist for all future v1.x releases. May
# be overhauled to provide more ordered text in the future.
# @return a string object

http://pybrary.net/pyPdf/
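
To make the ordering problem concrete, here is a toy sketch. It does not use pyPdf at all: the (x, y, text) fragments and both function names are made up for illustration, and it assumes each piece of text comes with a page position, which the comment quoted above suggests extractText does not take into account.

```python
# Toy illustration only: hypothetical (x, y, text) fragments, not pyPdf output.
# PDF coordinates start at the bottom-left corner, so a larger y is higher
# up the page.

def stream_order_text(fragments):
    """Join fragments in the order they appear in the content stream."""
    return " ".join(text for _x, _y, text in fragments)


def reading_order_text(fragments):
    """Join fragments top-to-bottom, then left-to-right."""
    ordered = sorted(fragments, key=lambda frag: (-frag[1], frag[0]))
    return " ".join(text for _x, _y, text in ordered)


# A PDF generator is free to draw the footer first and the heading last.
fragments = [
    (72, 40, "Page 1"),
    (72, 700, "Introduction"),
    (72, 720, "Annual Report"),
]

print(stream_order_text(fragments))   # Page 1 Introduction Annual Report
print(reading_order_text(fragments))  # Annual Report Introduction Page 1
```

Real pages are messier than this (columns, rotated text, kerning), so treat it as a picture of the problem rather than a fix.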

4) pdftotext – bingo

Install pdftotext (a breeze in Ubuntu via Synaptic). In Windows, refer to Jeff Porter’s brilliant, user-friendly documentation at www.ire.org/training/nettour/pdf/PDFTOTEXT.pdf for step-by-step instructions.

http://www.foolabs.com/xpdf/download.html
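
Once it is installed, a quick way to confirm that Python can see the executable is to look for it on the PATH. This is only a sketch: it assumes Python 3.3 or later (for shutil.which), and find_pdftotext is a helper name I made up. If pdftotext lives in a folder that is not on the PATH, it returns None and you would need to pass the full path yourself.

```python
import shutil


def find_pdftotext(name="pdftotext"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_pdftotext()
if path is None:
    print("pdftotext was not found on PATH")
else:
    print("pdftotext found at " + path)
```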

pdftotext is part of XPDF – “Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes also called ‘Acrobat’ files, from the name of Adobe’s PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.

Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X components (pdftops, pdftotext, etc.) also run on Win32 systems and should run on pretty much any system with a decent C++ compiler.” http://www.foolabs.com/xpdf/about.html

Xpdf is licensed under the GPL, version 2.

The python code is barely there but you can see the possibilities:

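import os
# -layout keeps the original physical layout; the "…" path segments are elided
os.system("C:\\ … xpdf\\pdftotext -layout C:\\ … xpdf\\test.pdf")
# Wait for Enter so the console window stays open (raw_input is Python 2)
raw_input("Finished")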

The text came out in the correct order thanks to the -layout option. With no output file named, pdftotext writes the result to test.txt alongside the PDF.
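
If you would rather get the text back into Python than have it land in a file, the os.system call above could be reworked along these lines. This is a sketch under a few assumptions: Python 3, pdftotext reachable as plain "pdftotext" (pass a full path otherwise), and two helper names of my own invention. It relies on pdftotext accepting "-" as the output file to mean stdout, and on the -enc option to request UTF-8 output.

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=True, executable="pdftotext"):
    """Build the argument list; "-" as the output file sends text to stdout."""
    command = [executable]
    if layout:
        # Keep the physical layout of the page, as in the os.system example
        command.append("-layout")
    # Ask for UTF-8 so the bytes can be decoded predictably below
    command.extend(["-enc", "UTF-8", pdf_path, "-"])
    return command


def pdf_to_text(pdf_path, layout=True, executable="pdftotext"):
    """Run pdftotext and return the extracted text as a str.

    Raises subprocess.CalledProcessError if pdftotext exits with an error.
    """
    command = build_pdftotext_command(pdf_path, layout, executable)
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


print(build_pdftotext_command("test.pdf"))
# ['pdftotext', '-layout', '-enc', 'UTF-8', 'test.pdf', '-']
```

Calling pdf_to_text("test.pdf") would then return the text directly. Passing the arguments as a list also avoids quoting headaches when a path contains spaces, which is common on Windows.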