{"id":14,"date":"2007-07-19T16:46:02","date_gmt":"2007-07-19T04:46:02","guid":{"rendered":"http:\/\/p-s.co.nz\/wordpress\/?p=14"},"modified":"2007-07-19T16:58:43","modified_gmt":"2007-07-19T04:58:43","slug":"extracting-text-from-pdfs-using-python-and-pdftotext","status":"publish","type":"post","link":"http:\/\/p-s.co.nz\/wordpress\/extracting-text-from-pdfs-using-python-and-pdftotext\/","title":{"rendered":"Extracting text from PDFs using python and pdftotext"},"content":{"rendered":"<p>The answer was reasonably simple but it was very gruelling to obtain ;-).  Firstly, the false leads:<\/p>\n<p>1) Prescript proved to be an out-of-date, unsupported waste of time.<\/p>\n<p>2) Ghostscript has never had much emphasis on user-friendliness or documentation.  Was hoping to use its pdf2ascii functionality.  Can&#8217;t remember precisely what happened but I think it only generated error messages for me.<\/p>\n<p>3) pyPdf looks promising (the text extract functionality is still quite recent) but it didn&#8217;t get the text in the correct order &#8211; should probably revisit it later:<\/p>\n<pre>import pyPdf\r\n\"\"\"http:\/\/pybrary.net\/pyPdf\/\"\"\"\r\n\r\ndef getPDFContent(path):\r\n    content = \"\"\r\n    # Load PDF into pyPDF\r\n    pdf = pyPdf.PdfFileReader(file(path, \"rb\"))\r\n    # Iterate pages\r\n    for i in range(0, pdf.getNumPages()):\r\n        # Extract text from page and add to content\r\n        content += pdf.getPage(i).extractText() + \"\\n\"\r\n    # Collapse whitespace\r\n    #content = \" \".join(content.replace(\"\\xa0\", \" \").strip().split())\r\n    return content\r\n\r\nprint getPDFContent(\"pdfs\/test.pdf\")<\/pre>\n<p>&#8212;-<\/p>\n<p>But I repeat &#8211; watch this option for the future.  The developer is right onto it, as can be seen from the comment for the extractText method from the pdf.py module:<\/p>\n<p>    # Locate all text drawing commands, in the order they are provided in the<br \/>\n    # content stream, and extract the text. 
 This works well for some PDF<br \/>\n    # files, but poorly for others, depending on the generator used.  This will<br \/>\n    # be refined in the future.  Do not rely on the order of text coming out of<br \/>\n    # this function, as it will change if this function is made more<br \/>\n    # sophisticated.<br \/>\n    # <\/p>\n<p>\n    # Stability: Added in v1.7, will exist for all future v1.x releases.  May<br \/>\n    # be overhauled to provide more ordered text in the future.<br \/>\n    # @return a string object<\/p>\n<p>http:\/\/pybrary.net\/pyPdf\/<\/p>\n<p>4) pdftotext &#8211; bingo<\/p>\n<p>Install pdftotext (a breeze in Ubuntu via Synaptic). On Windows, refer to the brilliant, user-friendly documentation of Jeff Porter www.ire.org\/training\/nettour\/pdf\/PDFTOTEXT.pdf for step-by-step instructions.<\/p>\n<p>http:\/\/www.foolabs.com\/xpdf\/download.html<\/p>\n<p>pdftotext is part of XPDF &#8211; &#8220;Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes also called &#8216;Acrobat&#8217; files, from the name of Adobe&#8217;s PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.<\/p>\n<p>Xpdf runs under the X Window System on UNIX, VMS, and OS\/2. The non-X components (pdftops, pdftotext, etc.) also run on Win32 systems and should run on pretty much any system with a decent C++ compiler. &#8221; http:\/\/www.foolabs.com\/xpdf\/about.html<\/p>\n<p>XPDF is GPL2<\/p>\n<p>The Python code is barely there but you can see the possibilities:<\/p>\n<p>import os<br \/>\nos.system(&quot;C:\\\\ &#8230; xpdf\\\\pdftotext -layout C:\\\\ &#8230; xpdf\\\\test.pdf&quot;)<br \/>\nraw_input(&quot;Finished&quot;)<\/p>\n<p>The text came out in the correct order thanks to the -layout option.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The answer was reasonably simple but it was very gruelling to obtain ;-). 
Firstly, the false leads: 1) Prescript proved to be an out-of-date, unsupported waste of time. 2) Ghostscript has never had much emphasis on user-friendliness or documentation. Was &hellip; <a href=\"http:\/\/p-s.co.nz\/wordpress\/extracting-text-from-pdfs-using-python-and-pdftotext\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-14","post","type-post","status-publish","format-standard","hentry","category-python"],"_links":{"self":[{"href":"http:\/\/p-s.co.nz\/wordpress\/wp-json\/wp\/v2\/posts\/14"}],"collection":[{"href":"http:\/\/p-s.co.nz\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/p-s.co.nz\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/p-s.co.nz\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/p-s.co.nz\/wordpress\/wp-json\/wp\/v2\/comments?post=14"}],"version-history":[{"count":0,"href":"http:\/\/p-s.co.nz\/wordpress\/wp-json\/wp\/v2\/posts\/14\/revisions"}],"wp:attachment":[{"href":"http:\/\/p-s.co.nz\/wordpress\/wp-json\/wp\/v2\/media?parent=14"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/p-s.co.nz\/wordpress\/wp-json\/wp\/v2\/categories?post=14"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/p-s.co.nz\/wordpress\/wp-json\/wp\/v2\/tags?post=14"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}