Datareign

The pdftotext command is a utility that reads a PDF file and writes its text content to a plain-text file.

Synopsis Display

If you issue the command without arguments, you'll get a helpful synopsis…

 $> pdftotext
 pdftotext version 3.01
 Copyright 1996-2005 Glyph & Cog, LLC
 Usage: pdftotext [options] <PDF-file> [<text-file>]
   -f <int>          : first page to convert
   -l <int>          : last page to convert
   -layout           : maintain original physical layout
   -raw              : keep strings in content stream order
   -htmlmeta         : generate a simple HTML file, including the meta information
   -enc <string>     : output text encoding name
   -eol <string>     : output end-of-line convention (unix, dos, or mac)
   -nopgbrk          : don't insert page breaks between pages
   -opw <string>     : owner password (for encrypted files)
   -upw <string>     : user password (for encrypted files)
   -q                : don't print any messages or errors
   -cfg <string>     : configuration file to use in place of .xpdfrc
   -v                : print copyright and version info
   -h                : print usage information
   -help             : print usage information
   --help            : print usage information
   -?                : print usage information
 $> _
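
As a rough sketch of how the options in the synopsis combine, the helper below converts only a range of pages and keeps the physical layout. The function name extract_pages and the file names are made up for illustration; only the -f, -l and -layout options come from the synopsis above.

```shell
#!/bin/sh
# extract_pages is a hypothetical helper name, not part of pdftotext.
# Usage: extract_pages FIRST LAST INPUT.pdf OUTPUT.txt
extract_pages() {
    # -f / -l select the first and last page to convert;
    # -layout maintains the original physical layout of the text.
    pdftotext -f "$1" -l "$2" -layout "$3" "$4"
}

# Example call (the file names are placeholders):
#   extract_pages 2 5 op.pdf pages-2-5.txt
```

Quoting the arguments keeps file names with spaces intact.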

Piping Output

If you supply a hyphen in place of the output filename, pdftotext writes to stdout, so you can pipe the text to another programme…

$> pdftotext op.pdf -
A PDF DOCUMENT HEADLINE
ANOTHER HEADING
Text on the front page

More text and yet more text and text again. More text and yet more text and text again. More text and
yet more text and text again. More text and yet more text and text again. More text and yet more text
and text again.

More text and yet more text and text again. More text and yet more text and text again. More text
and yet more text and text again. More text and yet more text and text again. More text and yet more
text and text again. More text and yet more text and text again. More text and yet more text and text
again. More text and yet more text and text again. More text and yet more text and text again. More
text and yet more text and text again.
:
:
:
...etc.
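
Once the text is on stdout, any ordinary filter can consume it. The sketch below searches a PDF for a word by piping into grep; pdf_grep is a hypothetical helper name, and the -q option (taken from the synopsis above) is there only to keep pdftotext's own messages out of the way.

```shell
#!/bin/sh
# pdf_grep is a hypothetical helper name, not part of pdftotext.
# Usage: pdf_grep PATTERN INPUT.pdf
pdf_grep() {
    # The trailing hyphen sends the extracted text to stdout,
    # where grep does a case-insensitive search for PATTERN.
    pdftotext -q "$2" - | grep -i -- "$1"
}

# Example call, using the sample file name from above:
#   pdf_grep heading op.pdf
```

The same pattern works with other filters, for instance replacing the grep stage with wc -w for a rough word count.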

Implementations

pdftotext is generally included in current Linux distributions. If your distribution does not include it, you can obtain the complete xpdf package from the Foolabs download page.

For OS X, pdftotext on its own can be obtained from Carsten Blüm's site. Look for the Installer Packages section for the download.
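
Before downloading anything, it may be worth checking whether pdftotext is already on your PATH. This is a minimal sketch; check_pdftotext is a hypothetical helper name, and it relies only on the shell's standard command -v lookup.

```shell
#!/bin/sh
# check_pdftotext is a hypothetical helper name, not part of pdftotext.
check_pdftotext() {
    if command -v pdftotext >/dev/null 2>&1; then
        # command -v prints how the shell resolves the name.
        echo "pdftotext found: $(command -v pdftotext)"
    else
        echo "pdftotext not found; see the download notes above" >&2
        return 1
    fi
}

# Example call:
#   check_pdftotext && pdftotext -v
```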

Last modified: 2009/02/01 10:51