Extracting Text from PDFs

Dumping Everything into a Text File for Now

On OS X, install Ghostscript with Homebrew:

brew install gs

Use the `txtwrite` device, and save the result to an output file with the `-o` switch:

gs -sDEVICE=txtwrite -o  

Then, watch it run beautifully.


On malformed PDFs, this one-liner sometimes crashes instead of “running beautifully”, leaving a 0-byte output file (the file is non-empty while `gs` is still running, but empty once it exits).
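
Until then, one way to at least notice the failures is to wrap the one-liner in a small shell function that checks both the exit status and the size of the output. This is only a rough sketch: `extract_text` and `GS_BIN` are names I made up for illustration, and the distinct return codes are my own convention, not anything Ghostscript defines.

```shell
# Hypothetical wrapper around the txtwrite one-liner.
# GS_BIN lets you point at a different gs binary; it defaults to "gs".
GS_BIN="${GS_BIN:-gs}"

extract_text() {
    pdf="$1"   # input PDF
    txt="$2"   # output text file

    # -q silences the startup banner; -o names the output file
    # and also implies -dBATCH -dNOPAUSE.
    if ! "$GS_BIN" -q -sDEVICE=txtwrite -o "$txt" "$pdf"; then
        echo "gs exited with an error on $pdf" >&2
        return 1
    fi

    # [ -s FILE ] is true only if FILE exists and is not empty.
    if [ ! -s "$txt" ]; then
        echo "gs produced no text for $pdf" >&2
        return 2
    fi
}
```

Calling it as `extract_text input.pdf output.txt` (placeholder file names) returns 0 when text was written, 1 when `gs` itself failed, and 2 when `gs` finished but the output is missing or empty.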

Hence, I’ll look into some more sophisticated scraping methods using PDF.js.