Extracting Text from PDFs
On OS X, install Ghostscript with Homebrew:
brew install gs
The Ghostscript documentation describes the device used here: “The txtwrite device will output the text contained in the original document as Unicode.”
Save it to an output file with the `-o` switch:
gs -sDEVICE=txtwrite -o pdf_dump.txt input_file.pdf
Then, watch it run beautifully.
On ill-formatted PDFs, though, this one-liner sometimes crashed instead of “running beautifully”, leaving a 0-byte output file: text appeared in the file while the run was in progress, but the file was empty once it ended.
Hence, I’ll look into some more sophisticated scraping methods using PDF.js.
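In the meantime, the failure mode described above can at least be detected rather than discovered later. The following is a minimal sketch, assuming a POSIX shell with `gs` on the PATH. The function name `extract_text` and its return codes are hypothetical choices for illustration, not part of Ghostscript. It does not repair a bad PDF; it only reports when the run failed or the dump came out empty.

```shell
# Hypothetical helper (not part of Ghostscript): wrap the txtwrite one-liner
# and flag the two failure cases seen on ill-formatted PDFs.
# Usage: extract_text INPUT_PDF OUTPUT_TXT
extract_text() {
  # -q suppresses the startup banner; -o names the output file.
  if ! gs -q -sDEVICE=txtwrite -o "$2" "$1"; then
    echo "gs exited with an error on $1" >&2
    return 1
  fi
  # [ -s FILE ] is true only if FILE exists and is larger than 0 bytes.
  if [ ! -s "$2" ]; then
    echo "gs produced an empty dump for $1" >&2
    return 2
  fi
  return 0
}
```

Calling `extract_text input_file.pdf pdf_dump.txt` behaves like the original one-liner on a good PDF, and returns a non-zero status (1 for a Ghostscript error, 2 for an empty dump) otherwise, so a batch loop can log the problem files and move on.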