Excalibur (1) is also an alternative. It’s great! The installation process was l...

1024core · on Feb 27, 2022

Why do these packages insist on involving databases, web servers, etc.?

Just give me a CLI package that takes a PDF and gives me a text file as output.

pjscott · on Feb 27, 2022

You might be interested in the library underneath, called Camelot:

https://camelot-py.readthedocs.io/en/master/

It's usable from Python or via a CLI.

ur-whale · on Feb 27, 2022

>Why do these packages insist on involving databases, web servers, etc.?

Wholeheartedly agree. Give this a try:

    pdftotext -layout somePDF.pdf -
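If you'd rather call that from a script, here's a rough sketch of a wrapper. It assumes the `pdftotext` binary (from poppler-utils or xpdf) is on your PATH; the function names `pdftotext_command` and `pdf_to_text` are made up for illustration.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the argv list for the pdftotext call shown above."""
    cmd = ["pdftotext"]
    if layout:
        # -layout tries to keep the physical layout of the page
        cmd.append("-layout")
    # A trailing "-" sends the extracted text to stdout instead of a file
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a string.

    Assumes pdftotext is installed; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `print(pdf_to_text("somePDF.pdf"))` should then give the same output as the command line above.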

1024core · on Feb 28, 2022

Thanks! That's what I've been using, but some tables give pdftotext problems :-(

odiroot · on Feb 27, 2022

I've been using Excalibur/Camelot in production. It has been great (considering how non-standard PDF tables are).

You just cannot approach it in a fire-and-forget way. It has two modes of operation and various PDF "styles" can respond differently to each mode.

If you have a series of similarly structured PDFs, try importing a few manually (e.g. in IPython), and note which mode worked better and which adjustments (such as detection thresholds) helped. Then you can pretty much automate the rest with those collected parameters.
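A rough sketch of that comparison, assuming Camelot is installed and that the two modes are its `lattice` and `stream` flavors. The helper names (`default_reader`, `score_flavors`, `best_flavor`) are hypothetical, and the accuracy figure from Camelot's parsing report is only a rough signal, so you still want to eyeball the resulting tables.

```python
from statistics import mean


def default_reader(pdf_path, flavor, pages):
    # Imported lazily so the scoring helpers work without Camelot installed
    import camelot

    return camelot.read_pdf(pdf_path, flavor=flavor, pages=pages)


def score_flavors(pdf_path, pages="1", reader=default_reader):
    """Parse the same pages with both flavors and average the reported accuracy."""
    scores = {}
    for flavor in ("lattice", "stream"):
        tables = reader(pdf_path, flavor, pages)
        # Each Camelot table carries a parsing_report dict with an "accuracy" entry
        accuracies = [t.parsing_report["accuracy"] for t in tables]
        # No tables detected counts as a score of 0.0
        scores[flavor] = mean(accuracies) if accuracies else 0.0
    return scores


def best_flavor(scores):
    """Return the flavor with the highest average accuracy."""
    return max(scores, key=scores.get)
```

In IPython you might run `scores = score_flavors("sample.pdf")` on one representative file, look at the tables yourself, and then hard-code the winning flavor (plus any thresholds you tuned) for the rest of the batch.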