
Excalibur (1) is also an alternative. It's great! The installation process was lackluster, though: I hit multiple dependency issues on an M1 Mac (macOS), Ubuntu, and WSL. YMMV.

1) https://excalibur-py.readthedocs.io



Why do these packages insist on involving databases, web servers, etc.?

Just give me a CLI package that takes a PDF and gives me a text file as output.


You might be interested in the library underneath, called Camelot:

https://camelot-py.readthedocs.io/en/master/

It's usable from Python or via a CLI.


>Why do these packages insist on involving databases, web servers, etc.?

Wholeheartedly agree. Give this a try:

    pdftotext -layout somePDF.pdf -
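
If you want to call that from a script, here is a minimal Python sketch that wraps the same command. The function names are my own invention, and it assumes the pdftotext binary (it ships with poppler-utils) is on your PATH.

```python
import subprocess
from typing import List


def pdftotext_cmd(pdf_path: str, layout: bool = True) -> List[str]:
    """Build the pdftotext command line; the trailing '-' sends text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path: str, layout: bool = True) -> str:
    """Run pdftotext and return the extracted text (raises on a non-zero exit)."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, layout),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling pdf_to_text("somePDF.pdf") should give the same text as the command above.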


Thanks! That's what I've been using, but some tables give pdftotext problems :-(


I've been using Excalibur/Camelot in production. It has been great (considering how non-standard PDF tables are).

You just cannot approach it in a fire-and-forget way. It has two modes of operation and various PDF "styles" can respond differently to each mode.

If you have a series of similarly structured PDFs, try importing a few of them manually (e.g. in IPython) and take note of which mode worked better, along with any adjustments you needed (such as detection thresholds). Then you can pretty much automate the rest with those collected parameters.
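
A rough sketch of that comparison step, in case it helps. The two modes are Camelot's "lattice" and "stream" flavors. The helper names (compare_flavors, best_flavor, flavor_kwargs) are hypothetical, and I'm assuming the "accuracy" entry in each table's parsing_report is a good enough first signal. It is only a heuristic, so still eyeball the extracted tables before trusting it.

```python
from statistics import mean
from typing import Any, Callable, Dict, Optional


def mean_accuracy(tables: Any) -> float:
    """Average the 'accuracy' value of each table's parsing_report (0.0 if no tables)."""
    reports = [t.parsing_report for t in tables]
    return mean(r["accuracy"] for r in reports) if reports else 0.0


def compare_flavors(
    pdf_path: str,
    pages: str = "1",
    reader: Optional[Callable[..., Any]] = None,
    flavor_kwargs: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, float]:
    """Parse the same pages with both flavors and return a score per flavor."""
    if reader is None:
        import camelot  # imported lazily; requires camelot-py to be installed

        reader = camelot.read_pdf
    flavor_kwargs = flavor_kwargs or {}
    scores = {}
    for flavor in ("lattice", "stream"):
        # Per-flavor tuning (thresholds etc.) is passed through untouched.
        extra = flavor_kwargs.get(flavor, {})
        tables = reader(pdf_path, pages=pages, flavor=flavor, **extra)
        scores[flavor] = mean_accuracy(tables)
    return scores


def best_flavor(scores: Dict[str, float]) -> str:
    """Return the flavor with the highest score."""
    return max(scores, key=scores.get)
```

Run compare_flavors on one representative PDF per "style", save the winning flavor and any tuning arguments, and reuse them for the rest of the batch.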



