A quick article on how iText began looking at GraalVM’s Native Image technology and our continuing development in this area. All because of a random meeting in SF.
Docling is a great project, happy to see more people building in the space.
Marker output will be higher quality than docling output across most doc types, especially with the --use_llm flag. A few specific things we do differently:
- We have a hybrid mode with Gemini that merges tables across pages, improves quality on forms, etc.
- We run an ordering model, so ordering is better for docs where the PDF order is bad
- OCR is a lot better; we train our own model, surya - https://github.com/VikParuchuri/surya
- We handle references and links
- Better equation conversion (soon including inline)
pdfHTML from iText does it without using any external engine (you can create PDF/UA-1, PDF/UA-2, and PDF/A documents with it). You can even add your own custom processing.
Commits that made it happen:
https://github.com/itext/itext-java/commit/71451319ebb9463d2...
https://github.com/itext/itext-java/commit/b50c34e14f012993f...
https://github.com/itext/itext-java/commit/b6b212971e285a4a8...
https://github.com/itext/itext-java/commit/0626cd422a275ac40...