
This looks nice. What I'd really like to see along these lines is a Python library for automated document metadata extraction with confidence assessment, something like this:

./autometa.py --author --verbose academic-paper.pdf

Author: "Edward Witten" Confidence: High (matches template "amslatex")
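To make the idea concrete, here is a minimal sketch of template-based author guessing with a confidence label. Everything in it is an assumption for illustration: the `Guess` class, the `guess_author` function, the two templates and their confidence levels are hypothetical and are not part of textract or any existing tool. It also assumes the document text has already been extracted to a string.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Guess:
    value: Optional[str]   # extracted author, or None if nothing matched
    confidence: str        # "High", "Medium", or "None"
    reason: str            # which template (if any) produced the guess


# Hypothetical templates, tried in order: (name, pattern, confidence).
# A LaTeX \author{...} command is treated as a strong signal,
# a bare "By Firstname Lastname" line as a weaker one.
TEMPLATES = [
    ("latex-author", re.compile(r"\\author\{([^}]+)\}"), "High"),
    (
        "by-line",
        re.compile(
            r"^[ \t]*[Bb]y[ \t]+([A-Z][\w.\-]*(?:[ \t]+[A-Z][\w.\-]*)+)[ \t]*$",
            re.MULTILINE,
        ),
        "Medium",
    ),
]


def guess_author(text: str) -> Guess:
    """Return the first template match along with its confidence label."""
    for name, pattern, confidence in TEMPLATES:
        match = pattern.search(text)
        if match:
            return Guess(match.group(1).strip(), confidence,
                         f'matches template "{name}"')
    return Guess(None, "None", "no template matched")


if __name__ == "__main__":
    g = guess_author("\\title{Some Paper}\n\\author{Edward Witten}\n")
    print(f'Author: "{g.value}" Confidence: {g.confidence} ({g.reason})')
```

A real tool would need many more templates and a less crude notion of confidence, but the shape (ordered templates, each carrying its own confidence) is probably the simplest place to start.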



I thought about the metadata thing but decided to exclude it from the earliest versions of textract to keep things simple. If you'd like to see it in there and have a good example of how you'd use metadata, please feel free to open an issue on the issue tracker: https://github.com/deanmalmgren/textract/issues/


As far as I have been able to tell, the public state of the art in academic paper metadata parsing is Grobid: https://github.com/kermitt2/grobid

Not quite as simple a command-line interface as you suggest, but it's not too hard to set up, and it's pretty impressive. Now if only Google Scholar would open-source whatever they use...


For video files, guessit does something similar using only the file name:

http://guessit.readthedocs.org/



