I have been using XML ===XSLT===> LaTeX ===pdflatex/lualatex===> PDF for more than a decade now. The whole pipeline is driven by a batch file that takes all XML files in an input folder, uses a temp folder for the intermediate LaTeX, and writes the PDFs to an output folder.
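As a sketch, the driver could look like the following shell script (the original is a Windows batch file; the folder names follow the description above, while the stylesheet name `to-latex.xsl` is an assumption):

```shell
#!/bin/sh
# Hypothetical pipeline driver: XML -> (XSLT) -> LaTeX -> (lualatex) -> PDF.
# Folder layout as described above; stylesheet name is an assumption.
IN=input
TMP=temp
OUT=output
mkdir -p "$TMP" "$OUT"
for src in "$IN"/*.xml; do
  [ -e "$src" ] || continue                            # skip when input is empty
  base=$(basename "$src" .xml)
  xsltproc -o "$TMP/$base.tex" to-latex.xsl "$src"     # XML -> LaTeX
  lualatex --output-directory="$OUT" "$TMP/$base.tex"  # LaTeX -> PDF
done
```

Running it again after an XSLT change regenerates every PDF in one go, which is exactly the batch-reprocessing step described below.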
I can produce HTML files from the same XML sources directly: XML ===XSLT===> HTML.
For differences between PDF and HTML versions I have some special tags and attributes in my XML sources.
If I want to change something in the layout, I modify the XSLT script and run the old XML sources through the pipeline again in one go.
There was some up-front effort in designing the XML tag system and writing the XSLT scripts, but since my later layout changes were minor, the required tweaks were easy.
A quite simple homegrown DTD. Keeping it as simple as possible keeps the complexity of the XSLT scripts low. Most of it is similar to HTML (<paragraph>, <italics>, <bold>), with a few special attributes to add some semantics or processing hints.[1] Sometimes I use several intermediate steps to produce the final HTML. It can then be useful to version the DTDs with a fixed, REQUIRED version attribute in the root element, which must appear in the XML files, to avoid applying the wrong XSLT script to an outdated version of my XML sources.[2]
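Such a version guard can be enforced inside the stylesheet itself. A minimal sketch (the root element name `document` and the version string `1.2` are assumptions, not my actual DTD):

```xml
<!-- Hypothetical XSLT 1.0 guard: abort the transformation when the
     source's version attribute does not match what this stylesheet expects. -->
<xsl:template match="/document[not(@version = '1.2')]">
  <xsl:message terminate="yes">
    Expected source version 1.2, got: <xsl:value-of select="/document/@version"/>
  </xsl:message>
</xsl:template>
```

With `terminate="yes"`, an outdated source file stops the batch run with an error instead of silently producing wrong output.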
For a customer who needed to import large semi-structured legacy Word documents from another company into a database system, I once implemented the following process: The Word documents were converted to a relatively simple homegrown XML format based on the structural elements of the Word documents. The resulting XML documents were manually corrected where the structural elements were incorrect. Some special attributes were added inside the XML documents to associate text passages with already existing database keys. When this was finished, an XSLT script was applied that split the large XML files into smaller ones based on these database keys; a human-readable prefix, the key, and a date went into each file name. These files were converted in bulk to LaTeX and then to PDF. Afterwards, I used a little tool to bulk-upload only the fresh PDFs into the correct database entries based on the keys in their filenames.
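The split step could be sketched like this in XSLT 2.0 (illustrative only; the element name `section`, the attribute name `dbkey`, and the file-name prefix are assumptions):

```xml
<!-- Hypothetical XSLT 2.0 split: emit one result file per keyed section,
     with prefix, database key, and date in the file name. -->
<xsl:template match="section[@dbkey]">
  <xsl:result-document
      href="report_{@dbkey}_{format-date(current-date(), '[Y0001]-[M01]-[D01]')}.xml">
    <xsl:copy-of select="."/>
  </xsl:result-document>
</xsl:template>
```

Putting the key into the file name is what later lets the upload tool match each PDF back to its database entry without any extra bookkeeping.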
For one of my side projects, a C# application, I use a different, object-oriented approach: an abstract base class for reporting with two derived classes, one that outputs HTML and one that outputs LaTeX. The LaTeX output is then fed into lualatex to produce PDFs. You can check out the free Herodotus edition of my (closed-source) Factonaut project at https://www.factonaut.com/ to see it in action.
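A minimal sketch of that class hierarchy (the class and method names are assumptions, not Factonaut's actual API):

```csharp
using System;

// Hypothetical reporting hierarchy: one abstract writer, two output formats.
abstract class ReportWriter
{
    public abstract void Heading(string text);
    public abstract void Paragraph(string text);
}

class HtmlReportWriter : ReportWriter
{
    public override void Heading(string text) => Console.WriteLine($"<h1>{text}</h1>");
    public override void Paragraph(string text) => Console.WriteLine($"<p>{text}</p>");
}

class LatexReportWriter : ReportWriter
{
    public override void Heading(string text) => Console.WriteLine($"\\section{{{text}}}");
    public override void Paragraph(string text) => Console.WriteLine(text);
}
```

The report-generating code talks only to the abstract base class, so choosing HTML or PDF output comes down to instantiating one concrete class or the other.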
[1] Using parameter entities for re-usability, such as
<!ENTITY % output_attr SYSTEM "output_attr.ent">
<!ATTLIST foo %output_attr; >
<!ATTLIST bar %output_attr; >
in the DTD, referring to an `output_attr.ent` file that contains the shared attribute definition.
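The referenced `output_attr.ent` file could look like this (illustrative only; the attribute name and values are assumptions based on the PDF/HTML split mentioned above):

```dtd
<!-- output_attr.ent (hypothetical): a shared attribute restricting an
     element to one output target; "both" is the assumed default. -->
output (pdf | html | both) "both"
```

Expanding %output_attr; inside each <!ATTLIST ...> then gives every listed element the same output attribute without repeating its definition.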