> you could do this in JSON or some other data structure
I'm not sure you could. For example, in another comment, I mentioned DocBook[1]. How would you do the following sample document in JSON?
<?xml version="1.0" encoding="UTF-8"?>
<book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Very simple book</title>
<chapter xml:id="chapter_1">
<title>Chapter 1</title>
<para>Hello world!</para>
<img src="hello.jpg"/>
<para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
</chapter>
<chapter xml:id="chapter_2">
<title>Chapter 2</title>
<para>Hello again, world!</para>
</chapter>
</book>
Would you make each <chapter> into an object? But you have 2 <para> children in there with an <img> in between. And one <para> has an additional <emphasis> in the content. I can't think of a good JSON schema equivalent to this.
[
"this text needs more",
{"type": "emphasis",
"children": ["emotion"]},
"!"
]
hellish to write by hand but probably okay for a program to consume (modulo all the XML libs/tooling you can't use). and you could probably even write some kind of schema for it.
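For what it's worth, here is a rough sketch of how a program could produce that kind of array-of-nodes JSON from the XML. It uses Python's standard `xml.etree.ElementTree`; the `to_node` helper and the `type`/`attrs`/`children` key names are made up for illustration, and the sample is trimmed to one chapter with the namespace dropped.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the sample; the namespace and xml:id are left out for brevity.
DOC = """<chapter>
<title>Chapter 1</title>
<para>Hello world!</para>
<img src="hello.jpg"/>
<para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
</chapter>"""

def to_node(elem):
    """Hypothetical helper: turn an element into a dict with ordered mixed content."""
    node = {"type": elem.tag, "children": []}
    if elem.attrib:
        node["attrs"] = dict(elem.attrib)
    # Text before the first child element.
    if elem.text and elem.text.strip():
        node["children"].append(elem.text)
    for child in elem:
        node["children"].append(to_node(child))
        # Text that follows this child inside the same parent.
        if child.tail and child.tail.strip():
            node["children"].append(child.tail)
    return node

chapter = to_node(ET.fromstring(DOC))
print(json.dumps(chapter, indent=2))
```

Whitespace-only text between elements is dropped here, which is fine for this sample but would be wrong for whitespace-sensitive content.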
if i actually had to represent that data, i'd also move some child nodes into attributes, e.g. make all nodes with 'type': 'book' also have a 'title' attribute, like you would if you had an AST datatype
which is probably the only way to properly deal with markup, and especially with commented sections that can span paragraph starts/ends - neither JSON nor XML seems to have a proper answer for such annotations, and I wonder if there's any standard format that can do that, especially if humans still want to be able to reasonably view or edit it...
(OOXML and its binary equivalents more or less solve this by completely separating paragraph and character formatting, both separately indexing the spans of text they annotate)
That is what essentially every WYSIWYG text processor does. It is also the reason why getting sane HTML out of a text processor is somewhat non-trivial, as the separately indexed spans can very well overlap, contradict each other or contain completely unnecessary formatting information.
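A minimal sketch of the separately indexed spans idea, in Python. This is not how OOXML actually stores anything; the `Span` class, the `kind` labels and the helper functions are all invented for illustration. The point is only that a comment can cross a paragraph boundary when annotations are character ranges over plain text rather than nested elements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into TEXT
    end: int    # exclusive character offset into TEXT
    kind: str   # hypothetical label, e.g. "para", "emphasis", "comment"

TEXT = "Hello world! I hope that your day is proceeding splendidly!"

# Paragraph-level spans and character-level spans are indexed separately.
paras = [Span(0, 12, "para"), Span(13, len(TEXT), "para")]
emph_start = TEXT.index("splendidly")
chars = [
    Span(emph_start, emph_start + len("splendidly"), "emphasis"),
    # A comment that starts in the first paragraph and ends in the second.
    Span(TEXT.index("world"), TEXT.index("hope") + len("hope"), "comment"),
]

def kinds_at(offset, spans):
    """Kinds of all spans covering the given character offset."""
    return [s.kind for s in spans if s.start <= offset < s.end]

def crosses_boundary(span, containers):
    """True if the span overlaps more than one container span."""
    return sum(1 for c in containers if span.start < c.end and c.start < span.end) > 1

comment = chars[1]
print(repr(TEXT[comment.start:comment.end]))
print(crosses_boundary(comment, paras))
```

A span for which `crosses_boundary` is true has no single parent element, so it cannot be written as one properly nested tag without being split. That is the part that makes HTML export awkward.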
But as pointed out in the article, JSON isn't necessarily going to guarantee the correct order of your nested bits. Your code is going to have to worry about that. And it will quickly become unmanageably complex. When you are for instance creating a marked up transcript of some archival material, there's a lot of human editing involved. Have a look at the TEI documentation to see how messy it can get.
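To make the ordering worry concrete, here is a small Python sketch, assuming the "obvious" mapping where each element name becomes an object key. The `group_by_name` helper and the `type`/`value` keys are hypothetical. JSON arrays are ordered but object members are not, so grouping children by name throws away where the `<img>` sat relative to the `<para>` elements.

```python
def group_by_name(children):
    """Naive mapping: one key per element name, repeats collected into a list."""
    grouped = {}
    for child in children:
        grouped.setdefault(child["type"], []).append(child["value"])
    return grouped

# Two different documents: <img> between the paragraphs vs. after them.
img_between = [
    {"type": "para", "value": "Hello world!"},
    {"type": "img", "value": "hello.jpg"},
    {"type": "para", "value": "I hope that your day is proceeding splendidly!"},
]
img_last = [img_between[0], img_between[2], img_between[1]]

# As ordered arrays the two documents differ...
print(img_between != img_last)
# ...but grouped by element name they become indistinguishable.
print(group_by_name(img_between) == group_by_name(img_last))
```

Keeping children in an array of typed nodes avoids this, at the cost of the verbosity discussed earlier in the thread.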
Certainly. I wasn't suggesting that JSON representation I put up there was actually a good idea, just that it's theoretically possible to represent that document as JSON.
[1] https://en.wikipedia.org/wiki/DocBook#Sample_document