> you could do this in JSON or some other data structure I'm not sure you could....

uryga · on Oct 29, 2019

you could always do an S-expression-esque DSL in JSON ;)

  ['book', {'id': '...'},
    ['title', {}, ...],
    ['chapter', {'id': 0},
      ['title', {}, 'Chapter 1'],
      ...
    ],
    ['chapter', {'id': 1},
      ...
    ],
  ]
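
As a rough sketch of how a program might consume that list form, the Python below turns a `[tag, attrs, *children]` list into XML with the standard library's `xml.etree.ElementTree`. The function name `to_element` and the sample id and titles are made up for illustration; the parts elided in the example above are not filled in.

```python
import xml.etree.ElementTree as ET

def to_element(node):
    """Convert [tag, attrs, *children] into an ElementTree element."""
    tag, attrs, *children = node
    # XML attribute values are strings, so stringify things like {'id': 0}.
    elem = ET.Element(tag, {k: str(v) for k, v in attrs.items()})
    last = None  # most recently appended child element, if any
    for child in children:
        if isinstance(child, str):
            # ElementTree stores text before the first child in .text and
            # text after a child in that child's .tail.
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                last.tail = (last.tail or "") + child
        else:
            last = to_element(child)
            elem.append(last)
    return elem

# Hypothetical sample data, shaped like the example above.
book = ["book", {"id": "simple_book"},
        ["title", {}, "Simple book"],
        ["chapter", {"id": 0},
         ["title", {}, "Chapter 1"]]]

xml_text = ET.tostring(to_element(book), encoding="unicode")
print(xml_text)
```

Because both the children and the list itself are ordered sequences, document order survives the round trip without extra bookkeeping.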

more realistically, you could just represent it with the AST of that XML, i.e.

  {
    'type': 'book',
    'attrs': {'id': ...},
    'children': [
      {
        'type': 'title',
        'children': ['Simple book']
      },
      {
        'type': 'chapter',
        ...
      },
      {
        'type': 'chapter',
        ...
      },
      ...
    ]
  }

so you could do that emphasis bit as

  [
    'this text needs more ',
    {'type': 'emphasis', 
     'children': ['emotion']},
    '!'
  ]

hellish to write by hand but probably okay for a program to consume (modulo all the XML libs/tooling you can't use). and you could probably even write some kind of schema for it.
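
To illustrate the "okay for a program to consume" part, here is a minimal sketch of a recursive renderer for the AST form, where a node is either a plain string or a dict with `type`, optional `attrs`, and optional `children`. The `render` function and the wrapping `para` node are assumptions made for the example, not something from the original document.

```python
from xml.sax.saxutils import escape, quoteattr

def render(node):
    """Render an AST node (a str, or a dict with 'type'/'attrs'/'children') as XML text."""
    if isinstance(node, str):
        return escape(node)  # escape &, < and > in text content
    attrs = "".join(
        f" {name}={quoteattr(str(value))}"
        for name, value in node.get("attrs", {}).items()
    )
    inner = "".join(render(child) for child in node.get("children", []))
    return f"<{node['type']}{attrs}>{inner}</{node['type']}>"

# The mixed-content "emphasis bit", wrapped in a hypothetical 'para' node.
para = {
    "type": "para",
    "children": [
        "this text needs more ",
        {"type": "emphasis", "children": ["emotion"]},
        "!",
    ],
}

print(render(para))
# <para>this text needs more <emphasis>emotion</emphasis>!</para>
```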

if i actually had to represent that data, i'd also move some child nodes into attributes, e.g. make all nodes with 'type': 'book' also have a 'title' attribute, like you would if you had an AST datatype

unilynx · on Oct 29, 2019

See https://developers.google.com/docs/api/samples/output-json for what Google Docs does - basically separating markup from the text by using indices.

which is probably the only way to properly deal with markup, and especially with commented sections that can span paragraph starts and ends - neither JSON nor XML seems to have a proper answer for such annotations, and I wonder if there's any standard format that can do that, especially if humans still want to be able to reasonably view or edit it...

(OOXML and its binary equivalents more or less solve this by completely separating paragraph and character formatting, both separately indexing the spans of text they annotate)
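
A minimal sketch of the index-based idea, assuming half-open character offsets into one flat string. The `Span` type and its field names are invented for illustration and are not the Google Docs or OOXML schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    kind: str   # e.g. "emphasis" or "comment"

text = "this text needs more emotion!"
spans = [
    Span(21, 28, "emphasis"),  # covers "emotion"
    Span(10, 24, "comment"),   # starts earlier and ends mid-word
]

def annotated(text, spans):
    """Return (kind, covered text) for every span."""
    return [(s.kind, text[s.start:s.end]) for s in spans]

def overlaps(a, b):
    """True if two half-open spans share at least one character."""
    return a.start < b.end and b.start < a.end

print(annotated(text, spans))
print(overlaps(spans[0], spans[1]))
```

The two spans overlap without either containing the other, which is exactly the case a strictly nested tree cannot express directly.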

dfox · on Oct 30, 2019

That is what essentially every WYSIWYG text processor does. It is also the reason why getting sane HTML out of a text processor is somewhat non-trivial, as the separately indexed spans can very well overlap, contradict each other or contain completely unnecessary formatting information.
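
A rough sketch of why the HTML comes out messy: one naive way to turn overlapping spans into well-nested tags is to cut the text at every span boundary and wrap each piece separately. The `(start, end, tag)` tuple format and the function names are assumptions for illustration, not how any particular word processor does it.

```python
from html import escape

def segments(text, spans):
    """Split text at every span boundary; return (piece, active tags) pairs.

    spans is a list of (start, end, tag) with end exclusive.
    """
    cuts = sorted({0, len(text), *(p for s, e, _ in spans for p in (s, e))})
    out = []
    for a, b in zip(cuts, cuts[1:]):
        tags = frozenset(t for s, e, t in spans if s <= a and b <= e)
        out.append((text[a:b], tags))
    return out

def to_html(text, spans):
    """Wrap each piece in its own tags, so the result is always well nested."""
    parts = []
    for piece, tags in segments(text, spans):
        ordered = sorted(tags)
        opening = "".join(f"<{t}>" for t in ordered)
        closing = "".join(f"</{t}>" for t in reversed(ordered))
        parts.append(f"{opening}{escape(piece)}{closing}")
    return "".join(parts)

text = "this text needs more emotion!"
spans = [(10, 24, "b"), (21, 28, "i")]  # bold and italic overlap on "emo"
print(to_html(text, spans))
# this text <b>needs more </b><b><i>emo</i></b><i>tion</i>!
```

The output is valid but already redundant (`</b><b>` back to back); merging adjacent runs and dropping no-op formatting is the non-trivial cleanup step.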

bradstewart · on Oct 29, 2019

Potential option:

  {
    "id": "simple_book",
    "title": "Very simple book",
    "chapters": [
      {
        "id": "chapter_1",
        "content": [
          { "type": "title", "value": "Chapter 1" },
          {
            "type": "para",
            "content": [
              { "type": "text", "value": "Hello World!" }
            ]
          },
          { "type": "img", "src": "hello.jpg" },
          {
            "type": "para",
            "content": [
              { "type": "text", "value": "I hope that your day is proceeding " },
              { "type": "emphasis", "value": "splendidly" },
              { "type": "text", "value": "!" }
            ]
          }
        ]
      }
    ]
  }
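
A small sketch of consuming that structure, assuming it is stored as JSON text and loaded with Python's standard `json` module. The helper `para_text` is a made-up name, and the sample is a trimmed copy of the document above. Note that the mixed content lives in arrays, which JSON defines as ordered sequences, so the order of the text runs is preserved on load.

```python
import json

raw = """
{
  "id": "simple_book",
  "title": "Very simple book",
  "chapters": [
    {
      "id": "chapter_1",
      "content": [
        { "type": "title", "value": "Chapter 1" },
        { "type": "para",
          "content": [ { "type": "text", "value": "Hello World!" } ] },
        { "type": "img", "src": "hello.jpg" },
        { "type": "para",
          "content": [
            { "type": "text", "value": "I hope that your day is proceeding " },
            { "type": "emphasis", "value": "splendidly" },
            { "type": "text", "value": "!" }
          ] }
      ]
    }
  ]
}
"""

doc = json.loads(raw)

def para_text(para):
    """Join the inline runs of one 'para' block, in array order, ignoring markup."""
    return "".join(run["value"] for run in para["content"])

paragraphs = [
    para_text(block)
    for chapter in doc["chapters"]
    for block in chapter["content"]
    if block["type"] == "para"
]
print(paragraphs)
# ['Hello World!', 'I hope that your day is proceeding splendidly!']
```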

Finnucane · on Oct 29, 2019

But as pointed out in the article, JSON isn't necessarily going to guarantee the correct order of your nested bits. Your code is going to have to worry about that. And it will quickly become unmanageably complex. When you are for instance creating a marked up transcript of some archival material, there's a lot of human editing involved. Have a look at the TEI documentation to see how messy it can get.

bradstewart · on Oct 29, 2019

Certainly. I wasn't suggesting that the JSON representation I put up there was actually a good idea, just that it's theoretically possible to represent that document as JSON.