Re: A grammar challange regarding that pesky YAML format

One of the big regular complaints about XML is that it is "difficult to
parse" and yet YAML does not suffer from that complaint in spite of its
very dodgy parsing model.

This document gives some excellent examples of YAML failures.

https://ruuda.nl/2023/the-yaml-document-from-hell



On Sat, May 16, 2026, 12:15 Fredrik Öhrström <oehrstroem@gmail.com> wrote:

> I have now spent more time than is really healthy on the problem of
> parsing yaml files.
>
> There is now an ixml grammar here that can parse a subset of yaml:
> You can curl it from here: http://libxmq.org/library/conf/yamlsubset.ixml
> Or view it here:
> https://github.com/libxmq/xmq/blob/master/library/conf/yamlsubset.ixml
>
> It can parse xmq's own xmq/.github/workflows/build_ubuntu.yml file.
>
> https://github.com/libxmq/xmq/blob/master/.github/workflows/build_ubuntu.yml
>
> Nice, but this yml file is pretty printed and uses only a very very
> very very small subset of yaml.
>
> We already know parsing requires multiple copies of the same rule for
> each indentation.
>
> But it seems that for full parsing we also need several other set of
> rules for each indentation each containing all possible future
> indentations.
>
> I.e. if python requires a O(C*n) grammar rules (where n is the deepest
> indentation), yaml will require rules that grow (in size, or nr) with
> O(C*n*n).
>
> Why is this? Well, let me give some background.
>
> In yaml you can write a context free compact form similar to json in.
> { alfa: 111, beta: 222 }
> I.e. an unnamed object with two key-values. This translates to json:
> { "alfa": 111, "beta": 222 }
> and if you translate the json to xmq with:
> echo '{ "alfa": 111, "beta": 222 }' | xmq to-xmq --compact
> you get:
>
> _{alfa=111 beta=222}
>
> which is equivalent to this xml:
>
> echo '{ "alfa": 111, "beta": 222 }' | xmq to-xml
>
> <_><alfa>111</alfa><beta>222</beta></_>
>
> Here I use the xmq way of transforming json into xml.
>
> Anyway, if YAML only was this, it would be a piece of cake (context free
> cake).
> Alas this is what it looks like normally:
>
> alfa: 111
> beta: 222
>
> (ie no indentation, an implicit root object is created, _ i xml.)
> Ok, then how to do an inner object, ie json: {"foo: { "alfa": 111, "beta":
> 222}}
>
> foo:
>   alfa: 111
>   beta: 222
>
> A key value in YAML is something colon space something, ie colon
> followed by space is a reserved keyword. The rule val in the grammar
> will not parse text containing a colon space.  (This is the answer to
> my first question in this email chain. In retrospect the simplest
> problem in the whole soup.)
>
> It can be hard to see in this email, there are two spaces of
> indentation.  But it does not need to be 2, it can be 1, it can be
> 7. As long as they are the same for alfa and beta and greater than the
> previous indentation.  My grammar can only handle current indentation
> + 2. Which happens to be the pretty printed standard for yaml.
>
> In the grammar this jump in indentation is handled in line 15, 29, 45,
> 61 where for level 0, the next block is block1, for level 1 the next
> block is block 2 etc.  I am skipping half of the indentations since
> level 1 is two spaces, not one space.
>
> To handle all possible indentations block1? on line 15 must be
> replaced with block1ormore?. Defined as:
> block1ormore=blocka1|block2|block3|block4.  and block2? on line 29 be
> replaced with block2ormore=block2|block3|block4.  etc etc.
>
> Ok, so the size of these rules is now growing with approx  (n*(n+1)/2
>
> Still, this is sort of manageable, to handle 16 levels of indentation
> we have need 16 rules with a total of 120 rule parts.  Then you need
> the normal 16*12 per rules for each indentation level (I only have
> 4*12 in this example.)  Alas, now let me tell you about arrays.....
>
> Arrays in json and yaml and several other configuration languages, do
> not exist in xml since everything in xml are arrays.... ie there is
> always an ordering in xml (except for attributes, but the
> implementations still offer an ordering there as well but prevent
> duplicate keys.)
>
> The json {"foo":[1,2,3]} can be compact YAML with {foo:[1,2,3]} or
> context sensitive like this:
>
> foo:
>   - 1
>   - 2
>   - 3
>
> When json is converted to xml using xmq each array element is an
> unname/untype object named undersore.
>
> echo '{"foo":[1,2,3]}'  | xmq to-xmq --compact
> _{foo(A){_=1 _=2 _=3}}
>
> or if to-xml is used:
> <_><foo A=""><_>1</_><_>2</_><_>3</_></foo></_>
>
> If you now start with keyvalues in an array element, then a new object
> per array element is implicitly created surrounding the key values.
>
> foo:
>   - a: 1
>     b: 2
>   - c: 3
>     d: 4
>
> is equivalent to {"foo":[{"a":1,"b":2},{"c":3,"d":4}]}
> or in xmq: _{foo(A){_{a=1 b=2}_{c=3 d=4}}}
> or in xml: <_><foo
> A=""><_><a>1</a><b>2</b></_><_><c>3</c><d>4</d></_></foo></_>
>
> Ok, fine, BUT the indentation can be arbitrary here as well. So this is an
> identical yaml:
>
> foo:
>   -   a: 1
>       b: 2
>   -      c: 3
>          d: 4
>
> Ok, so now we have to count the spaces after the "- " ie, this is the
> other keyword (hyphen space).  so as to know which indentation block
> to jump into in the grammar. I.e. we have to add two numbers the
> indentation
> to the hyphen and the indentation to the key, eg "a:".
> We need to now this indentation to know which following lines belong to
> this object.
>
> So it would seem we need 15*15 rule parts again, to sum two indents up to
> a maximum of 16.
>
> It gets worse, for some reason you are allowed to start a new array on the
> same line
> as an array hyphen. This is permitted:
>
>  -       -   - 1
>          -     2
>          - 3
>  - 4
>
> And generates this json: [[[1],2,3],4]
> (Note that I indented the 2 with spaces that does not matter.)
>
> Ok, so not only do we need to add two indentations, we need to add any
> number, to fit
> within 16 characters and still have a single char for content, that will
> require 9 additions.
>
> - - - - - - - - 1
>
> generates this json [[[[[[[[1]]]]]]]]
> which through xmq to-xml becomes:
> <_ A=""><_ A=""><_ A=""><_ A=""><_ A=""><_ A=""><_ A=""><_
> A=""><_>1</_></_></_></_></_></_></_></_></_>
>
> (The A is the attribute that indicates that if you convert this xml back
> to json, it will use an array here.)
>
> echo '[[[[[[[[1]]]]]]]]' | xmq to-xml | xmq to-json
>
> I think the additions can be reused the but you will still need a lot of
> rules.
> You add 2+4 and get to indentation 6, then you can add 2 and get to 8.
> If you add 4+2 (using a third rule) and get to 6, then the same rule can
> be reused to add 2 and get to 8.
>
> I can see that the imperative source code for parsing yaml is not that
> hard.
> Just keep track of the indentation for the current character, if you reach
> a hyphen,
> push the token LIST+indent on the stack. When you read the next line, find
> the indent, if it is
> the same as the stack top, keep adding to the same list. If the
> indentation is less, then pop the
> stack.
>
> I have now convinced myself that I will not spend more time trying to
> parse yaml.
> There is a lot more yaml language "features" that I have not yet covered.
>
> Perhaps someone else will pick up the mantle and for a theoretical result
> just try to
> parse the recursive arrays of arrays, nothing else. I.e. yaml files only
> containing:
> space, newline, hyphen and digits.
>
> Cheers!
>
> //Fredrik
>

Received on Saturday, 16 May 2026 10:54:48 UTC