Re: A grammar challange regarding that pesky YAML format from Fredrik Öhrström on 2026-05-16 (public-ixml@w3.org from May 2026)

From: Fredrik Öhrström <oehrstroem@gmail.com>
Date: Sat, 16 May 2026 12:15:08 +0200
To: Alan Painter <alan.painter@gmail.com>
Cc: Peter Flynn <peter@silmaril.ie>, public-ixml@w3.org
Message-ID: <CALZT+jCbr=Uf3NivV=z2BDB9sErG-7S469SLg0EzX3JQtB1boA@mail.gmail.com>
I have now spent more time than is really healthy on the problem of
parsing yaml files.

There is now an ixml grammar here that can parse a subset of yaml:
You can curl it from here: http://libxmq.org/library/conf/yamlsubset.ixml
Or view it here:
https://github.com/libxmq/xmq/blob/master/library/conf/yamlsubset.ixml

It can parse xmq's own xmq/.github/workflows/build_ubuntu.yml file.
https://github.com/libxmq/xmq/blob/master/.github/workflows/build_ubuntu.yml

Nice, but this yml file is pretty printed and uses only a very very
very very small subset of yaml.

We already know parsing requires multiple copies of the same rule for
each indentation.

But it seems that for full parsing we also need several other set of
rules for each indentation each containing all possible future
indentations.

I.e. if python requires a O(C*n) grammar rules (where n is the deepest
indentation), yaml will require rules that grow (in size, or nr) with
O(C*n*n).

Why is this? Well, let me give some background.

In yaml you can write a context free compact form similar to json in.
{ alfa: 111, beta: 222 }
I.e. an unnamed object with two key-values. This translates to json:
{ "alfa": 111, "beta": 222 }
and if you translate the json to xmq with:
echo '{ "alfa": 111, "beta": 222 }' | xmq to-xmq --compact
you get:

_{alfa=111 beta=222}

which is equivalent to this xml:

echo '{ "alfa": 111, "beta": 222 }' | xmq to-xml

<_><alfa>111</alfa><beta>222</beta></_>

Here I use the xmq way of transforming json into xml.

Anyway, if YAML only was this, it would be a piece of cake (context free
cake).
Alas this is what it looks like normally:

alfa: 111
beta: 222

(ie no indentation, an implicit root object is created, _ i xml.)
Ok, then how to do an inner object, ie json: {"foo: { "alfa": 111, "beta":
222}}

foo:
  alfa: 111
  beta: 222

A key value in YAML is something colon space something, ie colon
followed by space is a reserved keyword. The rule val in the grammar
will not parse text containing a colon space.  (This is the answer to
my first question in this email chain. In retrospect the simplest
problem in the whole soup.)

It can be hard to see in this email, there are two spaces of
indentation.  But it does not need to be 2, it can be 1, it can be
7. As long as they are the same for alfa and beta and greater than the
previous indentation.  My grammar can only handle current indentation
+ 2. Which happens to be the pretty printed standard for yaml.

In the grammar this jump in indentation is handled in line 15, 29, 45,
61 where for level 0, the next block is block1, for level 1 the next
block is block 2 etc.  I am skipping half of the indentations since
level 1 is two spaces, not one space.

To handle all possible indentations block1? on line 15 must be
replaced with block1ormore?. Defined as:
block1ormore=blocka1|block2|block3|block4.  and block2? on line 29 be
replaced with block2ormore=block2|block3|block4.  etc etc.

Ok, so the size of these rules is now growing with approx  (n*(n+1)/2

Still, this is sort of manageable, to handle 16 levels of indentation
we have need 16 rules with a total of 120 rule parts.  Then you need
the normal 16*12 per rules for each indentation level (I only have
4*12 in this example.)  Alas, now let me tell you about arrays.....

Arrays in json and yaml and several other configuration languages, do
not exist in xml since everything in xml are arrays.... ie there is
always an ordering in xml (except for attributes, but the
implementations still offer an ordering there as well but prevent duplicate
keys.)

The json {"foo":[1,2,3]} can be compact YAML with {foo:[1,2,3]} or
context sensitive like this:

foo:
  - 1
  - 2
  - 3

When json is converted to xml using xmq each array element is an
unname/untype object named undersore.

echo '{"foo":[1,2,3]}'  | xmq to-xmq --compact
_{foo(A){_=1 _=2 _=3}}

or if to-xml is used:
<_><foo A=""><_>1</_><_>2</_><_>3</_></foo></_>

If you now start with keyvalues in an array element, then a new object
per array element is implicitly created surrounding the key values.

foo:
  - a: 1
    b: 2
  - c: 3
    d: 4

is equivalent to {"foo":[{"a":1,"b":2},{"c":3,"d":4}]}
or in xmq: _{foo(A){_{a=1 b=2}_{c=3 d=4}}}
or in xml: <_><foo
A=""><_><a>1</a><b>2</b></_><_><c>3</c><d>4</d></_></foo></_>

Ok, fine, BUT the indentation can be arbitrary here as well. So this is an
identical yaml:

foo:
  -   a: 1
      b: 2
  -      c: 3
         d: 4

Ok, so now we have to count the spaces after the "- " ie, this is the
other keyword (hyphen space).  so as to know which indentation block
to jump into in the grammar. I.e. we have to add two numbers the indentation
to the hyphen and the indentation to the key, eg "a:".
We need to now this indentation to know which following lines belong to
this object.

So it would seem we need 15*15 rule parts again, to sum two indents up to a
maximum of 16.

It gets worse, for some reason you are allowed to start a new array on the
same line
as an array hyphen. This is permitted:

 -       -   - 1
         -     2
         - 3
 - 4

And generates this json: [[[1],2,3],4]
(Note that I indented the 2 with spaces that does not matter.)

Ok, so not only do we need to add two indentations, we need to add any
number, to fit
within 16 characters and still have a single char for content, that will
require 9 additions.

- - - - - - - - 1

generates this json [[[[[[[[1]]]]]]]]
which through xmq to-xml becomes:
<_ A=""><_ A=""><_ A=""><_ A=""><_ A=""><_ A=""><_ A=""><_
A=""><_>1</_></_></_></_></_></_></_></_></_>

(The A is the attribute that indicates that if you convert this xml back to
json, it will use an array here.)

echo '[[[[[[[[1]]]]]]]]' | xmq to-xml | xmq to-json

I think the additions can be reused the but you will still need a lot of
rules.
You add 2+4 and get to indentation 6, then you can add 2 and get to 8.
If you add 4+2 (using a third rule) and get to 6, then the same rule can be
reused to add 2 and get to 8.

I can see that the imperative source code for parsing yaml is not that hard.
Just keep track of the indentation for the current character, if you reach
a hyphen,
push the token LIST+indent on the stack. When you read the next line, find
the indent, if it is
the same as the stack top, keep adding to the same list. If the indentation
is less, then pop the
stack.

I have now convinced myself that I will not spend more time trying to parse
yaml.
There is a lot more yaml language "features" that I have not yet covered.

Perhaps someone else will pick up the mantle and for a theoretical result
just try to
parse the recursive arrays of arrays, nothing else. I.e. yaml files only
containing:
space, newline, hyphen and digits.

Cheers!

//Fredrik
Received on Saturday, 16 May 2026 10:15:40 UTC