Indirect left recursion in iXML grammars from Bethan Tovey-Walsh on 2026-02-08 (public-ixml@w3.org from February 2026)

From: Bethan Tovey-Walsh <bytheway@linguacelta.com>
Date: Sun, 8 Feb 2026 16:15:48 +0000
To: ixml <public-ixml@w3.org>
Message-Id: <E4D6B086-0C39-4AD8-A88C-F7D1D4594F35@linguacelta.com>
Dear iXMLers,

Has anybody worked out a way to rewrite an iXML grammar to eliminate hidden left recursion, but then use renaming, suppressions, etc. to retain the original grammar's XML output?

I suspect that this is impossible in the general case, but I'd love to hear that I'm wrong!

Consider this grammar:

	ant_string: ant_string?, "ant".

This recognizes the string "ant" repeated any number of times. If we parse the input "antant" with this grammar using an iXML parser, here's our output:

	<ant_string>
	   <ant_string>ant</ant_string>
	   ant
	</ant_string>

There is a direct left recursion in this grammar. The rule ant_string can match simply the string "ant", but it can also match itself followed by the string "ant". (We could also express this same grammar in iXML as: 

	ant_string: ant_string, "ant" ; "ant". 

which makes these two possibilities more explicit.)

Left recursion poses a problem for some types of parsing algorithm. Luckily, there's a simple algorithm to rewrite a grammar to eliminate direct left recursion. This would give us a new grammar:

	ant_string: "ant", ant_string_1.
	ant_string_1: "ant", ant_string_1 ; ().

We've moved the recursion from the left to the right; the rule for ant_string can no longer begin with ant_string.

However, since iXML uses the patterns in the grammar to structure the output XML, we can't just rewrite in this way. If we did, we'd get the wrong XML output from our input grammar. Using the replacement grammar to parse "antant":

	<ant_string>
	   ant
	   <ant_string_1>
	      ant
	      <ant_string_1/>
	   </ant_string_1>
	</ant_string>

In at least some cases, we can use renaming to solve this problem, although it requires rewriting to a slightly different grammar:

	ant_string: placeholder_1>ant_string, placeholder_2.
	placeholder_1:  "ant".
	-placeholder_2: () ; -placeholder_1, placeholder_2.

We have eliminated the left recursion, and used renaming and suppression to reconstruct the original parse structure.

However, I can't find a way to reconstruct the original parse structure when dealing with a grammar like this one, which has indirect left recursion:

	ant_string: bat_string, "ant".
	bat_string: "bat", cat_string ; ant_string, cat_string.
	cat_string: "cat".

We can parse the string "batcatantcatant" with this grammar to get:

	<ant_string>
	   <bat_string>
	      <ant_string>
	         <bat_string>
	            bat
	            <cat_string>cat</cat_string>
	         </bat_string>
	         ant
	      </ant_string>
	      <cat_string>cat</cat_string>
	   </bat_string>
	   ant
	</ant_string>

In this case, none of the individual rules is directly left recursive. However, the first element of ant_string is bat_string, and bat_string can begin with ant_string. Top-down parsing algorithms may have the same problem here as they have with direct left recursion: they go into an infinite loop trying to resolve ant_string, because they need to resolve bat_string in order to do so, but in order to resolve bat_string they must first resolve ant_string...

The algorithm to resolve indirect left recursion requires first rewriting the grammar so that any left recursions are direct, and then applying the algorithm to eliminate direct left recursion. So we first need something like this:

	ant_string: ant_string, cat_string, "ant"; bat_string, "ant".
	bat_string: "bat", cat_string.
	cat_string: "cat".

Now, the only left recursion is a direct left recursion in the rule for ant_string. We can now use the direct left recursion algorithm to get this:

	ant_string: bat_string, "ant", placeholder_1.
	placeholder_1: cat_string, "ant", placeholder_1 ; (). 
	bat_string: "bat", cat_string.
	cat_string: "cat".

Using this to parse "batcatantcatant", we get:

	<ant_string>
	   <bat_string>
	      bat
	      <cat_string>cat</cat_string>
	   </bat_string>
	   ant
   <placeholder_1>
	      <cat_string>cat</cat_string>
	      ant
	      <placeholder_1/>
	   </placeholder_1>
	</ant_string>

This is obviously quite different from the XML output achieved using the original grammar.

I may be missing something really obvious, but I can't work out a way to rewrite this grammar to eliminate left recursion, whilst also reconstructing the original XML structure.

If anyone has a solution, or even any questions or suggestions that might help me to work through this, I'd be immensely grateful. Alternatively, if you can prove that this is impossible for at least some grammars, I'd be grateful for that too, so that I can stop banging my head against this wall.

Very best,

Bethan
___________________________________________________ 
Dr. Bethan Tovey-Walsh 

linguacelta.com

Golygydd | Editor geirfan.cymru

Croeso i chi ysgrifennu ataf yn y Gymraeg.
Received on Sunday, 8 February 2026 16:16:18 UTC