Re: Document on Ambiguity from Bethan Tovey-Walsh on 2026-02-04 (public-ixml@w3.org from February 2026)

From: Bethan Tovey-Walsh <bytheway@linguacelta.com>
Date: Wed, 4 Feb 2026 20:07:04 +0000
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: ixml <public-ixml@w3.org>
Message-Id: <BE03DE12-8C1F-47BA-A899-8B4F8DB8E7B8@linguacelta.com>
Thanks for writing this up, Steven. In the spirit of "kicking off discussion", I have a couple of thoughts. This is going to be a bit long, but I'll start with a TL;DR summary which sets out three claims:

1. Ambiguity in iXML is a property of the grammar, not of the input. An input is ambiguous only at second hand, because it is understood as such according to the structure defined in the grammar.

2. There are good reasons to allow ambiguous grammars in iXML, beyond simple convenience. 

3. The aim of an iXML grammar author or user may be to identify ambiguity where it exists, rather than to produce a single, unambiguous parse of a given input.

I'll give my arguments in support of these claims in the rest of the email, but I'm going to start by defining my terms - some of the terminology of grammars and parsing is fairly new to me, and I guess that it may also be fairly new to some others on this list.

0. Defining terms

A grammar describes a language. The grammar is composed of a set of rules (a.k.a. "productions"). The language defined by a grammar is composed of the valid sentences which can be constructed (a.k.a. "derived") using those rules.

Let's take this iXML grammar:

 S: subject, " ", verb, " ", object.
 subject: noun.
 object: noun.
 noun: "cats"; "dogs".
 verb: "like".

This grammar defines the language containing the following valid sentences:

 cats like cats
 cats like dogs
 dogs like dogs
 dogs like cats

Each of these sentences can be derived by following a path through the grammar, replacing each nonterminal name with the contents of its associated rule until only terminal symbols remain. A terminal is a literal string, which cannot be replaced, and which must represent actual characters in the sentence.
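
As a rough illustration of derivation, here is a minimal Python sketch that expands nonterminals until only literal strings are left. The dictionary encoding of the grammar and the function name `derive` are assumptions made for this example; they are not part of iXML or of any iXML processor.

```python
from itertools import product

# Hand transcription of the grammar above (a hypothetical encoding):
# each nonterminal maps to a list of alternatives, and each alternative
# is a sequence of symbols. Any symbol that is not a key is a terminal.
GRAMMAR = {
    "S": [["subject", " ", "verb", " ", "object"]],
    "subject": [["noun"]],
    "object": [["noun"]],
    "noun": [["cats"], ["dogs"]],
    "verb": [["like"]],
}

def derive(symbol):
    """Return every string derivable from `symbol`.

    Only suitable for grammars without recursion, where the set of
    derivable strings is finite.
    """
    if symbol not in GRAMMAR:
        return [symbol]  # a terminal stands for itself
    results = []
    for alternative in GRAMMAR[symbol]:
        # Expand each symbol of the alternative, then combine the pieces.
        parts = [derive(s) for s in alternative]
        results.extend("".join(combo) for combo in product(*parts))
    return results

for sentence in derive("S"):
    print(sentence)
```

Under this encoding, `derive("S")` yields the four sentences listed above and nothing else.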

Using the grammar, we can perform three common tasks. The first two involve the analysis of an input string:

- we can build a recognizer, which simply confirms whether or not the input string is a valid sentence in the language;

- we can build a parser, which is a tool that either fails (if the input is not a valid sentence), or assigns substrings of a valid sentence to parts of the nested structure defined in the grammar (e.g. by telling us that "cats" is a subject and "dogs" is an object in the second sentence I listed above). 

The third task takes only the grammar as input:

- we can build a generator, which will derive valid sentences belonging to the language defined by the grammar.

For a grammar like the one above, a generator could quickly produce all the sentences of the language. A grammar like the following, on the other hand, would cause a generator to keep running forever:

S: subject, " ", verb, " ", object.
subject: noun, (" ", clause)?.
object: noun, (" ", clause)?.
clause: pronoun, " ", verb, " ", object.
pronoun: "who".
verb: "like" | "hate".
noun: "cats" | "people".

This grammar is recursive (an object may contain a clause, which in turn contains an object), so a derivation can theoretically keep adding elements to a sentence without reaching an end point:

 people like cats
 people who like cats like people
 people who like cats like cats who hate people
 cats who like people who hate cats hate people who like cats who hate cats who hate people
 ...

You get the picture. Strictly speaking, a language contains only the *finite* sentences derivable from the grammar. However, this grammar can generate arbitrarily long sentences so, in practice, a generator built from it could never stop if it were asked to produce every valid sentence. 
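
One way to see the growth is to give the generator a bound. The following Python sketch limits how deeply clauses may nest; the bound, the function names, and the hard-coded word lists are assumptions of this example, not features of iXML.

```python
from itertools import product

NOUNS = ["cats", "people"]
VERBS = ["like", "hate"]

def noun_phrases(depth):
    """Strings for subject/object with at most `depth` nested clauses."""
    phrases = list(NOUNS)  # noun with no clause
    if depth > 0:
        # noun, " ", clause  where  clause: "who", " ", verb, " ", object
        for noun, verb, inner in product(NOUNS, VERBS, noun_phrases(depth - 1)):
            phrases.append(f"{noun} who {verb} {inner}")
    return phrases

def sentences(depth):
    """All sentences whose subject and object nest at most `depth` clauses."""
    phrases = noun_phrases(depth)
    return [f"{s} {v} {o}" for s, v, o in product(phrases, VERBS, phrases)]

for depth in range(3):
    print(depth, len(sentences(depth)))  # prints 0 8, then 1 200, then 2 3528
```

Every increase in the bound multiplies the number of sentences, which is why an unbounded generator asked for the whole language would never finish.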

A parser should have no problem with this grammar, though, given a finite input string, because it only has to follow a recursive path for as long as the input requires it to. If we use an iXML processor to parse that fourth sentence against the grammar, we get:

<S>
   <subject>
      <noun>cats</noun> 
      <clause>
         <pronoun>who</pronoun> 
         <verb>like</verb> 
         <object>
            <noun>people</noun> 
            <clause>
               <pronoun>who</pronoun> 
               <verb>hate</verb> 
               <object>
                  <noun>cats</noun>
               </object>
            </clause>
         </object>
      </clause>
   </subject> 
   <verb>hate</verb> 
   <object>
      <noun>people</noun> 
      <clause>
         <pronoun>who</pronoun> 
         <verb>like</verb> 
         <object>
            <noun>cats</noun> 
            <clause>
               <pronoun>who</pronoun> 
               <verb>hate</verb> 
               <object>
                  <noun>cats</noun> 
                  <clause>
                     <pronoun>who</pronoun> 
                     <verb>hate</verb> 
                     <object>
                        <noun>people</noun>
                     </object>
                  </clause>
               </object>
            </clause>
         </object>
      </clause>
   </object>
</S>


1. Ambiguity is a property of the grammar.

Consider this grammar:

 S: noun_phrase ; sentence.
 noun_phrase: noun, " ", preposition, " ", name.
 sentence: noun, " ", verb, " ", name.
 noun: "cats" ; "people".
 verb: "like".
 preposition: "like".
 name: "Bob".

The language defined by this grammar consists of exactly two sentences:

 people like Bob
 cats like Bob

Each of these sentences can be derived in two different ways, so a parser built from this grammar cannot produce an unambiguous parse of either of the valid input strings it recognizes. Using an iXML processor, we get these two possible parses for the first sentence:

<S xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <noun_phrase>
      <noun>people</noun> 
      <preposition>like</preposition> 
      <name>Bob</name>
   </noun_phrase>
</S>

<S xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <sentence>
      <noun>people</noun> 
      <verb>like</verb> 
      <name>Bob</name>
   </sentence>
</S>

Here's another grammar:

 S: noun_phrase ; sentence.
 noun_phrase: noun_1, " ", preposition, " ", name.
 sentence: noun_2, " ", verb, " ", name.
 noun_1>noun: "people".
 noun_2>noun: "cats".
 verb: "like".
 preposition: "like".
 name: "Bob".

This grammar defines exactly the same language. However, in this case there is exactly one parse for each sentence:

<S>
   <noun_phrase>
      <noun>people</noun> 
      <preposition>like</preposition> 
      <name>Bob</name>
   </noun_phrase>
</S>

<S>
   <sentence>
      <noun>cats</noun> 
      <verb>like</verb> 
      <name>Bob</name>
   </sentence>
</S>

And here's a third grammar. Again, it defines exactly the same language:

 S: noun_phrase ; sentence.
 noun_phrase: noun_1, " ", preposition, " ", name.
 sentence: noun, " ", verb, " ", name.
 noun_1>noun: "people".
 noun_2>noun: "cats".
 noun: noun_1 ; noun_2.
 verb: "like".
 preposition: "like".
 name: "Bob".

This time, one of the sentences of the language can be derived in two different ways:

<S xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <noun_phrase>
      <noun>people</noun> 
      <preposition>like</preposition> 
      <name>Bob</name>
   </noun_phrase>
</S>
<S xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <sentence>
      <noun>
         <noun>people</noun>
      </noun> 
      <verb>like</verb> 
      <name>Bob</name>
   </sentence>
</S>

The other has a single derivation:

<S>
   <sentence>
      <noun>
         <noun>cats</noun>
      </noun> 
      <verb>like</verb> 
      <name>Bob</name>
   </sentence>
</S>
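
The three grammars can also be compared mechanically. The Python sketch below counts derivations by brute force; the tuple-based grammar encoding and the name `count_parses` are assumptions made for this example, and the counter only handles grammars without left recursion or empty rules. The `>noun` renamings are left out because they affect only the serialized element names, not the number of derivations.

```python
from functools import lru_cache

def count_parses(grammar, start, text):
    """Count the distinct derivations of `text` from `start`."""
    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        if symbol not in grammar:  # terminal: must match text[i:j] exactly
            return int(text[i:j] == symbol)
        return sum(count_sequence(alt, i, j) for alt in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(symbols, i, j):
        if not symbols:
            return int(i == j)
        # Try every split point for the first symbol of the sequence.
        return sum(
            count_symbol(symbols[0], i, k) * count_sequence(symbols[1:], k, j)
            for k in range(i, j + 1)
        )

    return count_symbol(start, 0, len(text))

COMMON = {
    "S": [("noun_phrase",), ("sentence",)],
    "verb": [("like",)],
    "preposition": [("like",)],
    "name": [("Bob",)],
}
FIRST = {**COMMON,
    "noun_phrase": [("noun", " ", "preposition", " ", "name")],
    "sentence": [("noun", " ", "verb", " ", "name")],
    "noun": [("cats",), ("people",)],
}
SECOND = {**COMMON,
    "noun_phrase": [("noun_1", " ", "preposition", " ", "name")],
    "sentence": [("noun_2", " ", "verb", " ", "name")],
    "noun_1": [("people",)],
    "noun_2": [("cats",)],
}
THIRD = {**SECOND,
    "sentence": [("noun", " ", "verb", " ", "name")],
    "noun": [("noun_1",), ("noun_2",)],
}

for label, grammar in [("first", FIRST), ("second", SECOND), ("third", THIRD)]:
    counts = [count_parses(grammar, "S", s) for s in ("people like Bob", "cats like Bob")]
    print(label, counts)  # first [2, 2], second [1, 1], third [2, 1]
```

The same two strings give three different patterns of ambiguity, depending only on which grammar is used.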

It isn't always possible to tell whether a grammar has multiple derivations for any, or all, of the sentences of the language it defines (for context-free grammars in general, the question is undecidable). But some grammars produce ambiguous parses for every sentence they recognize; some produce only unambiguous parses; and some will produce ambiguous parses for a subset of the sentences they derive, and unambiguous parses for the rest.

It makes sense, therefore, to talk about an "ambiguous grammar", or a "potentially ambiguous grammar", or an "unambiguous grammar". But it doesn't make sense to talk about an "ambiguous input", unless we specify the grammar which is used to judge that ambiguity.

I think that some confusion here comes from our natural sense that some strings are inherently ambiguous. 

Think about the string "02-12-2020". Does it represent a U.S.-style date (twelfth of February, 2020) or a U.K.-style date (second of December, 2020)? We can't tell, at least without further context, so we naturally think that this is an inherently ambiguous string.

But it's only inherently ambiguous if we look at it with the pre-existing assumption that it is a date. And, if we do that, we are already parsing it against an informal grammar in our head, based on our almost unconscious cultural knowledge. If we wrote out the grammar in iXML, it would probably be something rather like:

 date: us_date ; uk_date.
 us_date: month, "-", day, "-", year.
 uk_date: day, "-", month, "-", year.
 day: "0", ["1"-"9"] ; ["12"], ["0"-"9"] ; "3", ["01"].
 month: "0", ["1"-"9"] ; "1", ["0"-"2"].
 year: ["12"], ["0"-"9"], ["0"-"9"], ["0"-"9"].

Against such a grammar, the string "02-12-2020" produces these parses:

<date xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <us_date>
      <month>02</month>-
      <day>12</day>-
      <year>2020</year>
   </us_date>
</date>

<date xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <uk_date>
      <day>02</day>-
      <month>12</month>-
      <year>2020</year>
   </uk_date>
</date>

But what if we already know that the string represents a UK-style date? In that case, it's entirely unambiguous to our internal grammar. We can reduce the iXML grammar to something like this:

 date: day, "-", month, "-", year.
 day: "0", ["1"-"9"] ; ["12"], ["0"-"9"] ; "3", ["01"].
 month: "0", ["1"-"9"] ; "1", ["0"-"2"].
 year: ["12"], ["0"-"9"], ["0"-"9"], ["0"-"9"].

And we'll get back a single parse:

<date>
   <day>02</day>-
   <month>12</month>-
   <year>2020</year>
</date>
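
The contrast between the two date grammars can be approximated in Python with regular expressions. This is only a sketch: the patterns are hand translations of the `day`, `month` and `year` rules above, and the names `readings`, `TWO_STYLE_GRAMMAR` and `UK_ONLY_GRAMMAR` are invented for the example. Because every field has a fixed width, each alternative matches a string in at most one way, so the number of matching alternatives equals the number of parses.

```python
import re

DAY = r"(?:0[1-9]|[12][0-9]|3[01])"
MONTH = r"(?:0[1-9]|1[0-2])"
YEAR = r"[12][0-9]{3}"

# date: us_date ; uk_date.
TWO_STYLE_GRAMMAR = {
    "us_date": re.compile(f"{MONTH}-{DAY}-{YEAR}"),
    "uk_date": re.compile(f"{DAY}-{MONTH}-{YEAR}"),
}
# date: day, "-", month, "-", year.
UK_ONLY_GRAMMAR = {
    "date": re.compile(f"{DAY}-{MONTH}-{YEAR}"),
}

def readings(text, grammar):
    """Return the names of the alternatives that match the whole string."""
    return [name for name, pattern in grammar.items() if pattern.fullmatch(text)]

print(readings("02-12-2020", TWO_STYLE_GRAMMAR))  # ['us_date', 'uk_date']
print(readings("02-12-2020", UK_ONLY_GRAMMAR))    # ['date']
print(readings("25-12-2020", TWO_STYLE_GRAMMAR))  # ['uk_date']
```

The last call shows that even the two-style grammar gives a single reading when the first field cannot be a month.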

But let's go further. What if that string isn't a date at all? What if it's the password for all my secret internet accounts? (It isn't.) In that case, we could have this grammar:

 password: character+.
 character: digit ; alpha ; special.
 alpha: ["a"-"z"].
 digit: ["0"-"9"].
 special: ["-&'^!?+=_@"].

(For the avoidance of confusion, I want to clarify that I have *no* secret internet accounts.)

In this case, parsing the string still results in an unambiguous parse, but with a completely different structure:

<password>
   <character>
      <digit>0</digit>
   </character>
   <character>
      <digit>2</digit>
   </character>
   <character>
      <special>-</special>
   </character>
   <character>
      <digit>1</digit>
   </character>
   <character>
      <digit>2</digit>
   </character>
   <character>
      <special>-</special>
   </character>
   <character>
      <digit>2</digit>
   </character>
   <character>
      <digit>0</digit>
   </character>
   <character>
      <digit>2</digit>
   </character>
   <character>
      <digit>0</digit>
   </character>
</password>

We could produce a very ambiguous result with a small change to the grammar:

 password: character+.
 character: digit ; alpha ; special.
 alpha: ["a"-"z"].
 digit: ["0"-"9"].
 special: ["-&'^!?+=_@"] ; ["0"-"9"].

There are 256 possible parses of "02-12-2020" with this grammar. Yet it's possible to cause the same grammar to produce an unambiguous result, for example with the string "c@pyb@r@":

<password>
   <character>
      <alpha>c</alpha>
   </character>
   <character>
      <special>@</special>
   </character>
   <character>
      <alpha>p</alpha>
   </character>
   <character>
      <alpha>y</alpha>
   </character>
   <character>
      <alpha>b</alpha>
   </character>
   <character>
      <special>@</special>
   </character>
   <character>
      <alpha>r</alpha>
   </character>
   <character>
      <special>@</special>
   </character>
</password>

By avoiding digits, we get an unambiguous parse. Digits cause ambiguity, not because there's anything inherently ambiguous about a digit, but because our grammar says that the digits 0-9 can either be a "digit" or a "special".
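
The figure of 256 can be checked with a short calculation: each character is parsed independently, so the number of parses is the product of the number of categories each character can belong to. The Python sketch below assumes the character classes as written in the two password grammars above; the function names are invented for the example.

```python
from math import prod

SPECIAL_CHARS = "-&'^!?+=_@"

def categories(ch, digits_are_special):
    """List the rules (digit, alpha, special) that can match `ch`."""
    is_digit = "0" <= ch <= "9"
    found = []
    if is_digit:
        found.append("digit")
    if "a" <= ch <= "z":
        found.append("alpha")
    if ch in SPECIAL_CHARS or (digits_are_special and is_digit):
        found.append("special")
    return found

def password_parse_count(text, digits_are_special):
    """Number of parses of `text` against password: character+."""
    if not text:
        return 0  # character+ needs at least one character
    return prod(len(categories(ch, digits_are_special)) for ch in text)

print(password_parse_count("02-12-2020", digits_are_special=False))  # 1
print(password_parse_count("02-12-2020", digits_are_special=True))   # 256
print(password_parse_count("c@pyb@r@", digits_are_special=True))     # 1
```

The string "02-12-2020" has eight digits with two readings each and two hyphens with one reading each, giving 2^8 = 256.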

To summarize my arguments in support of my first claim:

- ambiguity is a question of the structures defined by a grammar;
- an input string will result in an ambiguous parse if and only if the grammar can derive that same string by constructing at least two different structural representations of it;
- an input has no pre-existing structure, and it therefore cannot be, in itself, ambiguous;
- however, given a specific grammar, we can say that an input string is ambiguous as regards that grammar, by which we mean that the grammar admits multiple derivations for that string.

This is why I'm arguing that ambiguity, in iXML as elsewhere, is primarily a matter of the grammar, and only attaches to an input at second hand. The ambiguity of any given input string is an indirect ambiguity, identified when it is parsed against the grammar. The parse exposes multiple different ways of imposing an underlying structure on the same string, but those structures disappear as soon as we take the string on its own terms again.

2. There are good reasons to allow ambiguous grammars in iXML, beyond simple convenience.

The two reasons given by Steven for permitting ambiguity are, in simple terms: usability; and the fact that two different syntactic structures can, in iXML, result in the same XML output. 

I don't think this exhausts the reasons for allowing ambiguity in iXML. 

To give one possible example, as a linguist, I might be looking for examples of ambiguous English-language sentences. Let's say I write this grammar:

 sentence: subject, " ", verb_phrase.
 verb_phrase: art_verb, " ", (simple_object, " ", adverbial ; complex_object) ; non_art_verb, " ", complex_object.
 subject: "He" ; "She" ; "Eli"; "Sam".
 art_verb>verb: "drew" ; "sketched" ; "depicted".
 non_art_verb>verb: "saw" ; "recognized" ; "imagined".
 simple_object: ("the" ; "a"), " ", ("woman"; "girl"; "man"; "boy"; "person").
 adverbial: manner_word, " ", art_medium.
 complex_object: simple_object, " ", adverbial>object_modifier.
 manner_word: "with" ; "using".
 art_medium: "the charcoal" ; "the pencils" ; "the pastels" ; "the ink" ; "the marker pens".

(My apologies to the entire discipline of linguistics. This is not a linguistically-accurate way of labelling the structures I've identified, and I've skated over some complexities. Mea culpa.) 

I can now write a generator that produces valid sentences of the language defined by the grammar, like:

 He drew the man with the charcoal
 Sam imagined a person using the marker pens

(I haven't worked out how many sentences there are in the language. More than seven; fewer than infinity. Not great; not terrible.)

Now that I have the sentences, I can parse them, to figure out which are ambiguous and which are not. Essentially, this boils down to whether we have a verb of artistic creation, like "draw", or some other verb, like "recognize". In the former case, we either have someone using the art materials to draw someone else; or we have someone drawing another person, and that other person has the art materials. In the latter case, the art materials are unambiguously with the other person. You can't use charcoal to imagine someone, or use pastels to see someone (at least not unless we're in some kind of weird poetry thing, which I'm just assuming we're not for the purposes of this example).

The first sentence I just listed is ambiguous:

<sentence xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <subject>He</subject> 
   <verb_phrase>
      <verb>drew</verb> 
      <simple_object>the man</simple_object> 
      <adverbial>
         <manner_word>with</manner_word> 
         <art_medium>the charcoal</art_medium>
      </adverbial>
   </verb_phrase>
</sentence>

<sentence xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <subject>He</subject> 
   <verb_phrase>
      <verb>drew</verb> 
      <complex_object>
         <simple_object>the man</simple_object> 
         <object_modifier>
            <manner_word>with</manner_word> 
            <art_medium>the charcoal</art_medium>
         </object_modifier>
      </complex_object>
   </verb_phrase>
</sentence>

The second sentence is not ambiguous:

<sentence>
   <subject>Sam</subject> 
   <verb_phrase>
      <verb>imagined</verb> 
      <complex_object>
         <simple_object>a person</simple_object> 
         <object_modifier>
            <manner_word>using</manner_word> 
            <art_medium>the marker pens</art_medium>
         </object_modifier>
      </complex_object>
   </verb_phrase>
</sentence>
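
The generate-then-classify workflow can be sketched in Python without a real parser. The sketch enumerates every derivation of the grammar above (one string per derivation) and counts how many derivations produce each string. The word lists are transcribed by hand, and the function and variable names are assumptions of this example.

```python
from collections import Counter
from itertools import product

SUBJECTS = ["He", "She", "Eli", "Sam"]
ART_VERBS = ["drew", "sketched", "depicted"]
NON_ART_VERBS = ["saw", "recognized", "imagined"]
ARTICLES = ["the", "a"]
PEOPLE = ["woman", "girl", "man", "boy", "person"]
MANNER_WORDS = ["with", "using"]
ART_MEDIA = ["the charcoal", "the pencils", "the pastels", "the ink", "the marker pens"]

def derivations():
    """Yield one sentence string per derivation of the grammar."""
    simple_objects = [f"{a} {p}" for a, p in product(ARTICLES, PEOPLE)]
    adverbials = [f"{m} {x}" for m, x in product(MANNER_WORDS, ART_MEDIA)]
    # The three alternatives of verb_phrase. Each one produces the string
    # "verb simple_object adverbial"; only the structure differs.
    alternatives = [
        ART_VERBS,      # art_verb, simple_object, adverbial
        ART_VERBS,      # art_verb, complex_object
        NON_ART_VERBS,  # non_art_verb, complex_object
    ]
    for verbs in alternatives:
        for s, v, o, adv in product(SUBJECTS, verbs, simple_objects, adverbials):
            yield f"{s} {v} {o} {adv}"

parse_counts = Counter(derivations())
ambiguous = [s for s, n in parse_counts.items() if n > 1]
print(len(parse_counts), len(ambiguous))  # 2400 1200
```

Under this transcription the language has 2400 sentences, of which 1200 (those with a verb of artistic creation) have two derivations.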

This leads me to my final point.

3. The aim of an iXML grammar author or user may be to identify ambiguity where it exists, rather than to produce a single, unambiguous parse of a given input.

The grammar I've just described, used as input to a generator, is a quick and easy way for me to come up with a large set of example sentences. I might want to do such a thing when I teach a class or give a talk. It also allows me to categorize the generated sentences as ambiguous or unambiguous quite easily. (Even an iXML parser which doesn't permit me to retrieve both of the parses in case of ambiguity will at least inform me that the parse is ambiguous.)

I argued above that there's no such thing as an "ambiguous input", in and of itself. However, it's still the case that we may be dealing with input data that are ambiguous, or potentially so, with regard to the structure we have expressed in our iXML grammar. This ambiguity may be an annoyance, as with ambiguous dates, where it is not possible to know for certain whether "02" means "second" or "February". But, in other cases, the ambiguity is a feature, not a bug. Ambiguity isn't always something we want to make go away. For a linguist, for example, it's often the actual focus of her interest.

I hope this successfully illustrates one potential use case for an ambiguous grammar where the ambiguity is not simply a way of making the grammar easier to write, and where the aim is not to hide that ambiguity in the output. I'd be very surprised if there weren't more possible use cases out there.

Regards,

BTW
___________________________________________________ 
Dr. Bethan Tovey-Walsh 

linguacelta.com

Golygydd | Editor geirfan.cymru

Croeso i chi ysgrifennu ataf yn y Gymraeg.

> On 3 Feb 2026, at 13:49, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
> Thanks for your comments. I'm not sure whether this document is a living one, or just to kick off a discussion, but I will bear your remarks in mind should I update it.
> 
> Best wishes,
> 
> Steven
> 
> On Tuesday 20 January 2026 13:34:11 (+01:00), Graydon Saunders wrote:
> 
>  > _It is a potential source of technical debt._
> 
> For a document otherwise written in (impressively!) simple language, hauling in "technical debt", a term with a broad variety of unrecognized variation in meaning, does not seem desirable.
> 
> I perceive ambiguity as arising from some combination of "you don't know what you want your grammar to do" and "you don't know how to make a grammar that does what you want" (which has subcases of "a grammar can't do that" and "learn more about grammar mechanisms") and it might be useful to make all of these explicit in the text.
> 
> On Tue, Jan 20, 2026, at 07:06, Steven Pemberton wrote:
>> 
>> 
>> I had an action to produce a document addressing ambiguity, and here it is.
>> 
>> Steven
>> 
>> Attachments:
>>     • ambiguity.html
> 
>
Received on Wednesday, 4 February 2026 20:07:24 UTC