Subject: Re: The semantics of the disambiguation constructs

From: Bethan Tovey-Walsh <bytheway@linguacelta.com>
Date: Mon, 2 Mar 2026 17:11:36 +0000
To: John Lumley <john@saxonica.com>
Cc: ixml <public-ixml@w3.org>
Message-Id: <125D09E9-109B-4F89-9AF7-242727B0B979@linguacelta.com>

In that case, I think you can also use your ¬ for lookahead.

Example:

I have a long string made up of the characters a-z, and I want to parse out the names of plants or animals that start with "cat", including "cat" itself.

However, that means that I need to match "cat" if and only if it isn't part of one of the longer plant/animal names.

Using a negative lookahead, which I'll notate with ! for now, we could do this:

 things: (animal ; plant ; char)+.
 animal: "cat"!lookahead ; catanimal.
 plant: "catnip" ; "cattails" ; "catmint".
 -lookahead: "nip" ; "tails" ; "mint" ; "fish" ; "erpillar".
 -catanimal: "caterpillar" ; "catfish".
 -char: -["ab"; "d"-"z"] ; "c"!catstring.
 -catstring: "at" , lookahead?.

The string "cat" is only tagged as an animal if it is not followed by a string that would complete one of the longer animal or plant names. A "c" is only a char if it doesn't also begin any of the animal or plant names.
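As a rough cross-check of the lookahead version, here is a minimal sketch in Python, using the negative lookahead of Python's re module rather than ixml. The name tag_cat_names and the regular expression are illustrative assumptions, not part of any ixml implementation. The sketch only mimics the animal and plant rules and simply skips every other character instead of modelling char.

```python
import re

# Plant and animal alternatives mirror the grammar above; "cat" alone counts as
# an animal only when the negative lookahead (?!...) rules out a longer name.
CAT_NAMES = re.compile(
    r"(?P<plant>catnip|cattails|catmint)"
    r"|(?P<animal>caterpillar|catfish|cat(?!nip|tails|mint|fish|erpillar))"
)


def tag_cat_names(text: str) -> list[tuple[str, str]]:
    """Return (tag, matched text) pairs in input order; other characters are skipped."""
    return [(m.lastgroup, m.group()) for m in CAT_NAMES.finditer(text)]


tagged = tag_cat_names("diadshupcatasbiupfdacattailsasdhuopcatfishasdbhi")
print(tagged)
```

On the test string used further down, this should yield an animal "cat", a plant "cattails" and an animal "catfish", in that order.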

With your syntax, it seems to me that I could get the same result by doing this:

 things: (animal ; plant ; chars)+.
 animal: cat ¬ catname ; catanimal.
 plant: catplant.
 -cat: "cat", char_plus.
 -catname: catplant ; catanimal.
 -catplant: "catnip" ; "cattails" ; "catmint".
 -catanimal: "catfish" ; "caterpillar".
 -catstring: animal ; plant.
 -chars: char_plus ¬ catstring.
 -char_plus: -["a"-"z"]+.

Either way, parsing this input string:

diadshupcatasbiupfdacattailsasdhuopcatfishasdbhi

should give me:

 <things>
  <animal>cat</animal>
  <plant>cattails</plant>
  <animal>catfish</animal>
 </things>

I suspect there are some types of lookahead that would be harder (or impossible) to do this way, because of the constraint that, in a rule of the form

 C: A ¬ B

B can only exclude a match when it covers exactly the same span as A (and therefore C). That meant that I couldn't define the char nonterminal as a single character, as I did in the lookahead example.
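To make the span constraint concrete, here is a small sketch in Python of the set-difference reading described in the quoted exchange below. It is only a model under stated assumptions: regular expressions stand in for the nonterminals, and difference_matches is a hypothetical helper name, not anything from an ixml processor.

```python
import re


def difference_matches(a: str, b: str, span: str) -> bool:
    """Model `C: A ¬ B`: the whole span matches `a` and the whole span does not match `b`."""
    return re.fullmatch(a, span) is not None and re.fullmatch(b, span) is None


# The A and B of the quoted fragment, written as regular expressions.
A = r"[a-z]*"
B = r"cat|bat|rat"
results = {s: difference_matches(A, B, s) for s in ["caterpillar", "cat", ""]}

# A single-character A can never cover the same span as a multi-character B,
# so subtracting the cat names from it excludes nothing.
SINGLE_CHAR = r"[a-z]"
CAT_NAMES = r"cat|catnip|cattails|catmint|catfish|caterpillar"
c_survives = difference_matches(SINGLE_CHAR, CAT_NAMES, "c")

print(results, c_survives)
```

Under this model "caterpillar" and the empty string are accepted and "cat" is rejected, while a lone "c" always survives the subtraction, which is why a one-character char cannot be protected this way.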

BTW

___________________________________________________ 
Dr. Bethan Tovey-Walsh 

linguacelta.com

Editor, geirfan.cymru

You are welcome to write to me in Welsh.

> On 2 Mar 2026, at 16:16, John Lumley <john@saxonica.com> wrote:
> 
> On 02/03/2026 16:09, Bethan Tovey-Walsh wrote:
>> Let's take this grammar fragment:
>> 
>> A: ["a"-"z"]*.
>> B: "cat" ; "bat" ; "rat".
>> C: A ¬ B.
>> 
>> I think I understand your view of the semantics to be this:
>> 
>> C is an A, unless the entirety of C also matches B, in which case it is a B
> No - it's effectively:
> C is an A, unless the entirety of C also matches B, in which case it is not a C
>> So if we had the input "caterpillar", we'd get:
>> 
>> <C>
>> <A>caterpillar</A>
>> </C>
>> 
> Yes
>> 
>> and if we had "cat", we'd get:
>> 
>> <C>
>> <B>cat</B>
>> </C>
>> 
> No - this would fail, as in the example where element() failed
>> 
>> and if we had "", we'd get:
>> 
>> <C>
>> <A/>
>> </C>
> Yes, as B would not (yet) have succeeded when A did at the end of input.
>> 
>> So, in the rule
>> 
>> C: A ¬ B.
>> 
>> we have something rather like
>> 
>> C: A | B.
>> 
>> in that C, if it matches, can be either an A or a B. The ¬ operator is simply a way to indicate that it cannot be *both* A and B.
> No - it is more like a set-reduction (difference) operator. 
> -- 
> John Lumley MA PhD CEng FIEE
> john@saxonica.com

Received on Monday, 2 March 2026 17:11:56 UTC