Re: Words than are not this word from C. M. Sperberg-McQueen on 2022-09-08 (public-ixml@w3.org from September 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Thu, 08 Sep 2022 17:33:49 -0600
To: Norm Tovey-Walsh <norm@saxonica.com>
Cc: Graydon Saunders <graydonish@gmail.com>, public-ixml@w3.org
Message-ID: <87illxbf9x.fsf@blackmesatech.com>

Norm Tovey-Walsh <norm@saxonica.com> writes:

>> Is there a way to disambiguate this and guarantee that each delete or
>> insert will start a block?
>
> In principle, you could create a rule that matches sequences of
> characters that are neither ‘d’, ‘e’, ‘l’, ‘e’, ‘t’, ‘e’ or ‘i’, ‘n’,
> ‘s’, ‘e’, ‘r’, ‘t’ but in practice I think that’d be much too (too!)
> large a combinatorial explosion.

For two keywords, I think it's doable.  What is required is that 'word'
be any non-empty string of acceptable characters that is not 'delete' or
'insert', right?  I'd suggest something like this:

  { the optional tails let bare prefixes such as 'in' or 'de' count as words }
  word = ~['di'; Zs], [L;P;Nd;Sc]*
       ; 'd', (~['e'; Zs], [L;P;Nd;Sc]*)?
       ; 'de', (~['l'; Zs], [L;P;Nd;Sc]*)?
       ; 'del', (~['e'; Zs], [L;P;Nd;Sc]*)?
       ; 'dele', (~['t'; Zs], [L;P;Nd;Sc]*)?
       ; 'delet', (~['e'; Zs], [L;P;Nd;Sc]*)?
       ; 'delete', [L;P;Nd;Sc]+
       ; 'i', (~['n'; Zs], [L;P;Nd;Sc]*)?
       ; 'in', (~['s'; Zs], [L;P;Nd;Sc]*)?
       ; 'ins', (~['e'; Zs], [L;P;Nd;Sc]*)?
       ; 'inse', (~['r'; Zs], [L;P;Nd;Sc]*)?
       ; 'inser', (~['t'; Zs], [L;P;Nd;Sc]*)?
       ; 'insert', [L;P;Nd;Sc]+
       .

If there is a real likelihood that the exclusions will match characters
that should not be part of a word, then each 'word' element in the
output can be rescanned to make sure it's OK; otherwise, you may be able
to spare yourself the re-scanning.
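
A minimal sketch of such a rescan, written in Python rather than ixml and
assuming the text of each 'word' element has already been pulled out of the
parser's XML output. The names KEYWORDS, is_word_char and is_valid_word are
hypothetical; the character classes are the ones used in the rule above
(L, P, Nd, Sc), checked here with the standard unicodedata module.

```python
import unicodedata

# The two keywords that must never be accepted as a plain word.
KEYWORDS = {"delete", "insert"}


def is_word_char(ch: str) -> bool:
    """True if ch falls in one of the classes L, P, Nd or Sc."""
    cat = unicodedata.category(ch)
    return cat[0] in "LP" or cat in ("Nd", "Sc")


def is_valid_word(text: str) -> bool:
    """Non-empty, made only of word characters, and not a keyword."""
    return (
        bool(text)
        and text not in KEYWORDS
        and all(is_word_char(ch) for ch in text)
    )
```

The case this is meant to catch is a character admitted by one of the
exclusions but not by the inclusion set: ~['e'; Zs] rules out only 'e' and
space separators, so a tab after 'd' would pass the grammar, while
is_valid_word("d\tx") returns False.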

On another note, I would make quoted strings a grammatical unit, to
avoid the risk of recognizing keywords within them.
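
To illustrate why that helps, here is a rough sketch in Python (not ixml) of
a tokenizer that consumes a quoted string as one unit before it looks for
keywords. The quoting convention is an assumption made for the example
(double quotes, no escapes), and TOKEN, KEYWORDS and tokenize are
hypothetical names; the real input format may differ.

```python
import re

# Assumed convention: a string is "..." with no embedded or escaped quotes.
# The quoted alternative is tried first, so its contents are never split.
TOKEN = re.compile(r'(?P<string>"[^"]*")|(?P<bare>\S+)')
KEYWORDS = {"delete", "insert"}


def tokenize(text):
    """Return (kind, token) pairs; kind is 'string', 'keyword' or 'word'."""
    tokens = []
    for match in TOKEN.finditer(text):
        token = match.group()
        if match.lastgroup == "string":
            kind = "string"
        elif token in KEYWORDS:
            kind = "keyword"
        else:
            kind = "word"
        tokens.append((kind, token))
    return tokens
```

With this, tokenize('delete "do not delete this" now') reports one keyword,
one string and one word; the 'delete' inside the quotes is never examined as
a keyword.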

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Thursday, 8 September 2022 23:47:13 UTC