Re: Declaring 'reserved words' from C. M. Sperberg-McQueen on 2022-06-27 (public-ixml@w3.org from June 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Mon, 27 Jun 2022 09:37:22 -0600
To: John Lumley <john@saxonica.com>
Cc: public-ixml@w3.org
Message-ID: <87h74615jh.fsf@blackmesatech.com>

C. M. Sperberg-McQueen writes:
> ...
>
> An obvious approach that comes to mind would be to allow a grammar
> writer to assign a priority to the different top-level alts of a rule,
> with the meaning "If there is a choice between a parse using alt 1 and a
> parse using alt 2, for the same string, choose alt 1."
>
> ...
>
> But note that this does not completely solve the problem:  "BEGIN" and
> "END" are still accepted as names; they are just marked specially.
>
> So a simple priority scheme is not going to do the trick. Rats.  I had
> hopes for that.

From the point of view of notation, it occurs to me that the
language-subtraction notation used by the XML spec (namely F - G for two
expressions F and G) expresses very neatly what we want:

   { a name is a standard name which is not a keyword or an x-name,
     or else it is an x-name. }
   name = -standard-name - (keyword; x-name)
        ; x-name
        .

   { standard-name, x-name, and keyword are all straightforward. }
   standard-name = ALPHA, (ALPHA; DIGIT; "-")*.
   x-name = ["Xx"], "-", standard-name.
   keyword = "BEGIN"; "END".

If E = F - G, then if I remember my lesssons correctly:

  - If F and G are regular, E is regular (since regular languages are
    closed under subtraction).

  - If F is context-free and G is regular, E is context-free.

  - If F and G are context-free, E is not guaranteed context-free.

That last one is a killer: it would move ixml out of the context-free
languages business and into a broader field where some very cold winds
may blow.

And I suspect that ANY construct with the meaning "anything that matches
A but does not match B" is going to be problematic for the same reason.
Either you place restrictions on what B can represent, or you are not
guaranteed to be describing a context-free language.

Operationally, of course, it's simple to describe how a processor could
handle expressions like F - G:

  1 Try to recognize an F starting at the current position.
  
  2 If step 1 succeeds, try to recognize a G starting and ending at the
    same positions as the F just recognized.

    Otherwise (step 1 failed) nothing matches (F - G) at the current
    location.
    
  3 If step 2 succeeds, then we have both an F and a G covering the same
    section of the input.  That section of the input does not match (F -
    G).

    Otherwise (step 2 fails),  we have an F but not a G for that section
    of the input.  That section matches the expression (F - G).

I wonder whether a PEG parser could just rewrite the expression as (G,
fail; F) and be happy.

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Monday, 27 June 2022 15:37:39 UTC