Re: Declaring 'reserved words' from C. M. Sperberg-McQueen on 2022-06-27 (public-ixml@w3.org from June 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Mon, 27 Jun 2022 08:46:33 -0600
To: John Lumley <john@saxonica.com>
Cc: public-ixml@w3.org
Message-ID: <87k09217w6.fsf@blackmesatech.com>
John Lumley writes:

> In working through the ixml grammar for XPath I of course have
> potential ambiguity between 'element()' as a node kind test and
> 'element()' as a function call. In the XPath spec 'element' (and the
> like) is a reserved name for function calls. I /think/ we cannot
> express such a reservation in iXML, but am not entirely sure, and I'll
> have to live with the ambiguity (and its concomitant ambiguity
> complexity).

I believe that that is true.  The same issue arises in the Oberon
grammar we just added to the samples directory, for the keywords NIL,
TRUE, and FALSE.  (It seems not to arise for other keywords since they
cannot appear in places where a variable might appear.)

And for that matter it also arises in the vCard grammar Dave Pawson and
I just wrote, for which see Dave's inquiry on the xsl-list [1] and the
ensuing thread; to cut to the chase and see the grammar, go to [2].

[1] https://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/202206/msg00098.html
[2] https://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/202206/msg00115.html

One point of potential interest in that grammar is that it writes around
the ambiguity by defining a 'name' as a sequence of letters, digits, or
hyphens which (a) begins with a letter, (b) does not begin with 'X-' or
'x-', (c) does not begin with the string "BEGIN", and (d) does not begin
with the string "END".

                         { In principle name could be very simple.
                           But we want to distinguish normal names
                           from x-names, and we want to ensure that
                           BEGIN and END are not recognized as names
                           but as keywords. So we have a more complicated
                           definition. }
                 @name = not-an-x-name
                       | not-begin
                       | not-end
                       | normal-name
                       | x-name
                       .

                         { not-an-x-name, though it begins with X }
        -not-an-x-name = ["Xx"], (~["-"], (ALPHA | DIGIT | "-")*)?.
	
                         { not-begin, though it begins with B... }
            -not-begin = "BEGI", (~["nN"], (ALPHA | DIGIT | "-")*)?
                       | "BEG", (~["iI"], (ALPHA | DIGIT | "-")*)?
                       | "BE", (~["gG"], (ALPHA | DIGIT | "-")*)?
                       | "B", (~["eE"], (ALPHA | DIGIT | "-")*)?
                       .

                         { not-end, though it begins with E or EN }
              -not-end = ["Ee"], ["Nn"], (~["Dd"], (ALPHA | DIGIT | "-")*)?
                       | ["Ee"], (~["Nn"], (ALPHA | DIGIT | "-")*)?
                       .

                         { normal-name: does not look like x-name, 
                           begin, or end at any point }
          -normal-name = ~["XxBbEe"], (ALPHA | DIGIT | "-")*.
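
For comparison, conditions (a) through (d) fit in one line in a regular
expression notation that has negative lookahead, which ixml does not.
The following Python sketch is only a way to check which strings the
rules above are meant to accept. It assumes that ALPHA and DIGIT are the
ASCII letters and digits and that BEGIN and END are matched
case-sensitively; the function name is_name is made up for illustration.

```python
import re

# Assumptions: ASCII letters and digits only; BEGIN and END are
# case-sensitive; "X-" and "x-" both mark extension names.
NAME = re.compile(r"(?![Xx]-)(?!BEGIN)(?!END)[A-Za-z][A-Za-z0-9-]*")


def is_name(s: str) -> bool:
    """True if s satisfies conditions (a) through (d) described above."""
    return NAME.fullmatch(s) is not None


for candidate in ["FN", "BEG", "BEGIN", "END", "X-FOO", "Xavier"]:
    print(candidate, is_name(candidate))
```

The three lookaheads do the work that not-an-x-name, not-begin, and
not-end do by enumeration in the grammar.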



> Any enlightenment would be appreciated

First, contemplate the sound of five regular expressions clapping.

Since regular languages are closed under set difference, it must be
theoretically possible to define a nonterminal that recognizes anything
in a particular regular set with the exception of reserved words.  In
the case of the vCard grammar, the task was simple enough to do by hand,
but the pattern is regular enough that I suppose it might be automatable.
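
As a rough illustration of what such automation might look like, here is
a Python sketch that builds a lookahead-free regular expression for
"names that begin with none of the given words", following the same
prefix-by-prefix pattern as the hand-written rules. Everything named
here (names_without_prefixes, char_class, the ASCII-only alphabets) is
an assumption made for the sketch, and it only handles words that start
with distinct letters and are matched case-sensitively.

```python
import re
import string

LETTERS = string.ascii_letters
NAME_CHARS = string.ascii_letters + string.digits + "-"


def char_class(chars):
    """Regex character class matching exactly the given characters."""
    return "[" + "".join(re.escape(c) for c in sorted(set(chars))) + "]"


def names_without_prefixes(words):
    """Compile a lookahead-free regex for names starting with none of `words`.

    Assumes each word consists of name characters, starts with a letter,
    and that no two words share a first letter.
    """
    tail = char_class(NAME_CHARS) + "*"
    first_letters = {w[0] for w in words}
    # Names whose first letter rules out every word immediately.
    alts = [char_class(set(LETTERS) - first_letters) + tail]
    for w in words:
        # Names that share a proper prefix with w and then stop or diverge.
        for i in range(1, len(w)):
            diverge = char_class(set(NAME_CHARS) - {w[i]})
            alts.append(re.escape(w[:i]) + "(?:" + diverge + tail + ")?")
    return re.compile("|".join(alts))


pattern = names_without_prefixes(["BEGIN", "END", "X-", "x-"])
for s in ["FN", "BEG", "BEGIN", "BEGINNER", "X-FOO", "Xavier"]:
    print(s, bool(pattern.fullmatch(s)))
```

Each word of length n contributes n-1 alternatives, so the result grows
quickly with the number and length of reserved words, which is the
reason few people will want to read the output.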

Even if it's automatable, however, very few people are going to be
willing to contemplate either the task or the result.

So I think it would be nice to find a way to handle grammars with
reserved words or ambiguities of the element() / element() sort.

An obvious approach that comes to mind would be to allow a grammar
writer to assign a priority to the different top-level alts of a rule,
with the meaning "If there is a choice between a parse using alt 1 and a
parse using alt 2, for the same string, choose alt 1."

So instead of the definitions above, 'name' could be defined using
priorities to prefer reserved words and extensions over ordinary names:

                 @name = {10} reserved-word
                       | {5} extension
                       | {1} ALPHA, (ALPHA|DIGIT|"-")+.
             extension = ["xX"], "-", (ALPHA|DIGIT|"-")+.
         reserved-word = "BEGIN"; "END".

But note that this does not completely solve the problem:  "BEGIN" and
"END" are still accepted as names; they are just marked specially.
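
The point can be seen in a small Python model of the prioritized rule.
This is a hypothetical sketch, not a proposal for ixml syntax or
semantics: the three alternatives are approximated by regular
expressions, and parse_name (an invented name) returns the label of the
highest-priority alternative that matches the whole string.

```python
import re

# Hypothetical model of the prioritized alternatives:
# (priority, label, pattern). The label "plain" is made up here.
ALTS = [
    (10, "reserved-word", re.compile(r"BEGIN|END")),
    (5, "extension", re.compile(r"[xX]-[A-Za-z0-9-]+")),
    (1, "plain", re.compile(r"[A-Za-z][A-Za-z0-9-]+")),
]


def parse_name(s):
    """Label of the highest-priority alternative matching s, else None."""
    matches = [(prio, label) for prio, label, pat in ALTS if pat.fullmatch(s)]
    return max(matches)[1] if matches else None


for s in ["FN", "X-FOO", "BEGIN", "END"]:
    print(s, parse_name(s))
```

Priorities only choose among the parses that exist; parse_name("BEGIN")
comes back labelled reserved-word rather than None, so the string is
still recognized as a name.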

So a simple priority scheme is not going to do the trick. Rats.  I had
hopes for that.

(What was that noise?  It sounded like five regular expressions falling
on the floor in a heap.)

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Monday, 27 June 2022 14:46:58 UTC