XML Query: lexical analysis from Michael Dyck on 2001-06-17 (www-xml-query-comments@w3.org from June 2001)

From: Michael Dyck <MichaelDyck@home.com>
Date: Sun, 17 Jun 2001 12:55:38 -0700
To: www-xml-query-comments@w3.org
Message-ID: <3B2D0B3A.7647D47E@home.com>

XQuery 1.0: An XML Query Language
W3C Working Draft 07 June 2001

Lexical analysis of xqueries seems fraught with problems now. Basically, the
"lexical grammar" is ambiguous.

(1)
Keywords are a subset of NCName, which is a subset of QName. For example,
consider these three QueryModules:
    (a) for $x in //x return $x
    (b) namespace for = "http://www.example.com/whatever"
    (c) //for
In each case, the three letters "for" constitute a token, but in (a) it's a
keyword, in (b) it's an NCName, and in (c) it's a QName. So a would-be
tokenizer doesn't know what type of token it's got.
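
To make the problem concrete, here is a minimal Python sketch. It is not
taken from the draft. The token regex and the "look at the previous token"
rule are assumptions made up for illustration, and the rule only happens to
work for the three examples above. The point is that the raw scanner gives
the same string "for" in all three cases, and only the surrounding context
separates them.

```python
import re

# Hypothetical, simplified token shapes: just enough for examples (a)-(c).
TOKEN_RE = re.compile(r'\s*(//|/|=|\$|"[^"]*"|[A-Za-z_][\w.-]*)')

def raw_tokens(text):
    """Split text into raw token strings, with no classification."""
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            break  # nothing recognizable remains (e.g. trailing whitespace)
        out.append(m.group(1))
        pos = m.end()
    return out

def classify_for(tokens):
    """Classify the first 'for' by peeking at the previous token.

    This ad hoc rule is an assumption for illustration only; a real
    XQuery tokenizer would need far more context than this.
    """
    i = tokens.index("for")
    prev = tokens[i - 1] if i > 0 else None
    if prev == "namespace":
        return "NCName"
    if prev in ("/", "//"):
        return "QName"
    return "keyword"

examples = [
    'for $x in //x return $x',
    'namespace for = "http://www.example.com/whatever"',
    '//for',
]
for query in examples:
    print(query, "->", classify_for(raw_tokens(query)))
```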

(2)
StringLiteral and AttributeValue generate (pretty much) the same set of
strings. For instance, consider these occurrences of "foo":
    (a) / = "foo"
    (b) <e a="foo" />
In (a) it's a StringLiteral; in (b) it's an AttributeValue. But things are
even worse, because StringLiteral is a terminal, whereas AttributeValue is a
non-terminal. So in (a), the 5 characters "foo" constitute a token, but in
(b) they constitute an AttributeValue containing 3 AttributeValueContents,
each of which is a Char.

For a worse example of this, consider:
    (c) / = "{ foo }"
    (d) <e a="{ foo }" />
In (c) it's a StringLiteral denoting a 7-character string. In (d) it's an
AttributeValue containing a single AttributeValueContent, which is an
EnclosedExpr, which contains (ultimately) the QName 'foo'. (Note that the
two space characters are discardable whitespace in (d), but not in (c).)
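
The two readings can be sketched in Python as follows. Both function names
are hypothetical, and the AttributeValue scanner is deliberately naive: it
assumes no nested braces and no escapes. It only shows that the same quoted
text yields different structures depending on which rule is applied.

```python
def scan_string_literal(src):
    """Treat the quoted text as one StringLiteral token."""
    assert src[0] == '"' and src[-1] == '"'
    return [("StringLiteral", src[1:-1])]

def scan_attribute_value(src):
    """Treat the quoted text as an AttributeValue: a sequence of
    AttributeValueContents, each either a Char or an EnclosedExpr.

    Simplifying assumption: braces never nest and are never escaped.
    """
    assert src[0] == '"' and src[-1] == '"'
    body = src[1:-1]
    parts, i = [], 0
    while i < len(body):
        if body[i] == "{":
            end = body.index("}", i)
            # Whitespace inside the enclosed expression is discardable.
            parts.append(("EnclosedExpr", body[i + 1:end].strip()))
            i = end + 1
        else:
            parts.append(("Char", body[i]))
            i += 1
    return parts

print(scan_string_literal('"{ foo }"'))   # one token, 7-character value
print(scan_attribute_value('"{ foo }"'))  # one EnclosedExpr holding foo
print(scan_attribute_value('"foo"'))      # three Chars
```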

What is a would-be tokenizer to do? It seems that lexical analysis of XQuery
requires contextual feedback from the parser, which must be running in
parallel. This is an unwelcome complication, and one that is not supported
by all parsing software.
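
For what it's worth, the kind of feedback I mean might look like the Python
sketch below. Everything in it is an assumption for illustration: the Lexer
class, the "expected" parameter, and the tiny keyword set are not from the
draft. The parser tells the lexer which token kinds are acceptable at the
current point, and the lexer classifies the same characters accordingly.

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.-]*")
KEYWORDS = {"for", "in", "return", "namespace"}  # illustrative subset only

class Lexer:
    """Hypothetical parser-driven lexer for name-like tokens only."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_token(self, expected):
        """Return (kind, text); `expected` is the set of token kinds the
        parser can accept at this point in the grammar."""
        # Skip whitespace between tokens.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        m = NAME.match(self.text, self.pos)
        if m is None:
            raise SyntaxError("no name-like token at position %d" % self.pos)
        self.pos = m.end()
        word = m.group()
        # Prefer the keyword reading only when the parser allows a keyword.
        if "keyword" in expected and word in KEYWORDS:
            return ("keyword", word)
        for kind in ("NCName", "QName"):
            if kind in expected:
                return (kind, word)
        raise SyntaxError("unexpected token %r" % word)

# The same three letters, classified three ways by parser context:
print(Lexer("for").next_token({"keyword", "QName"}))
print(Lexer("for").next_token({"NCName"}))
print(Lexer("for").next_token({"QName"}))
```

A tokenizer that runs as an independent pass ahead of the parser has no such
"expected" set to consult, which is exactly the difficulty.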

-Michael Dyck

Received on Sunday, 17 June 2001 16:02:28 UTC