XQuery lexical stuff

XQuery 1.0: An XML Query Language
W3C Working Draft 20 December 2001

Here are some comments on A.3 Lexical structure.

------------------------------------------------------------------------

para 1:
"Whitespace may be freely added within patterns"
    What do you mean by "patterns"?  Presumably, you're either talking
    about adding the symbol 'Whitespace' to grammar productions, or
    adding whitespace (i.e., sequences of characters) to queries. Don't
    confuse the two.

"before or after any token"
    But you never actually define what a token is. It's not even clear
    what the set of token-types is. (Is it the set of left-hand-sides of
    productions 75 through 216? Is it the set of symbols that appear in
    the "tokens" column of the TRANSITION STATES table? The two are
    different, and both contain symbols that probably shouldn't be
    considered token-types.)

para 1 and bullets 1 and 2:
    Note that the Whitespace symbol derives the empty string, but
    phrases like "must always be followed by whitespace" and
    "whitespace may not occur" obviously mean "whitespace" in the sense
    of "a non-empty string of whitespace-characters".  I think the
    non-empty sense is the correct usage, so there's no reason for
    Whitespace to be nullable. (That is, it should be the same as S.)

bullet 3:
"A space"
    We're interested in whitespace, not just a space.

"may be significant"
    Don't tell us that it *may* be significant. Tell us exactly when it
    *is* significant.

para 2:
"Tokens may be often only recognized"
    "may be often only" is clunky.

"in a specific state"
    You haven't defined states yet.

"within the evaluation":
    Does evaluation of a query include its parsing/lexing?

"may cause the grammar to transition to a different state"
    Grammars don't have states or transitions. Automata do.

"following the enumeration of tokens"
    Change "tokens" to "token-types".

para 3:
"When tokenizing, the longest possible token is always returned"
    Issue 109 says this means "the longest sequence that would form a
    token in the token-space of the grammar, not the longest that would
    be valid in the current syntactic context."  Does it?

"If there is an ambiguity between two tokens, ..."
    Presumably, you mean an ambiguity that isn't resolved by the
    longest-match rule.

"the token that an lower grammar number"
    Change "an" to "a".

"is more specific than"
    Why do we care which is "more specific"? We want to know which is
    the right one. I'll assume that's what you mean.

I'm very suspicious of this kind of blanket rule. You're liable to shoot
yourself in the foot. In fact, I believe you have:
 -- S [78] precedes Whitespace [210], so WhitespaceChar+ will always
    tokenize as S rather than Whitespace.
 -- Nmstart [123] precedes Nmchar [124], Char [208], Letter [211], and
    BaseChar [212], so [a-zA-Z_] will always tokenize as Nmstart.
 -- Digits [166] precedes HexDigits [193], so [0-9]+ will always
    tokenize as Digits.
 -- Char [208] precedes WhitespaceChar [209], so #x9, #xA, #xD, and #x20
    will always tokenize as Char rather than WhitespaceChar.
(Personally, I'm not sure I'd consider any of these to be token types,
but they're all in the "tokens" column of the Transition States table.)

In fact, here are the only cases I could find where this rule makes a
sensible ruling:
    ELEMENT_CONTENT:        prefer Lbrace and StartTagOpen over Char.
    APOS_ATTRIBUTE_CONTENT: prefer Lbrace and CloseApos over Char.
    QUOT_ATTRIBUTE_CONTENT: prefer Lbrace and CloseQuot over Char.
    DEFAULT:                prefer reserved word over QName.
Of these, the first three could be handled explicitly in the grammar
(e.g., in the definition of ElementContent, replace Char with (a symbol
denoting) [^{<]).  The last could be handled with a specific rule.
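
To illustrate the shadowing problem, here is a minimal Python sketch of
a tokenizer that applies the longest-match rule and then breaks ties by
lowest production number. The production numbers are the draft's, but
the regexes are my own approximations of [78], [166], [193], and [210],
and next_token is a hypothetical helper, not anything from the draft.

```python
import re

# (production number, name, regex). Numbers are the draft's; the regexes
# are simplified approximations written for this illustration.
PRODUCTIONS = [
    (78, "S", re.compile(r"[ \t\r\n]+")),
    (166, "Digits", re.compile(r"[0-9]+")),
    (193, "HexDigits", re.compile(r"[0-9a-fA-F]+")),
    (210, "Whitespace", re.compile(r"[ \t\r\n]*")),
]


def next_token(text, pos=0):
    """Pick the longest match; on a tie, the lowest production number."""
    candidates = []
    for number, name, pattern in PRODUCTIONS:
        m = pattern.match(text, pos)
        if m is not None:
            # Sort key: longer lexemes first, then smaller production number.
            candidates.append((-len(m.group()), number, name, m.group()))
    if not candidates:
        return None
    _, _, name, lexeme = min(candidates)
    return name, lexeme
```

Under these assumptions, a run of whitespace characters always comes
back as S and a run of decimal digits always comes back as Digits.
HexDigits wins only when the input contains a letter a-f (as in
"12ab"), and Whitespace wins only when it matches the empty string.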

para 5:
"ExprComment tokens should be ignored by the parser."
    Ignored in what sense? Consider
        foo{-- comment --}bar
    If we completely ignore the comment, does the parser see
        foobar
    or does the comment function as whitespace? If the latter, then
    I suggest the following changes:
        [78]  S          ::= (WhitespaceChar | ExprComment)+
        [210] Whitespace ::= (WhitespaceChar | ExprComment)*

--------------------------
TERMINALS

Where are productions [73] and [74]?

Productions that are not used elsewhere in the grammar:
     [77] ExprComment   (appears in transition table & body text)
    [106] Before
    [107] After
    [121] Ref
    [126] ColonStar
    [176] SemiColon
    [177] Colon
    [210] Whitespace    (appears in transition table & body text)

[77] ExprComment
    This seems like a poor name for the symbol, given that it's not a
    kind of expression.

    By the way, why did you drop single-line comments (# to line-end)?

[87] DefineFunction:
    Is DefineFunction a token type? If the parser finds text matching
    the DefineFunction production, is that one token, or two? (Or three,
    if we count the intervening whitespace as well?)  I think you'd be
    better off with the conventional view, that 'define' and 'function'
    are two separate tokens. So I suggest the following changes:
        [70] FunctionDefn ::= Define Function QName ...
        [87] Define       ::= "define"
    (Note that
        [115] Function    ::= "function"
    already exists.)

Similarly for:
    [81] AxisChild
    [82] AxisDescendant
    [83] AxisParent
    [84] AxisAttribute
    [85] AxisSelf
    [86] AxisDescendantOrSelf
    [112] Instanceof
    [119] ElementOfType
    [163] CastAs
    [164] AssertAs
    [165] TreatAs

[169] DoubleLiteral
    Change ([e]|[E]) to [eE].
    Change ([+]|[-]) to [+-].

[189] QName
    The production allows tokens such as ':foo:bar', whereas 2.1 only
    says that an initial colon is allowed on "unprefixed QNames".

[193] HexDigits
    Change ([0-9]|[a-f]|[A-F]) to [0-9a-fA-F].

[200] ValueIndicator
    Why introduce ValueIndicator? Why not just have AttributeList [64]
    use Equals [131]?

[208] Char
    The outermost parentheses are unnecessary. Similarly for:
    [209] WhitespaceChar
    [212] BaseChar
    [213] Ideographic
    [214] CombiningChar
    [215] Digit
    [216] Extender

[210] Whitespace
    It's pointless to have a token type that derives the empty string,
    because the "longest possible token" rule guarantees that (in
    well-formed queries) the returned token will never be empty.  So you
    might as well change the '*' to a '+', in which case you have the
    same right-hand-side as S [78], so you should probably merge the two
    productions/symbols/concepts.
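
The character-class rewrites suggested above for DoubleLiteral [169] and
HexDigits [193] are meant to be pure simplifications. A quick Python
check, assuming the draft's bracket notation behaves like the
corresponding Python regex syntax (accepts_same_chars is a hypothetical
helper):

```python
import re

# (verbose form from the draft, suggested compact form)
PAIRS = [
    (r"([e]|[E])", r"[eE]"),
    (r"([+]|[-])", r"[+-]"),
    (r"([0-9]|[a-f]|[A-F])", r"[0-9a-fA-F]"),
]

ASCII = [chr(i) for i in range(128)]


def accepts_same_chars(verbose, compact):
    """True if both patterns accept exactly the same single ASCII chars."""
    return all(
        bool(re.fullmatch(verbose, c)) == bool(re.fullmatch(compact, c))
        for c in ASCII
    )
```

Each pair in PAIRS should compare equal under accepts_same_chars, so the
shorter forms lose nothing.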

--------------------------

A.3.1 Lexical States

para 5:
"To allow curly braces to be used as character content, a double left
or right curly brace is interpreted as a single curly brace character."
    An alternative would be to use character references (e.g., &#x7b;
    and &#x7d;).

para 7:
"An operator that immediately follows a "/" or "//" when used as a root
symbol, should not parse."
    Why not? You don't give any reason.
    The example given, / * foo, is unambiguous:

            MultiplicativeExpr
                     |
            +--------+-------+
            |        |       |
          Expr   Multiply  Expr
            |        |       |
        PathExpr     |    PathExpr
            |        |       |
            /        *      foo

    so what is the point of disallowing this parse?

    At first, I thought this was a bad attempt to convey a lexing rule
    that would allow the lexer to easily decide (in this particular
    context) whether '*' is a Star or a Multiply:
            After a Slash (and optional whitespace),
            a '*' is a Star, not a Multiply.

    But now I think this para was added to deal with a grammatical
    ambiguity that occurs if the lexer *doesn't* distinguish between
    Star and Multiply. Consider the query:
        / * / foo 

    It can be derived in two different ways:

                   Expr
                    |
           MultiplicativeExpr
                    |
           +--------+-------+
           |        |       |
          Expr      |      Expr
           |        |       |
        PathExpr    |    PathExpr
           |        |       |
           /        *      /foo
    and:
                 Expr
                  |
          AbsolutePathExpr
                  |
          +-------+------+
          |              |
        Slash    RelativePathExpr
          |              |
          |      +-------+------+
          |      |       |      |
          |   StepExpr Slash StepExpr
          |      |       |      |
          /      *       /     foo
      
    Because the latter derivation is more likely to be the intended one,
    I think you're proposing to disallow the first via the rule:

        A PathExpr that consists of just a Slash is not allowed as the
        left operand of a MultiplicativeExpr in which the operator is an
        asterisk.

    which leads to the following parsing rule:

        After recognizing the initial Slash of an AbsolutePathExpr,
        if the next token is an asterisk, assume that it is a Wildcard
        that continues the AbsolutePathExpr, rather than the operator of
        a MultiplicativeExpr.

    The paragraph in question could be a lot clearer about what you're
    doing and why.
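
    For what it's worth, the lexer-level version of the rule is easy to
    state in code. The Python sketch below is my guess at the intent,
    not something the draft specifies: the token pattern covers only
    slashes, asterisks, and simple names, and classify() is a
    hypothetical helper. An asterisk is labelled Star when the previous
    token is a Slash, and Multiply otherwise.

```python
import re

# Only the token kinds needed for the example; everything else is omitted.
TOKEN = re.compile(
    r"\s*(?:(?P<Slash>//?)|(?P<Asterisk>\*)|(?P<Name>[A-Za-z_]\w*))"
)


def classify(query):
    """Return token kinds, resolving each '*' to Star or Multiply."""
    kinds, pos, prev = [], 0, None
    query = query.rstrip()
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind = m.lastgroup
        if kind == "Asterisk":
            # After a Slash (and optional whitespace), '*' is a wildcard.
            kind = "Star" if prev == "Slash" else "Multiply"
        kinds.append(kind)
        prev, pos = kind, m.end()
    return kinds
```

    Here classify("/ * / foo") labels the asterisk as Star, which
    selects the AbsolutePathExpr derivation, while classify("a * b")
    labels it as Multiply.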

--------------------------
TRANSITION STATES

The phrase "transition states" is odd. I think "transition table" or
"transition function" would make more sense.

Of 142 symbols defined under TERMINALS, only 40 appear in the "tokens"
column. What about all the others?

What does the first row mean? There's no state listed under "recognize
state". I think you mean it to be DEFAULT.

Do you really mean to allow ExprComment in ELEMENT_CONTENT? It seems
kind of like allowing a comment in a string literal.

I suspect it's a mistake to recognize:
    (1) CdataSectionEnd in ELEMENT_CONTENT,
    (2) StartTagOpen    in QUOT_ATTRIBUTE_CONTENT,
    (3) StartTagOpen    in APOS_ATTRIBUTE_CONTENT,
    (4) Lbrace          in END_TAG, and
    (5) Lbrace          in DEFAULT.

I think it's a bad idea to introduce lexical states and the transition
table. As an analogy, if you want to define the syntax of a
NumericLiteral, you don't write down the transition table for a DFA
that recognizes NumericLiterals, you write an equivalent regular
expression. Here, you're defining a push-down automaton, where you
should be writing an equivalent context-free grammar. In each case,
people find the latter much easier to read.  I've attached what I
believe is an equivalent CFG. (Note that this renders the above mistakes
more obvious.)

But this raises the question of why you need two CFGs to define the
language. And the answer is, you don't. The second is superfluous. Once
you eliminate its mistakes, it tells you nothing that you couldn't
deduce from the first.

So you might ask, "How are we supposed to define the tokenization of
XQuery then?" To which I would respond, "Why do you think you have to?"
Note that the XML spec doesn't define a tokenization for XML. It simply
gives a complete character-level grammar for the language. (That is, a
grammar whose terminal symbols are individual characters.) I don't
think this would be too hard to do for XQuery. You've already got
pretty much all the productions you'd need -- you'd just have to tweak
some of them a little.
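
To make the earlier DFA-versus-regular-expression analogy concrete, here
is a small Python sketch that defines one language both ways. The
literal syntax (digits, optionally followed by a dot and more digits) is
a simplification I chose for illustration, not the draft's
NumericLiteral, and the state names are made up.

```python
import re

# The declarative definition: one line.
NUMERIC = re.compile(r"[0-9]+(\.[0-9]+)?")

# The same language as an explicit DFA: state -> {input class -> state}.
DFA = {
    "start": {"digit": "int"},
    "int": {"digit": "int", "dot": "dot"},
    "dot": {"digit": "frac"},
    "frac": {"digit": "frac"},
}
ACCEPTING = {"int", "frac"}


def char_class(c):
    """Map a character to the input class used by the DFA table."""
    if "0" <= c <= "9":
        return "digit"
    if c == ".":
        return "dot"
    return "other"


def dfa_accepts(text):
    """Run the DFA over text; a missing transition means rejection."""
    state = "start"
    for c in text:
        state = DFA[state].get(char_class(c))
        if state is None:
            return False
    return state in ACCEPTING
```

Both definitions accept the same strings, but I'd expect most readers to
grasp the regular expression at a glance and to have to trace the table.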

-Michael Dyck

------------------------------------------------------------------------
            A grammar that generates the same language
            as the push-down automaton defined in A.3.1

DEFAULT ::=
    braced_content* Rbrace

braced_content ::=
      WhitespaceChar
    | Nmstart
    | NCName
    | Nmchar
    | Digits
    | Letter
    | BaseChar
    | Ideographic
    | CombiningChar
    | Digit
    | Extender
    | HexDigits
    | Whitespace
    | S
    | ExprComment
    | xml_comment
    | pi
    | tagged_thing
    | braced_thing            # I suspect a mistake. (5)

xml_comment ::=
    XmlCommentStart Char* XmlCommentEnd

pi ::=
    ProcessingInstructionStart (PITarget|Char)* ProcessingInstructionEnd

tagged_thing ::=
    StartTagOpen
    start_tag_content*
    ( EmptyTagClose
    | StartTagClose
      element_content*
      EndTagOpen
      ( TagQName
      | braced_thing          # I suspect a mistake. (4)
      )*
      EndTagClose
    )

start_tag_content ::=
      TagQName
    | ValueIndicator
    | OpenQuot quot_apos_content* CloseQuot
    | OpenApos quot_apos_content* CloseApos
    | braced_thing

quot_apos_content ::=
      Char
    | CharRef
    | PredefinedEntityRef
    | LCurlyBraceEscape
    | RCurlyBraceEscape 
    | tagged_thing            # I suspect a mistake. (2+3)
    | braced_thing

element_content ::=
      Char
    | CharRef
    | PredefinedEntityRef
    | LCurlyBraceEscape
    | RCurlyBraceEscape
    | ExprComment             # intended?
    | xml_comment
    | pi
    | cdata_section
    | CdataSectionEnd         # I suspect a mistake. (1)
    | tagged_thing
    | braced_thing

cdata_section ::=
    CdataSectionStart Char* CdataSectionEnd

braced_thing ::=
    Lbrace braced_content* Rbrace

------------------------------------------------------------------------

Received on Wednesday, 2 January 2002 02:57:00 UTC