XQuery: A.1 Lexical structure from Michael Dyck on 2002-11-21 (public-qt-comments@w3.org from November 2002)

From: Michael Dyck <jmdyck@ibiblio.org>
Date: Wed, 20 Nov 2002 22:02:09 -0500 (EST)
To: public-qt-comments@w3.org
Message-id: <3DDC4A1C.FEBDD453@ibiblio.org>

XQuery 1.0: An XML Query Language
W3C Working Draft 13 November 2002

(This message deals with the part of A.1 that isn't in its subsections.)

---------------------------------------------------------------------------
para 2:

"A lexical pattern is a rule that describes how a sequence of characters
can match a grammar unit."
    (1) The word "rule" seems inappropriate here. "Expression" would be
        more accurate, but confusing. How about "form"? or "grammatical
        form"?

    (2) It doesn't really describe how a character-sequence can match,
        but rather defines which ones do match.

    (3) What is a "grammar unit"? The term is neither defined nor used
        elsewhere in the spec.
        For instance, the table for the DEFAULT state indicates that
            <"for" "$">
        is a pattern. So what grammar unit does "for $" match? It seems
        that the most you can say is that it matches the <"for" "$">
        pattern in production 45. So a pattern describes how a character-
        sequence matches a pattern?

    I think you'd be better off just deleting this sentence.

"A lexeme is the smallest meaningful unit in the grammar that has syntactic
interpretation."
    The subsequent table used to say that 'p:foo' was one lexeme, but that
    conflicted with the above sentence, because 'p' and 'foo' are smaller
    meaningful units. In response, the table now indicates that 'p:foo' is
    three lexemes. However, that's the wrong response to the conflict,
    because it implies that arbitrary whitespace is allowed around the
    colon, which is presumably not what you intended. Instead, the correct
    response is to remove all mention of "meaningful units" and "syntactic
    interpretation".
    As I said last time:
    - - - - - - - - - - - - - - - - - - - - - 
    I think it's pointless to try to give that kind of definition for
    'lexeme'. You'd be much better off saying something like:
        A lexeme is any sequence of characters derived from one of the
        following symbols: ...

    Better yet, how about this:
        For the grammar presented in A.2, the terminal symbols are:
        (1) all of the quoted strings appearing in the grammar, and
        (2)
            NCName
            QName

            IntegerLiteral
            DecimalLiteral
            DoubleLiteral
            StringLiteral

            S
            EscapeQuot
            URLLiteral
            PITarget
            VarName
            FuncName
            NCNameForPrefix
            PredefinedEntityRef
            CharRef
            EscapeApos
            Char

        [You might say that these terminal symbols are "lexeme symbols" or
        "lexical symbols".]
        A lexeme is any sequence of characters derived from (matching) one
        of these symbols. (It is an 'instance' of that symbol.)
    - - - - - - - - - - - - - - - - - - - - - 
    Upon reflection, I think I'd change that last sentence to:
        Within a [query/expression], a lexeme is any sequence of characters
        that is derived from (matches) any of these terminal symbols other
        than S.
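
    To make the 'p:foo' point concrete, here is a minimal Python sketch of
    the two readings. The NCNAME pattern is a simplified ASCII-only stand-in
    for the real production, and all names in it are hypothetical, chosen
    for illustration only.

```python
import re

# Simplified, ASCII-only approximation of NCName (an assumption; the real
# production admits many more characters).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# Reading 1: QName is a single terminal symbol, so a QName is one lexeme and
# no whitespace can occur inside it.
QNAME_AS_ONE_LEXEME = re.compile(rf"(?:{NCNAME}:)?{NCNAME}")

# Reading 2: prefix, colon, and local part are three separate lexemes, with
# optional whitespace permitted between adjacent lexemes.
QNAME_AS_THREE_LEXEMES = re.compile(rf"(?:{NCNAME}\s*:\s*)?{NCNAME}")


def accepts(pattern, text):
    """Return True if the whole text matches the given pattern."""
    return pattern.fullmatch(text) is not None
```

    Under these assumptions, both readings accept "p:foo", but only the
    three-lexeme reading accepts "p : foo", which is the consequence
    objected to above.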

"A token is a symbol that matches lexemes,"
    A token is not a symbol. Normally, a token is an *instance* of a
    symbol. But here, it doesn't even appear to be that.

"and is the output of the lexical analyzer."
    Why couldn't a lexical analyzer output lexemes?

"A token symbol is the symbolic name given to that token."
    I wouldn't say that we "give names to tokens".  Really, a token symbol
    is any symbol whose instances you choose to call tokens. So the best
    way to define token symbols would be by listing them.

"A single token may be composed of one or more lexemes. If there is more
than one lexeme, they may be separated by whitespace or punctuation."
    Surely not punctuation: if two lexemes are separated by punctuation,
    the punctuation would itself be a lexeme.

"For instance, a token AxisDescendantOrSelf might have two lexemes,
"descendant-or-self" and "::"."
    Or it might not? What might it have then?

    This "might", and the "(for example)" in the "Token Names" column,
    seem to be there because the spec doesn't actually define
    names/types/symbols for a lot of the things it calls tokens.  (e.g.,
    it doesn't define the tokens Or, Equals, or AxisDescendantOrSelf)
    Moreover, the paragraph after the table indicates that even where one
    draws the boundary between tokens is up to the implementation. So
    I think it's pointless for the spec to define or use the terms "token"
    or "token symbol", except when it's discussing possible implementation
    strategies. Everywhere else, I think you could pretty much just replace
    "token" with "lexeme". In some places, "sequence of lexemes" might be
    better. For instance, in the 3rd para of A.1.2, where it says
        "When a given token is recognized"
    you might need to say
        "When a given sequence of lexemes is recognized"
    but you could just say
        "When one of the patterns is matched"

---------------------------------------------------------------------------
table:

"(Prefix ':')? LocalPart"
    Note that no such pattern appears in the spec.

---------------------------------------------------------------------------
para 3:

"For example, an implementation may decide that a token named 'For' ..."
    I think this would be clearer:
        For example, one implementation may define a token [symbol] named
        'For', consisting of only "for". Another implementation may define
        a token 'For' to consist of both "for" and "$".

"In the first case the implementation may decide to use lexical lookahead
to distinguish the "for" lexeme from a QName that has the lexeme "for"."
    It would be clearer to say:
        ... to distinguish the keyword "for" from the QName "for".

"In the second case, the implementation may decide to combine the two
lexemes into a single "long" token."
    Doesn't that just restate what the earlier sentence said?
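
    The two implementation choices in this paragraph can be sketched as
    follows. This is only an illustration in Python: the function names and
    the simplified NAME pattern are assumptions, not anything the spec
    defines.

```python
import re

# Simplified, ASCII-only name pattern (an assumption for this sketch).
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")
# "for" only when a "$" follows, possibly after whitespace (lookahead).
FOR_LOOKAHEAD = re.compile(r"for(?=\s*\$)")
# "for" and "$" consumed together as one "long" token.
FOR_THEN_DOLLAR = re.compile(r"for\s*\$")


def first_token_short(text):
    """Strategy 1: token 'For' is just "for", confirmed by lexical lookahead."""
    m = FOR_LOOKAHEAD.match(text)
    if m:
        return ("For", m.group())
    m = NAME.match(text)
    return ("QName", m.group()) if m else None


def first_token_long(text):
    """Strategy 2: token 'For' covers both lexemes, "for" and "$"."""
    m = FOR_THEN_DOLLAR.match(text)
    if m:
        return ("For", m.group())
    m = NAME.match(text)
    return ("QName", m.group()) if m else None
```

    On "for $x in (1, 2) return $x", the first strategy yields
    ("For", "for") and the second yields ("For", "for $"). On "for/bar",
    both fall through to ("QName", "for").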

---------------------------------------------------------------------------
para 4:

"This grammar implies lexical states"
    No, lexical states are an aspect of a particular implementation
    strategy. The grammar does not imply them.

"normative rules for calculating these states are given in the A.1.2
Lexical Rules section."
    No such rules appear, only the resulting states.

---------------------------------------------------------------------------
para 5:

"When tokenizing, the longest possible match that is valid in the current
lexical state is prefered ."
    (1) Delete the space before the period.

    (2) The word "prefered" is kind of weak. It suggests "preferred but not
        required". Is this intentional?

---------------------------------------------------------------------------
para 6:

"For readability, Whitespace may be used..."
    There is no "Whitespace" symbol any more. There probably should be.
    How about:
        Whitespace ::= ( WhitespaceChar | ExprComment )+
    (This has the benefit of defining more formally how comments fit into
    the language.)
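
    As a rough illustration, the proposed production could be recognized
    like this in Python. The set of whitespace characters and the "{--" and
    "--}" comment delimiters are assumptions made for the sketch, as is the
    treatment of comments as non-nesting.

```python
import re

# Assumed WhitespaceChar: space, tab, carriage return, line feed.
WHITESPACE_CHAR = r"[ \t\r\n]"
# Assumed ExprComment: "{--" ... "--}", non-nested, matched non-greedily.
EXPR_COMMENT = r"\{--.*?--\}"

# Whitespace ::= ( WhitespaceChar | ExprComment )+
WHITESPACE = re.compile(rf"(?:{WHITESPACE_CHAR}|{EXPR_COMMENT})+", re.DOTALL)


def skip_whitespace(text, pos=0):
    """Return the offset of the first character after any Whitespace at pos."""
    m = WHITESPACE.match(text, pos)
    return m.end() if m else pos
```

    With this definition, a comment is consumed wherever whitespace is, so
    skipping Whitespace at the start of "  {-- note --}\n1 + 2" leaves
    "1 + 2".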

"Whitespace may be freely added between lexemes, except a few cases where
whitespace is needed to disambiguate the token."
    (1) In fact, whitespace may be freely *added* in those cases as well;
        what distinguishes those cases is that you can't freely *subtract*
        whitespace. (In particular, you can't remove all whitespace.)
        To address this, you might change:
            Whitespace may be freely added
        to:
            Any amount of whitespace (including none) may appear

    (2) Insert "in" before "a few", I think.

    (3) The phrase "disambiguate the token" is, I believe, a misuse of the
        concept of ambiguity. At any rate, I think it would be plainer and
        more accurate to say that whitespace is needed to prevent two
        adjacent lexemes from being (mis-)recognized as one.

        For instance, consider the character-sequence
            a- b
        Note that there is a space before the 'b'. It thus has only one
        derivation from Query (the "a minus b" one), so there is no
        ambiguity involved, no disambiguation needed. Nevertheless, it is
        still a case in which (I assume) whitespace is needed between 'a'
        and '-' to prevent the longest-match rule from (mis-)recognizing
        'a-' as a name.

    (4) Shouldn't you enumerate those "few cases"?

    (5) Can you confirm that
            10div3
        is a valid query, meaning the same thing as
            10 div 3
        ? (And if not, why not?)

    (6) This only refers to whitespace *between* lexemes, but what about
        before the first lexeme of a query, and after the last? I imagine
        it's allowed there too.

    (7) At the beginning of this sentence, I think you should insert:
            For productions without a "ws" marking
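
    Point (3) above can be demonstrated with a toy longest-match tokenizer.
    This is a sketch in Python, not the spec's lexical tables: the three
    patterns and their names are assumptions, and the Name pattern is a
    simplified ASCII-only approximation that, like NCName, permits "-" after
    the first character.

```python
import re

# Toy pattern set (an assumption for illustration only).
TOKEN_PATTERNS = [
    ("Name", re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")),
    ("Minus", re.compile(r"-")),
    ("Space", re.compile(r"[ \t\r\n]+")),
]


def tokenize(text):
    """Split text into (kind, lexeme) pairs, always taking the longest match."""
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            # Keep the candidate that consumes the most characters.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no pattern matches at offset {pos}")
        name, m = best
        if name != "Space":  # whitespace separates lexemes but is not emitted
            out.append((name, m.group()))
        pos = m.end()
    return out
```

    Here tokenize("a- b") gives [("Name", "a-"), ("Name", "b")], while
    tokenize("a - b") gives [("Name", "a"), ("Minus", "-"), ("Name", "b")].
    No ambiguity is involved in either case; the whitespace before "-"
    simply stops the longest-match rule from absorbing it into the name.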

---------------------------------------------------------------------------
para 7:

"Special whitespace notation"
    Note that only the *notation* is special. The treatment of whitespace
    characters in "ws: explicit" and "ws: significant" productions is *not*
    special: they treat them like any other character, just as the XML spec
    does.  It's the *unmarked* productions that have special interpretation
    with respect to whitespace.

"when it is different from the default rules"
    What does "the default rules" refer to? I think it's just the previous
    paragraph, but it's not entirely clear.

"where whitespace is allowed must be explicitly notated in the BNF"
    It might be clearer to say:
        "whitespace is allowed only where explicitly notated in the BNF"

'"ws: significant" means that whitespace is significant as value content'
    More importantly (from the point of view of lexing), it also means
    that, just like "ws: explicit", whitespace is allowed only where the
    EBNF explicitly allows it. (In fact, from the point of view of lexing,
    there's no difference between the two markings.)

    It's not clear how explicit and implicit whitespace interact. For
    instance, consider two adjacent lexemes, one directly derived from a
    production with a "ws" marking, and one from an unmarked production.
    Is implicit whitespace allowed between them or not? (The answer, I
    suspect, is that it depends. But what it depends on is somewhat
    tricky.)

---------------------------------------------------------------------------
para 8:

"For XQuery, Whitespace is not freely allowed in the non-computed
Constructor productions, but is specified explicitly in the grammar"
    This appears to be equivalent to pointing out that the productions for
    ElementConstructor and AttributeList are marked "ws: explicit".
    The sentence was necessary in the previous draft (which didn't have
    the idea of "ws: explicit"), but it isn't necessary any more.
    If you wanted to retain it, it might make more sense (reworded
    somewhat) after the sentence that introduces "ws: explicit".

"The lexical states where whitespace must have explicit specification are
as follows: ..."
    I don't think this sentence is saying anything useful. For instance, it
    lists START_TAG and END_TAG as states where whitespace must have
    explicit specification. Well, they do: both have explicit transitions
    on S. But so do 7 other states that aren't listed. And it lists
    PROCESSING_INSTRUCTION, which has no transition on S, and presumably
    shouldn't have one.

-Michael Dyck
Received on Friday, 22 November 2002 00:34:04 UTC