XQuery lexical stuff again from Michael Dyck on 2002-08-19 (public-qt-comments@w3.org from August 2002)

From: Michael Dyck <jmdyck@ibiblio.org>
Date: Mon, 19 Aug 2002 00:34:45 -0400 (EDT)
To: public-qt-comments@w3.org
Message-id: <3D605AE2.553D1761@ibiblio.org>
XQuery 1.0: An XML Query Language 
W3C Working Draft 16 August 2002

---------------------------------------------------------------------------
3.1.5 Comments

"Comments may be used before and after major tokens within expressions and
within element content."
    What is a "major token"?

---------------------------------------------------------------------------
A.1 Lexical structure

"Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646, as long as these characters are
legal XML characters as defined in the [XML] recommendation."
    Why not just say:
        Legal characters are those allowed in the [XML] recommendation.

"A lexeme is the smallest meaningful unit in the grammar that has syntactic
interpretation."
    This doesn't appear to be true. For instance, the subsequent table
    indicates that 'p:foo' is a lexeme, but it seems clear (to me, anyway)
    that 'p' and 'foo' are smaller meaningful units that have syntactic
    interpretation. (Note that I don't object to calling 'p:foo' a lexeme,
    I just object to this definition.)

    I think it's pointless to try to give that kind of definition for
    'lexeme'. You'd be much better off saying something like:
        A lexeme is any sequence of characters derived from one of the
        following symbols: ...

    Better yet, how about this:
        For the grammar presented in A.2, the terminal symbols are:
        (1) all of the quoted strings appearing in the grammar, and
        (2)
            NCName
            QName

            IntegerLiteral
            DecimalLiteral
            DoubleLiteral
            StringLiteral

            S
            EscapeQuot
            URLLiteral
            PITarget
            VarName
            FuncName
            NCNameForPrefix
            PredefinedEntityRef
            CharRef
            EscapeApos
            Char

        [You might say that these terminal symbols are "lexeme symbols" or
        "lexical symbols".]
        A lexeme is any sequence of characters derived from (matching) one
        of these symbols. (It is an 'instance' of that symbol.)
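
    As a rough illustration of "a lexeme is an instance of a symbol", here
    is a minimal sketch. It is not taken from the draft: the patterns are
    simplified, ASCII-only stand-ins for a few of the symbols listed above,
    and the names LEXICAL_SYMBOLS and symbols_matching are hypothetical.

```python
import re

# Hypothetical, simplified patterns (ASCII only). The real definitions of
# NCName, QName and the numeric literals are considerably broader.
NCNAME = r"[A-Za-z_][A-Za-z0-9._-]*"
LEXICAL_SYMBOLS = [
    ("DoubleLiteral", r"(?:\d+(?:\.\d*)?|\.\d+)[eE][+-]?\d+"),
    ("DecimalLiteral", r"\d+\.\d*|\.\d+"),
    ("IntegerLiteral", r"\d+"),
    ("NCName", NCNAME),
    ("QName", rf"(?:{NCNAME}:)?{NCNAME}"),
]


def symbols_matching(text):
    """Return the name of every listed symbol that `text` is an instance of."""
    return [name for name, pattern in LEXICAL_SYMBOLS
            if re.fullmatch(pattern, text)]
```

    Under these assumptions, symbols_matching("p:foo") gives ["QName"],
    while symbols_matching("foo") gives ["NCName", "QName"]: the same
    character sequence can be an instance of more than one symbol, which is
    one more reason to define lexemes by listing the symbols.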

"A token is a symbol that matches lexemes,"
    A token is not a symbol. Normally, a token is an *instance* of a
    symbol. But here, it doesn't even appear to be that. See later.

"and is the output of the lexical analyzer."
    Why couldn't a lexical analyzer output lexemes?

"A token symbol is the symbolic name given to that token."
    I wouldn't say that we "give names to tokens".  Really, a token symbol
    is any symbol whose instances you choose to call tokens. So the best
    way to define token symbols would be by listing them.

"A single token may be composed of one or more lexemes. If there is more
than one lexeme, they may be separated by whitespace or punctuation."
    Surely not punctuation: if two lexemes are separated by punctuation,
    the punctuation would itself be a lexeme.

"For instance, the token AxisDescendantOrSelf has two lexemes,
"descendant-or-self" and "::"."
    Except that there is no such token (token symbol) any more. None of the
    multi-part token symbols exist any more. So if you want to say that the
    combination of "descendant-or-self" and "::" constitutes a token, then
    it's a token that is not an instance of any symbol.

    Given this, I think things might be clearer (and closer to standard
    terminology) if you made the following changes in nomenclature:
        "token"  -> "token phrase" or "token pattern"
        "lexeme" -> "token"
    (Note that "token symbols" can stay pretty much as is, because
    everything currently called a token symbol is also what would currently
    be called a lexeme symbol, and thus what would, after these changes, be
    called a token symbol.)

"For instance, an implementation may decide that a token named 'For' ..."
    Huh? The token symbol 'For' doesn't exist any more.

"... is composed of only "for", or may decide that it is composed of
("for" "(")."
    Why would you have "for" followed by an open paren? Did you mean "$"
    instead of "("?

    The idea that an implementation can decide what constitutes a token
    seems at odds with the rest of A.1.

"In the first case the implementation may decide to use lexical lookahead
to distinguish the "for" lexeme from a QName that has the lexeme "for"."
    It might be clearer to say:
        ... to distinguish the keyword "for" from the QName "for".

    Mind you, A.3 says that "for" is a reserved word. If that means that
    it's never allowed as a QName, then you wouldn't need lexical lookahead
    to distinguish the two cases: the second is always illegal. In which
    case, this is a poor example.  Or maybe A.3 means something else.
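
    For concreteness, here is a sketch of the kind of lexical lookahead the
    draft seems to have in mind. It rests on an assumption of mine, not on
    anything the draft states: that the keyword "for" is always followed
    (after optional whitespace) by "$". The function name classify_for is
    hypothetical, and the caller is assumed to have already established
    that a complete word "for" starts at the given position.

```python
import re


def classify_for(query, pos):
    """Classify the word "for" at `pos` by peeking at what follows it.

    Assumption for this sketch: the keyword is followed, after optional
    whitespace, by "$"; in any other context the word is treated as a name.
    """
    if not query.startswith("for", pos):
        raise ValueError('no "for" at this position')
    rest = query[pos + 3:]
    # Look ahead without consuming anything: only the next lexeme matters.
    return "keyword" if re.match(r"\s*\$", rest) else "QName"
```

    With this, classify_for("for $x in (1, 2) return $x", 0) yields
    "keyword" and classify_for("child::for", 7) yields "QName". If "for"
    is reserved in the strong sense, the second branch is never needed.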

"Lexemes that must be described by lexical lookahead ..."
    You've just considered a case that could be handled/described by
    lexical lookahead *or* by "long tokens". The "must" in the phrase in
    question implies that some cases can *only* be handled/described by
    lexical lookahead.  I don't think this is what you meant.

"... are delimited with the tokens that it must look ahead to, in order to
be recoginized, by "<" and ">"."
    This is fairly clunky phrasing. I think this would convey your meaning
    better:
        In the BNF, the notation "< ... >" is used to indicate/delimit
        a sequence of lexemes that must be recognized using lexical
        lookahead or some equivalent means.

    (Delete the first "i" in "recoginized".)

"This grammar implies lexical states"
    No, lexical states are an aspect of a particular implementation
    strategy. The grammar does not imply them.

"the normative rules for calculating these states are given in the A.1.2
Lexical Rules section."
    No such rules appear.

"Whitespace may be freely added between lexemes, except a few cases where
whitespace is needed to disambiguate the token."
    So actually, whitespace may be freely added there as well; what you
    can't do in those cases is freely *remove* whitespace.
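
    A small sketch of the asymmetry, assuming a longest-match lexer and a
    simplified, ASCII-only name pattern (both are my assumptions, and the
    names TOKEN and lexemes are hypothetical). Because "-" is a legal name
    character, the whitespace around a minus sign cannot always be dropped:

```python
import re

# Optional leading whitespace, then one lexeme: a name (simplified, ASCII
# only; note that "-" may appear inside a name), an integer, or an operator.
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9._-]*|\d+|[-+*()])")


def lexemes(text):
    """Split `text` into lexemes, always taking the longest name possible."""
    out, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out
```

    Here lexemes("a - b") and lexemes("a   -   b") both give
    ["a", "-", "b"], so adding whitespace changes nothing, whereas
    lexemes("a-b") gives the single name ["a-b"].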

"Whitespace is not freely allowed in the Constructor productions"
    Except it's presumably allowed in the computed constructor productions.

"but is specified specifically in the grammar"
    "specified specifically" is a bit clunky. Maybe change "specifically"
    to "explicitly".

"Lexically, these states are as follows"
    There isn't an antecedent for "these states".

---------------------------------------------------------------------------
A.1.1 Syntactic Constructs

[149] Nmstart
[150] Nmchar
[232] NCNameForPrefix
    The definition for Nmchar is the same as the definition of NCNameChar
    in XML Namespaces.  Why not use it?

    In fact, the only use of Nmstart and Nmchar is in NCNameForPrefix,
    which is equivalent to NCName. Why not just use NCName, and drop
    Nmstart and Nmchar completely?

[193] Digits
[236] HexDigits
    I don't think you want these to be token symbols, because they only
    occur within things that you *do* want to be token symbols: numeric
    literals and character references.

[229] FuncName
    The definition is the same as that of QName. Why not use it?

[255] WhitespaceChar
    It doesn't make much sense to call this a token symbol, because it
    only occurs in the definition of S, which is (supposedly) a token
    symbol itself.

---------------------------------------------------------------------------
A.1.2 Lexical Rules

"there are various strategies that can be used by an implementation to
disambiguate token symbol choices"
    Disambiguation is something done by the language specification, not
    implementations.

"This specification does not dictate what strategy to use."
    Hurray!

"However, this section does describe normative rules with which these
decisions must conform to. ... An implementation need not follow this
approach in implementing lexer rules, but does need to conform to the
results."
    Argh! How can I convince you what a bad idea this is? Perhaps an
    analogy would help. Imagine that the spec said this:

        There are various strategies that can be used to parse queries.
        Among the choices are recursive descent, LL(k), LR(k), GLR(k),
        et cetera. This specification does not dictate what strategy
        to use.  However, this section presents a normative LL(1)
        parsing automaton which parsers must conform to.

    Such a statement would be met (I suspect) with howls of indignation, as
    (a) it favours a particular implementation strategy, making conformance
        more difficult for anyone choosing to use a different strategy; and
    (b) it's completely unnecessary, since the *grammar* is already all the
        specification that implementers need. (Moreover, it's more concise,
        more declarative, and more likely to be bug-free.)

    Both of these arguments carry over to lexing. It isn't the spec's job
    to define the lexer any more than it's the spec's job to define the
    parser. Instead, it's the spec's job to define the *language*.

    Note that I don't much mind if the spec contains this lexical
    automaton, as long as it isn't normative in any way.

    (By the way, "with which these decisions must conform to" has an extra
    preposition. Delete "with", say.)

"For instance, instead of using ..."
    This sentence seems to belong more to the previous paragraph. Why not
    combine it with the "Among the choices" sentence?

"a state automata"
    Change "automata" to "automaton".

"an implementation might use lexecal look-behind"
    Change "lexecal" to "lexical".

"a more ambiguous token strategy"
    Ambiguity is a property of grammars, so it's probably misleading to
    use it to describe an implementation strategy.

---------------------------------------------------------------------------
A.2 BNF

I think it would make more sense to put this section before A.1 Lexical
Structure. In the same way that the grammar in this section has been laid
out in a roughly top-down order (symbols are generally used, then defined),
the whole of Appendix A could be laid out this way. (The phrase-structure
grammar is ultimately defined in terms of terminal/lexical/token symbols,
which are then defined in terms of sub-lexical symbols, which are then
defined in terms of character classes.) This is closer to how Appendix A
used to be laid out; I'm not sure why it was changed.

"The following grammar uses the same Basic EBNF notation as [XML], except
that grammar symbols always have initial capital letters."
    I still wonder what the reason for this exception is.

    There's another exception: the use of < and > as delimiters.

"The EBNF contains the lexemes embedded in the productions."
    Lexemes don't occur in the grammar, they occur in the texts derived
    from (or matching) the grammar. Presumably you're referring to the
    presence of quoted strings in the grammar. While this is different from
    previous XQuery drafts, it isn't different from the EBNF notation of
    XML or XPath 1.0 or most other places, so it's not really worth
    pointing out.

---------------------------------------------------------------------------
A.3 Reserved Words

"The following is a list of reserved words for XQuery"
    But you don't define what this means. Presumably it means that their
    use is illegal in certain contexts where it would otherwise appear to
    be legal, but you need to specify what those contexts are.

---------------------------------------------------------------------------

-Michael Dyck
Received on Monday, 19 August 2002 04:40:17 UTC