- From: Michael Dyck <jmdyck@ibiblio.org>
- Date: Mon, 19 Aug 2002 00:34:45 -0400 (EDT)
- To: public-qt-comments@w3.org
XQuery 1.0: An XML Query Language
W3C Working Draft 16 August 2002

---------------------------------------------------------------------------
3.1.5 Comments

"Comments may be used before and after major tokens within expressions
and within element content."
    What is a "major token"?

---------------------------------------------------------------------------
A.1 Lexical structure

"Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646, as long as these characters are
legal XML characters as defined in the [XML] recommendation."
    Why not just say:
        Legal characters are those allowed in the [XML] recommendation.

"A lexeme is the smallest meaningful unit in the grammar that has
syntactic interpretation."
    This doesn't appear to be true. For instance, the subsequent table
    indicates that 'p:foo' is a lexeme, but it seems clear (to me,
    anyway) that 'p' and 'foo' are smaller meaningful units that have
    syntactic interpretation. (Note that I don't object to calling
    'p:foo' a lexeme, I just object to this definition.)

    I think it's pointless to try to give that kind of definition for
    'lexeme'. You'd be much better off saying something like:
        A lexeme is any sequence of characters derived from one of the
        following symbols: ...

    Better yet, how about this:
        For the grammar presented in A.2, the terminal symbols are:
        (1) all of the quoted strings appearing in the grammar, and
        (2) NCName QName IntegerLiteral DecimalLiteral DoubleLiteral
            StringLiteral S EscapeQuot URLLiteral PITarget VarName
            FuncName NCNameForPrefix PredefinedEntityRef CharRef
            EscapeApos Char
        [You might say that these terminal symbols are "lexeme symbols"
        or "lexical symbols".]
        A lexeme is any sequence of characters derived from (matching)
        one of these symbols. (It is an 'instance' of that symbol.)

"A token is a symbol that matches lexemes,"
    A token is not a symbol. Normally, a token is an *instance* of a
    symbol. But here, it doesn't even appear to be that. See later.
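
The 'p:foo' observation above can be illustrated with a minimal Python
sketch. It is not taken from the draft: the NCNAME pattern is a simplified
ASCII-only stand-in for the real NCName production, and split_qname is a
hypothetical helper name. It only shows that a QName lexeme decomposes into
smaller units (prefix and local part) that each have their own syntactic
role.

```python
import re

# Simplified, ASCII-only stand-in for NCName (an assumption for this
# sketch; the real production in XML Namespaces covers far more characters).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.-]*"

# A QName is an optional prefix and colon, followed by a local part.
QNAME = re.compile(rf"(?:(?P<prefix>{NCNAME}):)?(?P<local>{NCNAME})")


def split_qname(lexeme):
    """Return (prefix, local) for a QName lexeme; prefix is None if absent."""
    m = QNAME.fullmatch(lexeme)
    if m is None:
        raise ValueError(f"not a QName under the simplified pattern: {lexeme!r}")
    return m.group("prefix"), m.group("local")


print(split_qname("p:foo"))
print(split_qname("foo"))
```

Under these assumptions, 'p:foo' is one lexeme, yet both 'p' and 'foo' are
recoverable and meaningful on their own, which is why "smallest meaningful
unit" does not work as a definition.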
"and is the output of the lexical analyzer."
    Why couldn't a lexical analyzer output lexemes?

"A token symbol is the symbolic name given to that token."
    I wouldn't say that we "give names to tokens". Really, a token
    symbol is any symbol whose instances you choose to call tokens. So
    the best way to define token symbols would be by listing them.

"A single token may be composed of one or more lexemes. If there is more
than one lexeme, they may be separated by whitespace or punctuation."
    Surely not punctuation: if two lexemes are separated by punctuation,
    the punctuation would itself be a lexeme.

"For instance, the token AxisDescendantOrSelf has two lexemes,
"descendant-or-self" and "::"."
    Except that there is no such token (token symbol) any more. None of
    the multi-part token symbols exist any more. So if you want to say
    that the combination of "descendant-or-self" and "::" constitutes a
    token, then it's a token that is not an instance of any symbol.

    Given this, I think things might be clearer (and closer to standard
    terminology) if you made the following changes in nomenclature:
        "token"  -> "token phrase" or "token pattern"
        "lexeme" -> "token"
    (Note that "token symbols" can stay pretty much as is, because
    everything currently called a token symbol is also what would
    currently be called a lexeme symbol, and thus what would, after
    these changes, be called a token symbol.)

"For instance, an implementation may decide that a token named 'For' ..."
    Huh? The token symbol 'For' doesn't exist any more.

"... is composed of only "for", or may decide that it is composed of
("for" "(")."
    Why would you have "for" followed by an open paren? Did you mean "$"
    instead of "("?

    The idea that an implementation can decide what constitutes a token
    seems at odds with the rest of A.1.

"In the first case the implementation may decide to use lexical
lookahead to distinguish the "for" lexeme from a QName that has the
lexeme "for"."
    It might be clearer to say:
        ...
        to distinguish the keyword "for" from the QName "for".
    Mind you, A.3 says that "for" is a reserved word. If that means that
    it's never allowed as a QName, then you wouldn't need lexical
    lookahead to distinguish the two cases: the second is always
    illegal. In which case, this is a poor example. Or maybe A.3 means
    something else.

"Lexemes that must be described by lexical lookahead ..."
    You've just considered a case that could be handled/described by
    lexical lookahead *or* by "long tokens". The "must" in the phrase in
    question implies that some cases can *only* be handled/described by
    lexical lookahead. I don't think this is what you meant.

"... are delimited with the tokens that it must look ahead to, in order
to be recoginized, by "<" and ">"."
    This is fairly clunky phrasing. I think this would convey your
    meaning better:
        In the BNF, the notation "< ... >" is used to indicate/delimit a
        sequence of lexemes that must be recognized using lexical
        lookahead or some equivalent means.
    (Delete the first "i" in "recoginized".)

"This grammar implies lexical states"
    No, lexical states are an aspect of a particular implementation
    strategy. The grammar does not imply them.

"the normative rules for calculating these states are given in the A.1.2
Lexical Rules section."
    No such rules appear.

"Whitespace may be freely added between lexemes, except a few cases
where whitespace is needed to disambiguate the token."
    So actually, whitespace may be freely added there as well; what you
    can't do in those cases is freely *subtract* whitespace.

"Whitespace is not freely allowed in the Constructor productions"
    Except it's presumably allowed in the computed constructor
    productions.

"but is specified specifically in the grammar"
    "specified specifically" is a bit clunky. Maybe change
    "specifically" to "explicitly".

"Lexically, these states are as follows"
    There isn't an antecedent for "these states".
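
The remark above that the "for" case can be handled by lexical lookahead
*or* by "long tokens" can be sketched in Python. This is an illustration
under stated assumptions, not anything the draft specifies: NAME is a
simplified ASCII stand-in for the name productions, both function names are
hypothetical, the keyword is assumed to be followed by "$" (per the
suggested correction of "("), and "for" is assumed to be usable as a name
elsewhere, which A.3 may in fact rule out.

```python
import re

# Simplified ASCII stand-in for a name lexeme (assumption for this sketch).
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

# "Long token" pattern: the keyword together with the "$" that follows it.
LONG_FOR = re.compile(r"for\s*\$")


def classify_with_lookahead(text):
    """Match the short lexeme, then peek past whitespace for "$"."""
    m = NAME.match(text)
    if m is None:
        return None
    word = m.group()
    rest = text[m.end():].lstrip()
    if word == "for" and rest.startswith("$"):
        return ("KEYWORD", word)
    return ("NAME", word)


def classify_with_long_token(text):
    """Try the multi-lexeme pattern first, fall back to a plain name."""
    if LONG_FOR.match(text):
        return ("KEYWORD", "for")
    m = NAME.match(text)
    return ("NAME", m.group()) if m else None


for sample in ["for $x in (1, 2)", "for/bar", "format $x", "for-each $x"]:
    print(sample, classify_with_lookahead(sample), classify_with_long_token(sample))
```

Both functions classify the samples identically, which is the sense in which
neither strategy is one that "must" be used: they are two routes to the same
result.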
---------------------------------------------------------------------------
A.1.1 Syntactic Constructs

[149] Nmstart
[150] Nmchar
[232] NCNameForPrefix
    The definition for Nmchar is the same as the definition of
    NCNameChar in XML Namespaces. Why not use it?

    In fact, the only use of Nmstart and Nmchar is in NCNameForPrefix,
    which is equivalent to NCName. Why not just use NCName, and drop
    Nmstart and Nmchar completely?

[193] Digits
[236] HexDigits
    I don't think you want these to be token symbols, because they only
    occur within things that you *do* want to be token symbols: numeric
    literals and character references.

[229] FuncName
    The definition is the same as that of QName. Why not use it?

[255] WhitespaceChar
    It doesn't make much sense to call this a token symbol, because it
    only occurs in the definition of S, which is (supposedly) a token
    symbol itself.

---------------------------------------------------------------------------
A.1.2 Lexical Rules

"there are various strategies that can be used by an implementation to
disambiguate token symbol choices"
    Disambiguation is something done by the language specification, not
    implementations.

"This specification does not dictate what strategy to use."
    Hurray!

"However, this section does describe normative rules with which these
decisions must conform to. ... An implementation need not follow this
approach in implementing lexer rules, but does need to conform to the
results."
    Argh! How can I convince you what a bad idea this is? Perhaps an
    analogy would help. Imagine that the spec said this:
        There are various strategies that can be used to parse queries.
        Among the choices are recursive descent, LL(k), LR(k), GLR(k),
        et cetera. This specification does not dictate what strategy to
        use. However, this section presents a normative LL(1) parsing
        automaton which parsers must conform to.
    Such a statement would be met (I suspect) with howls of indignation,
    as
    (a) it favours a particular implementation strategy, making
        conformance more difficult for anyone choosing to use a
        different strategy; and
    (b) it's completely unnecessary, since the *grammar* is already all
        the specification that implementers need. (Moreover, it's more
        concise, more declarative, and more likely to be bug-free.)
    Both of these arguments carry over to lexing. It isn't the spec's
    job to define the lexer any more than it's the spec's job to define
    the parser. Instead, it's the spec's job to define the *language*.

    Note that I don't much mind if the spec contains this lexical
    automaton, as long as it isn't normative in any way.

    (By the way, "with which these decisions must conform to" has an
    extra preposition. Delete "with", say.)

"For instance, instead of using ..."
    This sentence seems to belong more to the previous paragraph. Why
    not combine it with the "Among the choices" sentence?

"a state automata"
    Change "automata" to "automaton".

"an implementation might use lexecal look-behind"
    Change "lexecal" to "lexical".

"a more ambiguous token strategy"
    Ambiguity is a property of grammars, so it's probably misleading to
    use it to describe an implementation strategy.

---------------------------------------------------------------------------
A.2 BNF

    I think it would make more sense to put this section before A.1
    Lexical Structure. In the same way that the grammar in this section
    has been laid out in a roughly top-down order (symbols are generally
    used, then defined), the whole of Appendix A could be laid out this
    way. (The phrase-structure grammar is ultimately defined in terms of
    terminal/lexical/token symbols, which are then defined in terms of
    sub-lexical symbols, which are then defined in terms of character
    classes.) This is closer to how Appendix A used to be laid out; I'm
    not sure why it was changed.
"The following grammar uses the same Basic EBNF notation as [XML],
except that grammar symbols always have initial capital letters."
    I still wonder what the reason for this exception is.

    There's another exception: the use of < and > as delimiters.

"The EBNF contains the lexemes embedded in the productions."
    Lexemes don't occur in the grammar, they occur in the texts derived
    from (or matching) the grammar. Presumably you're referring to the
    presence of quoted strings in the grammar. While this is different
    from previous XQuery drafts, it isn't different from the EBNF
    notation of XML or XPath 1.0 or most other places, so it's not
    really worth pointing out.

---------------------------------------------------------------------------
A.3 Reserved Words

"The following is a list of reserved words for XQuery"
    But you don't define what this means. Presumably it means that their
    use is illegal in certain contexts where it would otherwise appear
    to be legal, but you need to specify what those contexts are.

---------------------------------------------------------------------------

-Michael Dyck
Received on Monday, 19 August 2002 04:40:17 UTC