- From: Michael Dyck <jmdyck@ibiblio.org>
- Date: Wed, 20 Nov 2002 22:02:09 -0500 (EST)
- To: public-qt-comments@w3.org
XQuery 1.0: An XML Query Language
W3C Working Draft 13 November 2002

(This message deals with the part of A.1 that isn't in its subsections.)

---------------------------------------------------------------------------
para 2:

"A lexical pattern is a rule that describes how a sequence of characters
can match a grammar unit."

    (1) The word "rule" seems inappropriate here. "Expression" would be
        more accurate, but confusing. How about "form"? or "grammatical
        form"?

    (2) It doesn't really describe how a character-sequence can match,
        but rather defines which ones do match.

    (3) What is a "grammar unit"? The term is neither defined nor used
        elsewhere in the spec. For instance, the table for the DEFAULT
        state indicates that <"for" "$"> is a pattern. So what grammar
        unit does "for $" match? It seems that the most you can say is
        that it matches the <"for" "$"> pattern in production 45. So a
        pattern describes how a character-sequence matches a pattern?

    I think you'd be better off just deleting this sentence.

"A lexeme is the smallest meaningful unit in the grammar that has
syntactic interpretation."

    The subsequent table used to say that 'p:foo' was one lexeme, but
    that conflicted with the above sentence, because 'p' and 'foo' are
    smaller meaningful units. In response, the table now indicates that
    'p:foo' is three lexemes. However, that's the wrong response to the
    conflict, because it implies that arbitrary whitespace is allowed
    around the colon, which is presumably not what you intended.

    Instead, the correct response is to remove all mention of
    "meaningful units" and "syntactic interpretation". As I said last
    time:

    - - - - - - - - - - - - - - - - - - - - -
    I think it's pointless to try to give that kind of definition for
    'lexeme'. You'd be much better off saying something like:

        A lexeme is any sequence of characters derived from one of the
        following symbols: ...
    Better yet, how about this:

        For the grammar presented in A.2, the terminal symbols are:
        (1) all of the quoted strings appearing in the grammar, and
        (2) NCName QName IntegerLiteral DecimalLiteral DoubleLiteral
            StringLiteral S EscapeQuot URLLiteral PITarget VarName
            FuncName NCNameForPrefix PredefinedEntityRef CharRef
            EscapeApos Char
        [You might say that these terminal symbols are "lexeme symbols"
        or "lexical symbols".]

        A lexeme is any sequence of characters derived from (matching)
        one of these symbols. (It is an 'instance' of that symbol.)
    - - - - - - - - - - - - - - - - - - - - -

    Upon reflection, I think I'd change that last sentence to:

        Within a [query/expression], a lexeme is any sequence of
        characters that is derived from (matches) any of these terminal
        symbols other than S.

"A token is a symbol that matches lexemes,"

    A token is not a symbol. Normally, a token is an *instance* of a
    symbol. But here, it doesn't even appear to be that.

"and is the output of the lexical analyzer."

    Why couldn't a lexical analyzer output lexemes?

"A token symbol is the symbolic name given to that token."

    I wouldn't say that we "give names to tokens". Really, a token
    symbol is any symbol whose instances you choose to call tokens. So
    the best way to define token symbols would be by listing them.

"A single token may be composed of one or more lexemes. If there is more
than one lexeme, they may be separated by whitespace or punctuation."

    Surely not punctuation: if two lexemes are separated by punctuation,
    the punctuation would itself be a lexeme.

"For instance, a token AxisDescendantOrSelf might have two lexemes,
"descendant-or-self" and "::"."

    Or it might not? What might it have then?

    This "might", and the "(for example)" in the "Token Names" column,
    seem to be there because the spec doesn't actually define
    names/types/symbols for a lot of the things it calls tokens.
    (e.g., it doesn't define the tokens Or, Equals, or
    AxisDescendantOrSelf) Moreover, the paragraph after the table
    indicates that even where one draws the boundary between tokens is
    up to the implementation.

    So I think it's pointless for the spec to define or use the terms
    "token" or "token symbol", except when it's discussing possible
    implementation strategies. Everywhere else, I think you could pretty
    much just replace "token" with "lexeme". In some places, "sequence
    of lexemes" might be better. For instance, in the 3rd para of A.1.2,
    where it says
        "When a given token is recognized"
    you might need to say
        "When a given sequence of lexemes is recognized"
    but you could just say
        "When one of the patterns is matched"

---------------------------------------------------------------------------
table:

"(Prefix ':')? LocalPart"

    Note that no such pattern appears in the spec.

---------------------------------------------------------------------------
para 3:

"For example, an implementation may decide that a token named 'For' ..."

    I think this would be clearer:
        For example, one implementation may define a token [symbol]
        named 'For', consisting of only "for". Another implementation
        may define a token 'For' to consist of both "for" and "$".

"In the first case the implementation may decide to use lexical
lookahead to distinguish the "for" lexeme from a QName that has the
lexeme "for"."

    It would be clearer to say:
        ... to distinguish the keyword "for" from the QName "for".

"In the second case, the implementation may decide to combine the two
lexemes into a single "long" token."

    Doesn't that just restate what the earlier sentence said?

---------------------------------------------------------------------------
para 4:

"This grammar implies lexical states"

    No, lexical states are an aspect of a particular implementation
    strategy. The grammar does not imply them.

"normative rules for calculating these states are given in the A.1.2
Lexical Rules section."
    No such rules appear, only the resulting states.

---------------------------------------------------------------------------
para 5:

"When tokenizing, the longest possible match that is valid in the current
lexical state is prefered ."

    (1) Delete the space before the period.

    (2) The word "prefered" is kind of weak. It suggests "prefered but
        not required". Is this intentional?

---------------------------------------------------------------------------
para 6:

"For readability, Whitespace may be used..."

    There is no "Whitespace" symbol any more. There probably should be.
    How about:
        Whitespace ::= ( WhitespaceChar | ExprComment )+
    (This has the benefit of defining more formally how comments fit
    into the language.)

"Whitespace may be freely added between lexemes, except a few cases where
whitespace is needed to disambiguate the token."

    (1) In fact, whitespace may be freely *added* in those cases as
        well; what distinguishes those cases is that you can't freely
        *subtract* whitespace. (In particular, you can't remove all
        whitespace.) To address this, you might change:
            Whitespace may be freely added
        to:
            Any amount of whitespace (including none) may appear

    (2) Insert "in" before "a few", I think.

    (3) The phrase "disambiguate the token" is, I believe, a misuse of
        the concept of ambiguity. At any rate, I think it would be
        plainer and more accurate to say that whitespace is needed to
        prevent two adjacent lexemes from being (mis-)recognized as one.

        For instance, consider the character-sequence
            a- b
        Note that there is a space before the 'b'. It thus has only one
        derivation from Query (the "a minus b" one), so there is no
        ambiguity involved, no disambiguation needed. Nevertheless, it
        is still a case in which (I assume) whitespace is needed between
        'a' and '-' to prevent the longest-match rule from
        (mis-)recognizing 'a-' as a name.

    (4) Shouldn't you enumerate those "few cases"?

    (5) Can you confirm that
            10div3
        is a valid query, meaning the same thing as
            10 div 3
        ?
        (And if not, why not?)

    (6) This only refers to whitespace *between* lexemes, but what about
        before the first lexeme of a query, and after the last? I
        imagine it's allowed there too.

    (7) At the beginning of this sentence, I think you should insert:
            For productions without a "ws" marking

---------------------------------------------------------------------------
para 7:

"Special whitespace notation"

    Note that only the *notation* is special. The treatment of
    whitespace characters in "ws: explicit" and "ws: significant"
    productions is *not* special: they treat them like any other
    character, just as the XML spec does. It's the *unmarked*
    productions that have special interpretation with respect to
    whitespace.

"when it is different from the default rules"

    What does "the default rules" refer to? I think it's just the
    previous paragraph, but it's not entirely clear.

"where whitespace is allowed must be explicitly notated in the BNF"

    It might be clearer to say:
        "whitespace is allowed only where explicitly notated in the BNF"

'"ws: significant" means that whitespace is significant as value content'

    More importantly (from the point of view of lexing), it also means
    that, just like "ws: explicit", whitespace is allowed only where the
    EBNF explicitly allows it. (In fact, from the point of view of
    lexing, there's no difference between the two markings.)

    It's not clear how explicit and implicit whitespace interact. For
    instance, consider two adjacent lexemes, one directly derived from a
    production with a "ws" marking, and one from an unmarked production.
    Is implicit whitespace allowed between them or not? (The answer, I
    suspect, is that it depends. But what it depends on is somewhat
    tricky.)
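
To make the 'a- b' case from point (3) under para 6 concrete, here is a minimal sketch of longest-match tokenizing in Python. Everything in it is an assumption for illustration: the pattern names NAME, INTEGER and MINUS, the function `tokenize`, and the ASCII-only approximation of a name are invented here and are not the terminal symbols or lexical states of the XQuery draft.

```python
import re

# Toy patterns, NOT the XQuery grammar: an ASCII-only approximation of a
# name (which, like NCName, may contain '-' after its first character),
# an integer, and a minus sign.
TOKEN_PATTERNS = [
    ("NAME", re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")),
    ("INTEGER", re.compile(r"[0-9]+")),
    ("MINUS", re.compile(r"-")),
]
WHITESPACE = re.compile(r"\s+")


def tokenize(text):
    """Split text into (kind, lexeme) pairs, always taking the longest match."""
    pos, out = 0, []
    while pos < len(text):
        # Skip any run of whitespace between lexemes.
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()
            continue
        # Try every pattern at this position and keep the longest match.
        best = None
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (kind, m)
        if best is None:
            raise ValueError(f"no pattern matches at offset {pos}")
        out.append((best[0], best[1].group()))
        pos = best[1].end()
    return out


print(tokenize("a- b"))   # [('NAME', 'a-'), ('NAME', 'b')]
print(tokenize("a - b"))  # [('NAME', 'a'), ('MINUS', '-'), ('NAME', 'b')]
```

Without a space before the minus sign, this toy lexer never produces a MINUS at all: the name pattern swallows the hyphen. That is the mis-recognition described in (3), and it arises from the longest-match rule alone, with no grammatical ambiguity involved.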
---------------------------------------------------------------------------
para 8:

"For XQuery, Whitespace is not freely allowed in the non-computed
Constructor productions, but is specified explicitly in the grammar"

    This appears to be equivalent to pointing out that the productions
    for ElementConstructor and AttributeList are marked "ws: explicit".
    The sentence was necessary in the previous draft (which didn't have
    the idea of "ws: explicit"), but it isn't necessary any more. If you
    wanted to retain it, it might make more sense (reworded somewhat)
    after the sentence that introduces "ws: explicit".

"The lexical states where whitespace must have explicit specification are
as follows: ..."

    I don't think this sentence is saying anything useful. For instance,
    it lists START_TAG and END_TAG as states where whitespace must have
    explicit specification. Well, they do: both have explicit
    transitions on S. But so do 7 other states that aren't listed. And
    it lists PROCESSING_INSTRUCTION, which has no transition on S, and
    presumably shouldn't.

-Michael Dyck
Received on Friday, 22 November 2002 00:34:04 UTC