- From: Michael Dyck <MichaelDyck@shaw.ca>
- Date: Tue, 01 Jan 2002 23:53:29 -0800
- To: www-xml-query-comments@w3.org
XQuery 1.0: An XML Query Language
W3C Working Draft 20 December 2001

Here are some comments on A.3 Lexical structure.

------------------------------------------------------------------------

para 1:
"Whitespace may be freely added within patterns"
    What do you mean by "patterns"? Presumably, you're either talking
    about adding the symbol 'Whitespace' to grammar productions, or
    adding whitespace (i.e., sequences of characters) to queries.
    Don't confuse the two.

"before or after any token"
    But you never actually define what a token is. It's not even clear
    what the set of token-types is. (Is it the set of left-hand-sides
    of productions 75 through 216? Is it the set of symbols that appear
    in the "tokens" column of the TRANSITION STATES table? The two are
    different, and both contain symbols that probably shouldn't be
    considered token-types.)

para 1 and bullets 1 and 2:
    Note that the Whitespace symbol derives the empty string, but
    phrases like "must always be followed by whitespace" and
    "whitespace may not occur" obviously mean "whitespace" in the
    sense of "a non-empty string of whitespace-characters". I think
    this shows correct usage, and there's no reason for Whitespace to
    be nullable. (That is, it should be the same as S.)

bullet 3:
"A space"
    We're interested in whitespace, not just a space.

"may be significant"
    Don't tell us that it *may* be significant. Tell us exactly when
    it *is* significant.

para 2:
"Tokens may be often only recognized"
    "may be often only" is clunky.

"in a specific state"
    You haven't defined states yet.

"within the evaluation":
    Does evaluation of a query include its parsing/lexing?

"may cause the grammar to transition to a different state"
    Grammars don't have states or transitions. Automata do.

"following the enumeration of tokens"
    Change "tokens" to "token-types".
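
The remark above about Whitespace being nullable can be checked with a small sketch. It assumes, as the rest of this message implies, that Whitespace [210] is WhitespaceChar* and S [78] is WhitespaceChar+, with WhitespaceChar being #x9, #xA, #xD or #x20. The helper name is_followed_by and the sample inputs are made up for illustration and do not come from the draft.

```python
import re

# Stand-ins for two productions of the draft, as described in this message:
# WhitespaceChar is taken to be #x9, #xA, #xD or #x20.
WHITESPACE = re.compile(r"[\t\n\r ]*")  # [210] Whitespace: may match nothing
S = re.compile(r"[\t\n\r ]+")           # [78] S: needs at least one character


def is_followed_by(symbol, text, pos):
    """Return True if `symbol` matches in `text` starting at index `pos`."""
    return symbol.match(text, pos) is not None


# Hypothetical inputs: a three-letter keyword with and without a separator.
for text in ("for x", "forx"):
    print(repr(text),
          "Whitespace:", is_followed_by(WHITESPACE, text, 3),
          "S:", is_followed_by(S, text, 3))
```

Under these assumptions the nullable symbol is satisfied at index 3 of both inputs, so a rule phrased as "must be followed by Whitespace" constrains nothing; only the non-empty S rejects "forx".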

para 3:
"When tokenizing, the longest possible token is always returned"
    Issue 109 says this means "the longest sequence that would form a
    token in the token-space of the grammar, not the longest that
    would be valid in the current syntactic context." Does it?

"If there is an ambiguity between two tokens, ..."
    Presumably, you mean an ambiguity that isn't resolved by the
    longest-match rule.

"the token that an lower grammar number"
    Change "an" to "a".

"is more specific than"
    Why do we care which is "more specific"? We want to know which is
    the right one. I'll assume that's what you mean.

    I'm very suspicious of this kind of blanket rule. You're liable to
    shoot yourself in the foot. In fact, I believe you have:
    -- S [78] precedes Whitespace [210], so WhitespaceChar+ will
       always tokenize as S rather than Whitespace.
    -- Nmstart [123] precedes Nmchar [124], Char [208], Letter [211],
       and BaseChar [212], so [a-zA-Z_] will always tokenize as
       Nmstart.
    -- Digits [166] precedes HexDigits [193], so [0-9]+ will always
       tokenize as Digits.
    -- Char [208] precedes WhitespaceChar [209], so #x9, #xA, #xD, and
       #x20 will always tokenize as Char rather than WhitespaceChar.
    (Personally, I'm not sure I'd consider any of these to be token
    types, but they're all in the "tokens" column of the Transition
    States table.)

    In fact, here are the only cases I could find where this rule
    makes a sensible ruling:
        ELEMENT_CONTENT: prefer Lbrace and StartTagOpen over Char.
        APOS_ATTRIBUTE_CONTENT: prefer Lbrace and CloseApos over Char.
        QUOT_ATTRIBUTE_CONTENT: prefer Lbrace and CloseQuot over Char.
        DEFAULT: prefer reserved word over QName.
    Of these, the first three could be handled explicitly in the
    grammar (e.g., in the definition of ElementContent, replace Char
    with (a symbol denoting) [^{<]). The last could be handled with a
    specific rule.

para 5:
"ExprComment tokens should be ignored by the parser."
    Ignored in what sense?
    Consider
        foo{-- comment --}bar
    If we completely ignore the comment, does the parser see
        foobar
    or does the comment function as whitespace? If the latter, then I
    suggest the following changes:
        [78]  S ::= (WhitespaceChar | ExprComment)+
        [210] Whitespace ::= (WhitespaceChar | ExprComment)*

--------------------------
TERMINALS

Where are productions [73] and [74]?

Productions that are not used elsewhere in the grammar:
    [77]  ExprComment (appears in transition table & body text)
    [106] Before
    [107] After
    [121] Ref
    [126] ColonStar
    [176] SemiColon
    [177] Colon
    [210] Whitespace (appears in transition table & body text)

[77] ExprComment
    This seems like a poor name for the symbol, given that it's not a
    kind of expression.

    By the way, why did you drop single-line comments (# to line-end)?

[87] DefineFunction:
    Is DefineFunction a token type? If the parser finds text matching
    the DefineFunction production, is that one token, or two? (Or
    three, if we count the intervening whitespace as well?)

    I think you'd be better off with the conventional view, that
    'define' and 'function' are two separate tokens. So I suggest the
    following changes:
        [70] FunctionDefn ::= Define Function QName ...
        [87] Define ::= "define"
    (Note that [115] Function ::= "function" already exists.)

    Similarly for:
        [81]  AxisChild
        [82]  AxisDescendant
        [83]  AxisParent
        [84]  AxisAttribute
        [85]  AxisSelf
        [86]  AxisDescendantOrSelf
        [112] Instanceof
        [119] ElementOfType
        [163] CastAs
        [164] AssertAs
        [165] TreatAs

[169] DoubleLiteral
    Change ([e]|[E]) to [eE].
    Change ([+]|[-]) to [+-].

[189] QName
    The production allows tokens such as ':foo:bar', whereas 2.1 only
    says that an initial colon is allowed on "unprefixed QNames".

[193] HexDigits
    Change ([0-9]|[a-f]|[A-F]) to [0-9a-fA-F].

[200] ValueIndicator
    Why introduce ValueIndicator? Why not just have AttributeList [64]
    use Equals [131]?

[208] Char
    The outermost parentheses are unnecessary.
    Similarly for:
        [209] WhitespaceChar
        [212] BaseChar
        [213] Ideographic
        [214] CombiningChar
        [215] Digit
        [216] Extender

[210] Whitespace
    It's pointless to have a token type that derives the empty string,
    because the "longest possible token" rule guarantees that (in
    well-formed queries) the returned token will never be empty. So
    you might as well change the '*' to a '+', in which case you have
    the same right-hand-side as S [78], so you should probably merge
    the two productions/symbols/concepts.

--------------------------
A.3.1 Lexical States

para 5:
"To allow curly braces to be used as character content, a double left
or right curly brace is interpreted as a single curly brace character."
    An alternative would be to use character references (e.g., &#x7B;
    for { and &#x7D; for }).

para 7:
"An operator that immediately follows a "/" or "//" when used as a
root symbol, should not parse."
    Why not? You don't give any reason. The example given,
        / * foo
    is unambiguous:

                 MultiplicativeExpr
                         |
                +--------+-------+
                |        |       |
              Expr   Multiply   Expr
                |        |       |
            PathExpr     |    PathExpr
                |        |       |
                /        *      foo

    so what is the point of disallowing this parse?

    At first, I thought this was a bad attempt to convey a lexing rule
    that would allow the lexer to easily decide (in this particular
    context) whether '*' is a Star or a Multiply:
        After a Slash (and optional whitespace), a '*' is a Star, not
        a Multiply.

    But now I think this para was added to deal with a grammatical
    ambiguity that occurs if the lexer *doesn't* distinguish between
    Star and Multiply.
    Consider the query:
        / * / foo
    It can be derived in two different ways:

                       Expr
                         |
                 MultiplicativeExpr
                         |
                +--------+-------+
                |        |       |
              Expr       |      Expr
                |        |       |
            PathExpr     |    PathExpr
                |        |       |
                /        *     /foo

    and:

                    Expr
                      |
              AbsolutePathExpr
                      |
              +-------+------+
              |              |
            Slash     RelativePathExpr
              |              |
              |      +-------+------+
              |      |       |      |
              |   StepExpr Slash StepExpr
              |      |       |      |
              /      *       /     foo

    Because the latter derivation is more likely to be the intended
    one, I think you're proposing to disallow the first via the rule:
        A PathExpr that consists of just a Slash is not allowed as the
        left operand of a MultiplicativeExpr in which the operator is
        an asterisk.
    which leads to the following parsing rule:
        After recognizing the initial Slash of an AbsolutePathExpr, if
        the next token is an asterisk, assume that it is a Wildcard
        that continues the AbsolutePathExpr, rather than the operator
        of a MultiplicativeExpr.

    The paragraph in question could be a lot clearer about what you're
    doing and why.

--------------------------
TRANSITION STATES

The phrase "transition states" is odd. I think "transition table" or
"transition function" would make more sense.

Of 142 symbols defined under TERMINALS, only 40 appear in the "tokens"
column. What about all the others?

What does the first row mean? There's no state listed under "recognize
state". I think you mean it to be DEFAULT.

Do you really mean to allow ExprComment in ELEMENT_CONTENT? It seems
kind of like allowing a comment in a string literal.

I suspect it's a mistake to recognize:
    CdataSectionEnd in ELEMENT_CONTENT,
    StartTagOpen in QUOT_ATTRIBUTE_CONTENT,
    StartTagOpen in APOS_ATTRIBUTE_CONTENT,
    Lbrace in END_TAG, and
    Lbrace in DEFAULT.

I think it's a bad idea to introduce lexical states and the transition
table. As an analogy, if you want to define the syntax of a
NumericLiteral, you don't write down the transition table for a DFA
that recognizes NumericLiterals, you write an equivalent regular
expression.
Here, you're defining a push-down automaton, where you should be
writing an equivalent context-free grammar. In each case, people find
the latter much easier to read.

I've attached what I believe is an equivalent CFG. (Note that this
renders the above mistakes more obvious.)

But this raises the question of why you need two CFGs to define the
language. And the answer is, you don't. The second is superfluous.
Once you eliminate its mistakes, it tells you nothing that you
couldn't deduce from the first.

So you might ask, "How are we supposed to define the tokenization of
XQuery then?" To which I would respond, "Why do you think you have
to?" Note that the XML spec doesn't define a tokenization for XML. It
simply gives a complete character-level grammar for the language.
(That is, a grammar whose terminal symbols are individual characters.)
I don't think this would be too hard to do for XQuery. You've already
got pretty much all the productions you'd need -- you'd just have to
tweak some of them a little.

-Michael Dyck

------------------------------------------------------------------------

A grammar that generates the same language as the push-down automaton
defined in A.3.1

DEFAULT ::=
    braced_content* Rbrace

braced_content ::=
      WhitespaceChar | Nmstart | NCName | Nmchar | Digits | Letter
    | BaseChar | Ideographic | CombiningChar | Digit | Extender
    | HexDigits | Whitespace | S
    | ExprComment
    | xml_comment
    | pi
    | tagged_thing
    | braced_thing          # I suspect a mistake. (5)

xml_comment ::=
    XmlCommentStart Char* XmlCommentEnd

pi ::=
    ProcessingInstructionStart (PITarget|Char)* ProcessingInstructionEnd

tagged_thing ::=
    StartTagOpen
    start_tag_content*
    (
        EmptyTagClose
    |
        StartTagClose
        element_content*
        EndTagOpen
        (
            TagQName
        |   braced_thing    # I suspect a mistake. (4)
        )*
        EndTagClose
    )

start_tag_content ::=
      TagQName
    | ValueIndicator
    | OpenQuot quot_apos_content* CloseQuot
    | OpenApos quot_apos_content* CloseApos
    | braced_thing

quot_apos_content ::=
      Char
    | CharRef
    | PredefinedEntityRef
    | LCurlyBraceEscape
    | RCurlyBraceEscape
    | tagged_thing          # I suspect a mistake. (2+3)
    | braced_thing

element_content ::=
      Char
    | CharRef
    | PredefinedEntityRef
    | LCurlyBraceEscape
    | RCurlyBraceEscape
    | ExprComment           # intended?
    | xml_comment
    | pi
    | cdata_section
    | CdataSectionEnd       # I suspect a mistake. (1)
    | tagged_thing
    | braced_thing

cdata_section ::=
    CdataSectionStart Char* CdataSectionEnd

braced_thing ::=
    Lbrace braced_content* Rbrace

------------------------------------------------------------------------
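
As a rough illustration of why a grammar rule can stand in for the state stack, the sketch below recognizes a heavily simplified version of braced_thing in two ways: by recursive descent over the rule, and with an explicit stack of states. It assumes that the automaton pushes a state on Lbrace and pops one on Rbrace, and it treats every other character as plain content; both are simplifications made here, not statements about the draft's table. All function names are hypothetical.

```python
def end_of_braced_thing(text, pos=0):
    """Recursive descent for: braced_thing ::= '{' braced_content* '}'.

    Returns the index just past the closing brace, or -1 on failure.
    Any character other than a brace counts as braced_content here.
    """
    if pos >= len(text) or text[pos] != "{":
        return -1
    pos += 1
    while pos < len(text) and text[pos] != "}":
        if text[pos] == "{":
            pos = end_of_braced_thing(text, pos)  # nested braced_thing
            if pos == -1:
                return -1
        else:
            pos += 1
    if pos >= len(text):  # ran out of input before the closing brace
        return -1
    return pos + 1


def accepts_by_grammar(text):
    """True if the whole of `text` is one braced_thing."""
    return end_of_braced_thing(text) == len(text)


def accepts_by_stack(text):
    """Same language, recognized with an explicit stack of states."""
    if not text.startswith("{"):
        return False
    states = []
    for index, char in enumerate(text):
        if char == "{":
            states.append("DEFAULT")  # assumed push on Lbrace
        elif char == "}":
            states.pop()              # assumed pop on Rbrace
            if not states:            # outermost brace just closed
                return index == len(text) - 1
    return False                      # input ended with braces still open


for sample in ("{a{b}c}", "{a{b}", "{}}", "a"):
    print(repr(sample), accepts_by_grammar(sample), accepts_by_stack(sample))
```

The two recognizers agree on these samples; the recursive one mirrors the production directly, which is the readability argument made above for the full grammar.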
Received on Wednesday, 2 January 2002 02:57:00 UTC