- From: Stephen Buxton <Stephen.Buxton@oracle.com>
- Date: 16 Feb 04 14:22:16
- To: public-qt-comments@w3.org
- Cc:
SECTION A.2.2: Lexical rules

Today's lexical analyzer is a 'two-stage' analyzer. The bottom stage, not explicitly mentioned in the Appendix, I will call the raw tokenizer. This stage is responsible for detecting things like NCName and NumericLiteral. The stage above it is responsible for discerning when an NCName is a keyword, and for handling comments, pragmas and must-understand extensions.

The design goals that make lexical analysis for XQuery difficult are: no reserved words; nested comments; and the context-sensitivity inherent in supporting direct constructors as a sublanguage with different whitespace and comment rules from those of the containing language.

In a lexical analyzer for a language with reserved words, the keywords can be detected in the raw tokenizer stage. Frequently the raw tokenizer also detects and ignores comments. For such a language, a single stage, the raw tokenizer, is sufficient.

In languages that support only unnested comments, it is possible to recognize comments with regular expressions. The usual way to recognize a regular expression is with a finite state automaton. XQuery has opted to support nested comments, which means that comments are not a regular language; instead they constitute a 'context-free' language. The usual way to recognize a context-free language is by adding a stack to the finite state automaton.

The current design of the lexical analyzer is a raw tokenizer that recognizes tokens defined as regular expressions. Since the raw tokenizer is not powerful enough to handle nested comments, comment handling has been pushed into a stage above the raw tokenizer, where there is a stack. This stage has also been given the responsibility for deciding when an NCName is a keyword. However, these two responsibilities are not easily merged in a single stage. The solution propounded so far has been to prohibit comments in those contexts that are needed to recognize certain keywords.
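To make the regular-vs-context-free distinction concrete, here is a minimal sketch (in Python, not part of any actual XQuery implementation) of recognizing a nested XQuery comment. The depth counter plays the role of the stack: a finite automaton alone cannot count nesting levels, which is exactly why nested comments cannot live in the raw tokenizer.

```python
def skip_nested_comment(text, pos):
    """Skip a nested XQuery comment beginning at text[pos:].

    Assumes text[pos:pos+2] == '(:'. The depth counter stands in
    for the stack that a plain finite state automaton lacks.
    Returns the index just past the matching ':)'.
    """
    assert text[pos:pos + 2] == '(:'
    depth = 0
    i = pos
    while i < len(text):
        if text[i:i + 2] == '(:':    # comment opener: push
            depth += 1
            i += 2
        elif text[i:i + 2] == ':)':  # comment closer: pop
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated comment")
```

For example, `skip_nested_comment("(: outer (: inner :) :) 1+1", 0)` correctly skips past the outer `:)`, treating the inner comment as nested rather than as the end of the comment.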
However, prohibiting comments between certain pairs of keywords is a major usability disservice. I think the solution is that the keyword recognizer needs to sit at a higher stage than the comment recognizer. There are two ways to do this:

1. Abandon nested comment support. Most high-level languages do not support nested comments, so there is ample precedent, and users are accustomed to this restriction. In addition, if it came to a choice between nested comments and the freedom to put comments anywhere between keywords, I would gladly sacrifice the nested comments, and I think most users would too. Making this decision would mean that comments would be regular expressions, and could be recognized and removed in the first stage, the raw tokenizer. It would also simplify the syntax and analysis of comment-like things (pragmas and must-understand extensions). Overall, the decision would be that there is no nesting of comments, pragmas or must-understand extensions in one another.

2. If you really have to have nested comments, then you should go to a three-stage lexical analyzer. The bottom stage would be a raw tokenizer, which would detect (: (:: :) and ::) as tokens. The second stage would run the stack to determine the boundaries of comments, pragmas and must-understand extensions. Finally, the top stage would recognize keywords.

- Steve B.
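A hypothetical sketch of what option 1 buys: once nesting is abandoned, a comment is a regular expression, so the raw tokenizer can strip comments before any keyword recognition happens. The same regex also demonstrates why this approach breaks down for nested comments.

```python
import re

# Unnested comments are a regular language: a non-greedy match from
# '(:' to the first ':)'. re.DOTALL lets comments span lines.
COMMENT = re.compile(r'\(:.*?:\)', re.DOTALL)

def strip_comments(query):
    """Replace each (unnested) comment with a space, as a raw
    tokenizer could do in a single pass."""
    return COMMENT.sub(' ', query)

# Works for unnested comments; but given nested input such as
# "(: a (: b :) :)", the regex stops at the first ':)' and leaves
# a dangling ':)' behind -- the failure that forces comment handling
# into a higher, stack-bearing stage.
```

This is the trade-off in miniature: the regex version is a single stage with no stack, at the cost of the nesting feature.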
Received on Monday, 16 February 2004 17:22:18 UTC