- From: Stephen Buxton <Stephen.Buxton@oracle.com>
- Date: 16 Feb 04 14:22:16
- To: public-qt-comments@w3.org
- Cc:
SECTION A.2.2: Lexical rules

Today's lexical analyzer is a 'two-stage' analyzer. The bottom stage, not explicitly mentioned in the Appendix, I will call the raw tokenizer. This stage is responsible for detecting things like NCName and NumericLiteral. The stage above it is responsible for discerning when an NCName is a keyword, and for handling comments, pragmas and must-understand extensions.

The design goals that make lexical analysis for XQuery difficult are: no reserved words; nested comments; and the context-sensitivity inherent in supporting direct constructors as a sublanguage with different whitespace and comment rules from those of the containing language.

In a lexical analyzer for a language with reserved words, the keywords can be detected in the raw tokenizer stage. Frequently the raw tokenizer also detects and ignores comments. For such a language, a single stage, the raw tokenizer, is sufficient.

In languages that support only unnested comments, it is possible to recognize comments with regular expressions. The usual way to recognize a regular expression is with a finite state automaton. XQuery has opted to support nested comments, which means that comments are not a regular language; instead they constitute a 'context-free' language. The usual way to recognize a context-free language is by adding a stack to the finite state automaton.

The current design of the lexical analyzer is a raw tokenizer that recognizes tokens defined as regular expressions. Since the raw tokenizer is not powerful enough to handle nested comments, comment handling has been pushed into a stage above the raw tokenizer, where there is a stack. This stage has also been given the responsibility for deciding when an NCName is a keyword. However, these two responsibilities are not easily merged in a single stage. The solution propounded so far has been to prohibit comments in those contexts that are needed to recognize certain keywords.
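To make the regular-vs-context-free distinction concrete, here is a minimal sketch (in Python, not part of any actual XQuery implementation) of recognizing a nested XQuery comment. The depth counter plays the role of the stack: a finite automaton alone cannot count nesting levels, which is exactly why nested comments cannot live in the raw tokenizer.

```python
def skip_nested_comment(text, pos):
    """Skip a nested XQuery comment beginning at text[pos:].

    Assumes text[pos:pos+2] == '(:'. The depth counter stands in
    for the stack that a plain finite state automaton lacks.
    Returns the index just past the matching ':)'.
    """
    assert text[pos:pos + 2] == '(:'
    depth = 0
    i = pos
    while i < len(text):
        if text[i:i + 2] == '(:':    # comment opener: push
            depth += 1
            i += 2
        elif text[i:i + 2] == ':)':  # comment closer: pop
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated comment")
```

For example, `skip_nested_comment("(: outer (: inner :) :) 1+1", 0)` correctly skips past the outer `:)`, treating the inner comment as nested rather than as the end of the comment.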
However, prohibiting comments between certain pairs of keywords is a major usability disservice. I think the solution is that the keyword recognizer needs to sit at a higher stage than the comment recognizer. There are two ways to do this:

1. Abandon nested comment support. Most high-level languages do not support nested comments, so there is ample precedent, and users are accustomed to this restriction. In addition, if it came to a choice between nested comments and the freedom to put comments anywhere between keywords, I would gladly sacrifice the nested comments, and I think most users would too. Making this decision would mean that comments would be regular expressions, and could be recognized and removed in the first stage, the raw tokenizer. It would also simplify the syntax and analysis of comment-like things (pragmas and must-understand extensions). Overall, the decision would be that there is no nesting of comments, pragmas or must-understand extensions in one another.

2. If you really have to have nested comments, then you should go to a three-stage lexical analyzer. The bottom stage would be a raw tokenizer, which would detect (: (:: :) and ::) as tokens. The second stage would run the stack to determine the boundaries of comments, pragmas and must-understand extensions. Finally, the top stage would recognize keywords.

- Steve B.
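A hypothetical sketch of what option 1 buys: once nesting is abandoned, a comment is a regular expression, so the raw tokenizer can strip comments before any keyword recognition happens. The same regex also demonstrates why this approach breaks down for nested comments.

```python
import re

# Unnested comments are a regular language: a non-greedy match from
# '(:' to the first ':)'. re.DOTALL lets comments span lines.
COMMENT = re.compile(r'\(:.*?:\)', re.DOTALL)

def strip_comments(query):
    """Replace each (unnested) comment with a space, as a raw
    tokenizer could do in a single pass."""
    return COMMENT.sub(' ', query)

# Works for unnested comments; but given nested input such as
# "(: a (: b :) :)", the regex stops at the first ':)' and leaves
# a dangling ':)' behind -- the failure that forces comment handling
# into a higher, stack-bearing stage.
```

This is the trade-off in miniature: the regex version is a single stage with no stack, at the cost of the nesting feature.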
Received on Monday, 16 February 2004 17:22:18 UTC