[Bug 1617] New: how are comments really parsed?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1617

           Summary: how are comments really parsed?
           Product: XPath / XQuery / XSLT
           Version: Last Call drafts
          Platform: PC
        OS/Version: Windows 2000
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XQuery/XPath Tokenizer
        AssignedTo: scott_boag@us.ibm.com
        ReportedBy: fred.zemke@oracle.com
         QAContact: public-qt-comments@w3.org


How to parse comments is still not clear, and is open to
conflicting interpretations.  The question is whether comments
count as whitespace within a so-called "long token".

The two possible interpretations are:

1. No, comments are not permitted in long tokens.
Evidence:  "Building a tokenizer for XPath and XQuery" (4 Apr 2005)
section 1.2.1 "Token granularity" second para through the end
says:

A scanner might take one of two approaches for assigning token
units to the character stream:
. . .
[Definition: Long tokens. Using this approach, declare namespace
would be considered a single token.] In this case, a parser that
has a look-ahead of only one token can be implemented.

This passage is not definitive because there is no definition
of "token", but some readers might reasonably think that it
means that the lexer does not expect to encounter a comment
between "declare" and "namespace".  That impression is
corroborated by section 2.1.1 "XQuery lexical states", first
table "the DEFAULT state", second row, which lists the pattern
<"declare" "namespace">.  This impression is reinforced by the
fact that the tables do contain explicit provision for
supporting some comments, for example, the row later in the
DEFAULT table that contains the pattern "(:", and many other
state tables.  Thus one can argue that if the intent was to
permit comments in long tokens, there would be explicit support
for them in these tables.  Since the tables clearly do not support
comments in long tokens, there is no need for an implementation
to support them either.

2. The opposite opinion is that comments are permitted within
long tokens.  Evidence:  XQuery language spec (4 April 2005),
section 2.6 "Comments", says
"A comment may be used anywhere ignorable whitespace
is allowed."  The hot link for "ignorable whitespace" takes
you to A.2.2 "Whitespace rules", which says:

[Definition: Unless otherwise specified (see A.2.2.2 Explicit
Whitespace Handling), Ignorable whitespace may occur
between terminals, and is not significant to the parse tree.
For readability, whitespace may be used in most expressions
even though not explicitly notated in the EBNF. All allowable
whitespace that is not explicitly specified in the EBNF is ignorable
whitespace, and converse, this term does not apply to whitespace
that is explicitly specified. ]  ... Comments may also act as
"whitespace" to prevent two adjacent terminals from being
recognized as one.

The hot link for "terminal" takes you to the following definition:

[Definition: A terminal is a single unit of the grammar that can not
be further subdivided, and is specified in the EBNF by a character
or characters in quotes, or a regular expression.]

The relevant EBNF for my running example is

[10] NamespaceDecl ::= <"declare" "namespace"> ...

By the definition of terminal, "declare" and "namespace" are two
terminals.  EBNF [10] is not marked as "explicit whitespace",
therefore comments are permitted between "declare" and
"namespace".

Why this is important: this is a serious usability issue.
If users cannot put comments between terminals in long tokens,
they will need to be careful where they put comments in their
XQuery expressions.  They will probably need a reference card,
since there is such an extensive list of long tokens.  In
addition, they will be prevented from placing comments in some
of the most natural places.

Aggravating the situation, some vendors will permit comments
in long tokens, even if XQuery does not.  This will lead their
users to write nonportable XQuery expressions, which will cause
syntax errors when supposedly debugged applications are migrated,
or simply deployed into a heterogeneous environment.

Proposed solution: comments are permitted within long tokens.
The definition of "long token" in section 1.2.1 should be 
enhanced with a statement that comments are permitted between
the "subtoken"s of a long token, such as "declare" and "namespace".
In addition, the lexical state tables in section 2.1.1 should be
enhanced to handle comments in long tokens.  An idea
for doing this is to define a pattern for ignorable whitespace,
in the same fashion that the tables presume a pattern called
QName.  Let us call this pattern IW.  Given such a pattern, then
the actual long token is <"declare" IW "namespace">.  

Note that if we have a pattern for ignorable whitespace, then 
the current rows for (: and (# do not belong in the tables,
since comments and pragmas are now handled by the IW pattern.

Since IW is not a regular expression, owing to the ability to
nest comments, the specification should also give the reader
guidance on how to recognize IW.  IW can be recognized by a 
stack machine, so the current set of rules for handling (:
and (# could be placed in an entirely new set of tables,
which describe only IW.  Note that this idea implies that the 
complete lexer is running a low-level stack automaton to
detect IW, and then a high-level stack automaton as described 
in section 2.1.1.  The rules for IW should be in a separate 
section from 2.1.1 to make clear that they form a preliminary
stage to the lexer, before the final stage.
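
To make the two-stage idea concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions, not text proposed
for the specification: the names skip_iw and match_long_token are
hypothetical, IW is taken to be whitespace plus (: ... :) comments
only (pragmas are left out), a depth counter stands in for the
low-level stack, and the word-boundary test is a simplified stand-in
for the real NCName rules.

```python
WHITESPACE = " \t\r\n"
NAME_EXTRA = "_-."  # assumption: simplified stand-in for NCName characters


def skip_iw(text, pos):
    """Low-level stage: return the index just past any run of whitespace
    and (possibly nested) comments that starts at pos."""
    depth = 0  # nesting depth; plays the role of the low-level stack
    i = pos
    while i < len(text):
        if text.startswith("(:", i):
            depth += 1
            i += 2
        elif depth > 0 and text.startswith(":)", i):
            depth -= 1
            i += 2
        elif depth > 0 or text[i] in WHITESPACE:
            i += 1  # comment content or plain whitespace
        else:
            break
    if depth > 0:
        raise ValueError("unterminated comment")
    return i


def ends_word(text, i):
    """True if index i is not in the middle of a name (simplified)."""
    return i >= len(text) or not (text[i].isalnum() or text[i] in NAME_EXTRA)


def match_long_token(text, pos, words):
    """High-level stage: match words[0] IW words[1] IW ... starting at pos.
    Return the index just past the last word, or None if there is no match."""
    i = pos
    for k, word in enumerate(words):
        if k > 0:
            i = skip_iw(text, i)
        if not (text.startswith(word, i) and ends_word(text, i + len(word))):
            return None
        i += len(word)
    return i


query = 'declare (: why (: nested :) here :) namespace p = "urn:x";'
end = match_long_token(query, 0, ("declare", "namespace"))
print(query[:end])
```

With this split, the high-level tables never see a comment: the
pattern <"declare" IW "namespace"> simply calls the IW recognizer
between the two words.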

Alternatively, if the two-stack design is not agreeable, 
then a single stack can be used, at the cost of a lot more
states.  For example, the pattern <"declare" "(:"> needs to
enter a state that looks for the matching :) after which
it can pop and continue looking for the word to come after
"declare".  If it does not find an appropriate word, then
it can rewind the scan and decide that "declare" was not
a keyword after all.  You need a separate state for every
juncture at which a comment might appear, so that you can keep
track of how much of a long token has already been recognized.
Personally, I think the number of states would be prohibitive.
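
For comparison, the single-stack alternative might be sketched as
follows, again in Python and again only as an illustration.  The
function name scan_long_token and the state names are hypothetical,
pragmas are ignored, and the word-boundary test is simplified.  Each
("EXPECT", k) state records that k words of the long token have been
seen, which is the "separate state for every juncture" mentioned
above; a real table would need one such family of states per long
token.

```python
WHITESPACE = " \t\r\n"
NAME_EXTRA = "_-."  # assumption: simplified stand-in for NCName characters


def ends_word(text, i):
    """True if index i is not in the middle of a name (simplified)."""
    return i >= len(text) or not (text[i].isalnum() or text[i] in NAME_EXTRA)


def scan_long_token(text, pos, words):
    """Return ("LONG", end) if the whole long token is found,
    ("NAME", end_of_first_word) if the scan had to rewind, or
    (None, pos) if even the first word is absent."""
    stack = [("EXPECT", 0)]  # one stack holds both kinds of state
    i = pos
    first_end = None  # rewind point: just past the first word
    while True:
        state = stack[-1]
        if state == "COMMENT":
            if i >= len(text):
                raise ValueError("unterminated comment")
            if text.startswith("(:", i):
                stack.append("COMMENT")  # nested comment
                i += 2
            elif text.startswith(":)", i):
                stack.pop()  # back to the enclosing state
                i += 2
            else:
                i += 1
            continue
        k = state[1]
        word = words[k]
        if k > 0 and text.startswith("(:", i):
            stack.append("COMMENT")
            i += 2
        elif k > 0 and i < len(text) and text[i] in WHITESPACE:
            i += 1
        elif text.startswith(word, i) and ends_word(text, i + len(word)):
            i += len(word)
            if k == 0:
                first_end = i
            if k + 1 == len(words):
                return ("LONG", i)
            stack[-1] = ("EXPECT", k + 1)  # next juncture
        elif first_end is None:
            return (None, pos)
        else:
            # Rewind: the first word was not a keyword after all.
            return ("NAME", first_end)


print(scan_long_token("declare (: c :) namespace p", 0, ("declare", "namespace")))
print(scan_long_token("declare (: c :) foo", 0, ("declare", "namespace")))
```

Even in this toy form, the comment handling is entangled with the
bookkeeping for how much of the long token has been matched, which is
what makes the state count grow.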

Received on Friday, 15 July 2005 00:50:21 UTC