- From: <bugzilla@wiggum.w3.org>
- Date: Fri, 15 Jul 2005 00:50:15 +0000
- To: public-qt-comments@w3.org
- Cc:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1617

Summary: how are comments really parsed?
Product: XPath / XQuery / XSLT
Version: Last Call drafts
Platform: PC
OS/Version: Windows 2000
Status: NEW
Severity: normal
Priority: P2
Component: XQuery/XPath Tokenizer
AssignedTo: scott_boag@us.ibm.com
ReportedBy: fred.zemke@oracle.com
QAContact: public-qt-comments@w3.org

How to parse comments is still not clear, and is open to conflicting interpretations. The question is whether comments count as whitespace within a so-called "long token". The two possible interpretations are:

1. No, comments are not permitted in long tokens.
Evidence: "Building a tokenizer for XPath and XQuery" (4 Apr 2005) section 1.2.1 "Token granularity" second para through the end says:

  A scanner might take one of two approaches for assigning token units to the character stream:
  . . .
  [Definition: Long tokens. Using this approach, declare namespace would be considered a single token.] In this case, a parser that has a look-ahead of only one token can be implemented.

This passage is not definitive because there is no definition of "token", but some readers might reasonably think that it means that the lexer does not expect to encounter a comment between "declare" and "namespace". That impression is corroborated by section 2.1.1 "XQuery lexical states", first table "the DEFAULT state", second row, which lists the pattern <"declare" "namespace">. This impression is reinforced by the fact that the tables do contain explicit provision for supporting some comments, for example, the row later in the DEFAULT table that contains the pattern "(:", and many other state tables. Thus one can argue that if the intent was to permit comments in long tokens, there would be explicit support for them in these tables. Since the tables clearly do not support comments in long tokens, there is no need for an implementation to support them either.

2. The opposite opinion is that comments are permitted within long tokens.
Evidence: XQuery language spec (4 April 2005), section 2.6 "Comments", says "A comment may be used anywhere ignorable whitespace is allowed." The hot link for "ignorable whitespace" takes you to A.2.2 "Whitespace rules", which says:

  [Definition: Unless otherwise specified (see A.2.2.2 Explicit Whitespace Handling), Ignorable whitespace may occur between terminals, and is not significant to the parse tree. For readability, whitespace may be used in most expressions even though not explicitly notated in the EBNF. All allowable whitespace that is not explicitly specified in the EBNF is ignorable whitespace, and converse, this term does not apply to whitespace that is explicitly specified. ]
  ...
  Comments may also act as "whitespace" to prevent two adjacent terminals from being recognized as one.

The hot link for "terminal" takes you to the following definition:

  [Definition: A terminal is a single unit of the grammar that can not be further subdivided, and is specified in the EBNF by a character or characters in quotes, or a regular expression.]

The relevant EBNF for my running example is

  [10] NamespaceDecl ::= <"declare" "namespace"> ...

By the definition of terminal, "declare" and "namespace" are two terminals. EBNF [10] is not marked as "explicit whitespace", therefore comments are permitted between "declare" and "namespace".

Why this is important: this is a serious usability issue. If users cannot put comments between terminals in long tokens, they will need to be careful where they put comments in their XQuery expressions. They will probably need a reference card, since there is such an extensive list of long tokens. In addition, they will be prevented from placing comments in some of the most natural places. Aggravating the situation, some vendors will permit comments in long tokens, even if XQuery does not.
This will lead their users to write nonportable XQuery expressions, which will cause syntax errors when supposedly debugged applications are migrated, or simply deployed into a heterogeneous environment.

Proposed solution: comments are permitted within long tokens. The definition of "long token" in section 1.2.1 should be enhanced with a statement that comments are permitted between the "subtoken"s of a long token, such as "declare" and "namespace". In addition, the lexical state tables in section 2.1.1 should be enhanced to handle comments in long tokens.

An idea for doing this is to define a pattern for ignorable whitespace, in the same fashion that the tables presume a pattern called QName. Let us call this pattern IW. Given such a pattern, the actual long token is <"declare" IW "namespace">. Note that if we have a pattern for ignorable whitespace, then the current rows for (: and (# do not belong in the tables, since comments and pragmas are now handled by the IW pattern.

Since IW is not a regular expression, owing to the ability to nest comments, the specification should also give the reader guidance on how to recognize IW. IW can be recognized by a stack machine, so the current set of rules for handling (: and (# could be placed in an entirely new set of tables, which describe only IW. Note that this idea implies that the complete lexer is running a low-level stack automaton to detect IW, and then a high-level stack automaton as described in section 2.1.1. The rules for IW should be in a separate section from 2.1.1 to make clear that they form a preliminary stage to the lexer, before the final stage.

Alternatively, if the two-stack design is not agreeable, then a single stack can be used, at the cost of a lot more states. For example, the pattern <"declare" "(:"> needs to enter a state that looks for the matching :) after which it can pop and continue looking for the word to come after "declare".
If it does not find an appropriate word, then it can rewind the scan and decide that "declare" was not a keyword after all. You need a separate state for every juncture at which a comment might appear, so that you can keep track of how much of a long token has already been recognized. Personally, I think the number of states would be prohibitive.
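
To make the two-stage idea concrete, here is a minimal sketch in Python. It is not taken from either specification. The function names skip_iw and match_long_token are hypothetical, the IW recognizer handles only whitespace characters and nested (: ... :) comments (pragmas are left out), and a depth counter stands in for the low-level stack, which is enough because only one kind of bracket is being nested. No check is made that the last word ends at a name boundary.

```python
def skip_iw(text, pos):
    """Return the index just past any ignorable whitespace starting at pos.

    Assumption for this sketch: IW is a run of whitespace characters and
    (possibly nested) "(: ... :)" comments. Pragmas are not handled.
    """
    n = len(text)
    while pos < n:
        if text[pos] in " \t\r\n":
            pos += 1
        elif text.startswith("(:", pos):
            depth = 1          # counter acting as the low-level stack
            pos += 2
            while depth > 0:
                if pos >= n:
                    raise ValueError("unterminated comment")
                if text.startswith("(:", pos):
                    depth += 1  # nested comment opens
                    pos += 2
                elif text.startswith(":)", pos):
                    depth -= 1  # innermost open comment closes
                    pos += 2
                else:
                    pos += 1
        else:
            break
    return pos


def match_long_token(text, pos, words):
    """Match words separated by non-empty IW, e.g. <"declare" IW "namespace">.

    Returns the index just past the long token, or None if it does not
    match. Returning None leaves the caller at its original position,
    which corresponds to rewinding the scan.
    """
    for i, word in enumerate(words):
        if i > 0:
            after = skip_iw(text, pos)
            if after == pos:
                return None    # the words must be separated by some IW
            pos = after
        if not text.startswith(word, pos):
            return None
        pos += len(word)
    return pos
```

With these definitions, match_long_token("declare (: why :) namespace foo", 0, ["declare", "namespace"]) returns the index just past "namespace", while the same call on "declare function" returns None, so the caller can go on to treat "declare" as an ordinary name.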
Received on Friday, 15 July 2005 00:50:21 UTC