- From: <bugzilla@wiggum.w3.org>
- Date: Thu, 02 Jun 2005 00:29:25 +0000
- To: public-qt-comments@w3.org
- Cc:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1373

------- Additional Comments From cmsmcq@w3.org 2005-06-02 00:29 -------

Scott Boag writes:

    I'm curious as to why, in the XML spec, there is:

    [22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?

    vs.

    [24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')

    Section 6 states "Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lowercase letter." But it seems like a fuzzy line.

The XML WG may have made errors in drawing the line, but whether a particular language over the alphabet of Unicode characters is regular or not, in the technical sense, should not be fuzzy. The language defined by a non-terminal in a regular right-part grammar is regular if and only if non-terminals on the right-hand side can be replaced (iteratively) until there is nothing there but terminal symbols (in this case, Unicode characters or expressions like [a-zA-Z]). This, in turn, is the case if there is no recursion in the grammar rules (no left-hand symbol turns up directly or indirectly in its own right-hand side). If all the symbols in a right-hand side are known to denote regular languages, then the symbol on the left-hand side also denotes a regular language; if any symbol on the right denotes a non-regular language, then the language of the left-hand side symbol is non-regular.

Consider the examples above. The language defined by using the XML 1.0 grammar with 'doctypedecl' as start symbol (I'll just call this 'the language of doctypedecl' or 'the language denoted by doctypedecl' in what follows) is not regular, because a doctypedecl can contain an internal subset (intSubset), which can contain element declarations (via markupdecl and elementdecl), which can contain content models for elements with element content (via contentspec and children).
Such content models are not regular, because they require that opening and closing parentheses match; there is indirect recursion in both choice and seq, through cp. (Content models for mixed content are regular because they can't have nested groups.) So 'doctypedecl' is spelled with an initial lowercase letter.

'VersionInfo', by contrast, has an initial uppercase letter because it denotes a regular language: it can be written

    [24] VersionInfo ::= (#x20 | #x9 | #xD | #xA)+ 'version' (#x20 | #x9 | #xD | #xA)+? '=' (#x20 | #x9 | #xD | #xA)+? ("'" '1.0' "'" | '"' '1.0' '"')

which has no non-terminals on the right-hand side.

That may not be the 'why' you had in mind, though. The distinction between regular and non-regular non-terminals was the result of a compromise. Someone (I'll leave the protagonists anonymous) proposed that it would be easier to see how to write an XML parser if we distinguished the lexical level and the grammar level explicitly, so that interested parties could see at a glance where one might plausibly draw the line between a lexer and a parser. Even if an implementor later decided to move that line, it would be convenient to have an initial suggestion. Someone else objected that different implementors might choose to draw the lexer/parser line in different places, and that trying to prescribe it, or even making a specific suggestion, was a waste of time.

In the end, we agreed to distinguish regular from non-regular sublanguages, on the theory that conventional lexers typically recognize only regular languages. The initial capital letter effectively says "If you want to, you can conveniently treat this non-terminal as a terminal symbol recognized by the lexer"; perhaps even more important, the initial lowercase letter says "If you were thinking of treating this as a terminal symbol, using a conventional lexer design, then forget it".
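
For anyone who wants to try the substitution criterion described above on a small scale, here is a minimal Python sketch. It is only an illustration, not anything from the XML spec or the WG discussion. The names RULES, REFERENCE, RecursiveRule, expand and passes_criterion are made up for the example; VersionNum is fixed at '1.0' as in the expansion above; and Name, cp, choice and seq are simplified stand-ins for the spec's productions, kept only to show the indirect recursion through cp. The sketch applies the test exactly as stated in this message (no recursion, so the non-terminals can be substituted away), nothing more general.

```python
import re

# Hypothetical toy grammar: each right-hand side is a regular-expression
# template in which {Name} stands for a reference to another non-terminal.
# VersionNum is fixed at '1.0'; Name and the content-model rules are
# simplified for illustration, not the spec's exact productions.
RULES = {
    "S": r"[\x20\x09\x0D\x0A]+",
    "Eq": r"{S}?={S}?",
    "VersionNum": r"1\.0",
    "VersionInfo": r"""{S}version{Eq}(?:'{VersionNum}'|"{VersionNum}")""",
    "Name": r"[A-Za-z_:][A-Za-z0-9._:-]*",
    "cp": r"(?:{Name}|{choice}|{seq})[?*+]?",
    "choice": r"\({S}?{cp}(?:{S}?\|{S}?{cp})+{S}?\)",
    "seq": r"\({S}?{cp}(?:{S}?,{S}?{cp})*{S}?\)",
}

# Matches a {Name} reference inside a rule template.
REFERENCE = re.compile(r"\{(\w+)\}")


class RecursiveRule(Exception):
    """Raised when a non-terminal turns up in its own expansion."""


def expand(symbol, active=()):
    """Replace non-terminal references until only terminals remain."""
    if symbol in active:
        # The symbol is already being expanded further up: recursion.
        raise RecursiveRule(" -> ".join(active + (symbol,)))

    def substitute(match):
        # Group each expansion so a following ?, * or + applies to all of it.
        return "(?:" + expand(match.group(1), active + (symbol,)) + ")"

    return REFERENCE.sub(substitute, RULES[symbol])


def passes_criterion(symbol):
    """True if the symbol can be expanded to terminals only (no recursion)."""
    try:
        expand(symbol)
        return True
    except RecursiveRule:
        return False


if __name__ == "__main__":
    version_info = re.compile(expand("VersionInfo"))
    print(bool(version_info.fullmatch(" version='1.0'")))
    print(passes_criterion("VersionInfo"), passes_criterion("cp"))
```

With these toy rules, expand("VersionInfo") yields a single pattern with no non-terminal references left in it, which is the sense in which a lexer could treat VersionInfo as one token; expand("cp") raises RecursiveRule because cp reaches itself through choice and seq.
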
I gather that when XPath 1.0 was done, the XSL WG had no one who thought that this was a worthwhile way to help implementors or readers. Myself, I find it helpful but not essential.
Received on Thursday, 2 June 2005 00:29:25 UTC