- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 18 Jun 2001 14:26:54 +0900
- To: www-xml-schema-comments@w3.org
- Cc: Paul.V.Biron@kp.org, ashokma@microsoft.com, cmsmcq@w3.org, ht@cogsci.ed.ac.uk (Henry S. Thompson), dc@w3.org, w3c-i18n-ig@w3.org
[To the I18N IG participants: Please note that the www-xml-schema-comments@w3.org list is publicly archived.]

Appendix F, Regular Expressions, of Schema Part 2 (http://www.w3.org/TR/xmlschema-2/#regexs) contains the following rule:

    [19] XmlCharRef ::= ( '&#' [0-9]+ ';' ) | (' &#x' [0-9a-fA-F]+ ';' )

At first, this looks very obvious: it is very convenient to be able to use numeric character references in regular expressions. But the more one thinks about it, the more confusing it gets. There are various questions that the spec doesn't answer clearly:

- Are the regular expressions in App F defined on the XML Infoset or on XML syntax? Given the text in the section about the pattern facet, and the overall structure of the schema spec, assuming the former would be straightforward, but the infoset doesn't contain any numeric character references.

- If the regular expression grammar is based on the infoset, does this mean that it's actually possible to write (in the XML source) something like &#x531; to get Ա (Armenian capital letter ayb)?

- Why is it possible to use an 'XmlCharRef' as a single character in a character range, but not as the start or end of an actual character range?

- Is it forbidden to use numeric character references for syntactically relevant characters (e.g. metacharacters) in patterns? E.g. would "a&#x2A;" in the XML source be illegal? If yes, how could a schema processor built on the infoset check this? Would it mean the same thing as "a*"? Or would it mean the same thing as "a\*"? Again, how would a schema processor built on the infoset do that?

- Would it be possible to use "a&#x2A;" in the XML source instead of "a\*"? In the infoset, this would give "a*", which according to the grammar would be the same as "a\*". Is this necessary?

One can continue to construct more such examples and questions, but it only gets more and more confusing.
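To make the infoset question concrete, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree as a stand-in for an infoset-producing XML parser; the helper name pattern_value and the un-namespaced <pattern value="..."/> element are illustrative assumptions, not part of the Schema spec. It only shows what string a processor built on top of a parser gets to see, not how any schema processor actually works.

```python
import xml.etree.ElementTree as ET


def pattern_value(xml_source: str) -> str:
    """Return the 'value' attribute as seen by the application after parsing.

    Hypothetical helper: the element and attribute names mimic the pattern
    facet, but no XML Schema processing is done here.
    """
    return ET.fromstring(xml_source).attrib["value"]


# The same pattern written literally and with a numeric character reference.
literal = pattern_value('<pattern value="a*"/>')
char_ref = pattern_value('<pattern value="a&#x2A;"/>')

# A backslash escape is ordinary character data to the XML parser.
escaped = pattern_value(r'<pattern value="a\*"/>')

# Only a doubly escaped reference survives parsing as the characters '&#x2A;'.
double_escaped = pattern_value('<pattern value="a&amp;#x2A;"/>')

# A reference to a non-ASCII character is resolved in the same way.
armenian = pattern_value('<pattern value="&#x531;"/>')

print(literal == char_ref)      # the parser has already resolved the reference
print(repr(escaped))            # backslash and asterisk are both still present
print(repr(double_escaped))     # the text '&#x2A;' reaches the application
print(armenian == "\u0531")     # ARMENIAN CAPITAL LETTER AYB
```

After parsing, "a*" and "a&#x2A;" are the same two-character string, so code working on the parsed value cannot tell which form the author wrote. The characters of a numeric character reference reach the regular expression grammar only if the ampersand itself is escaped in the XML source.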
I think that what makes most sense, and was most probably intended, is that the conversion from the XML source to the infoset takes care of numeric character references, so that the pattern in the infoset no longer contains any numeric character references and is just a sequence of Unicode characters. The only characters still escaped in that sequence are the regexp metacharacters.

In order to make this clear, an erratum including the following changes should be issued:

- Remove production [19] from Appendix F.

- In production [17], remove the term 'XmlCharRef' and an or bar.

- In order to make things clear, a note may be added just after the definition of 'Single Character Escape', as follows:

    Note: The regular expression syntax defined here does not include syntax for escaping arbitrary Unicode characters. In the XML representation of a pattern, XML numeric character references can be used to denote arbitrary Unicode characters. However, please note that these can be used for all characters in the regular expression, including metacharacters, because they are resolved before parsing the regular expression. See the examples in the following table:

    XML representation   Infoset   Meaning
    a*                   a*        arbitrary number of 'a'
    a\*                  a\*       'a', followed by '*'

Regards, Martin.
Received on Monday, 18 June 2001 01:27:10 UTC