- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 18 Jun 2001 14:26:54 +0900
- To: www-xml-schema-comments@w3.org
- Cc: Paul.V.Biron@kp.org, ashokma@microsoft.com, cmsmcq@w3.org, ht@cogsci.ed.ac.uk (Henry S. Thompson), dc@w3.org, w3c-i18n-ig@w3.org
[To the I18N IG participants:
Please note that the www-xml-schema-comments@w3.org list is publicly archived.]
Appendix F, Regular Expressions, of Schema part 2
(http://www.w3.org/TR/xmlschema-2/#regexs)
contains the following rule:
[19] XmlCharRef ::= ( '&#' [0-9]+ ';' ) | (' &#x' [0-9a-fA-F]+ ';' )
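For illustration only, production [19] can be transcribed as a Python regular expression. This is a rough sketch, not part of the spec: the names XML_CHAR_REF and is_xml_char_ref are hypothetical, Python's re module merely stands in for the schema regex language, and the literal is assumed to be '&#x', with the space inside the quotes treated as a typographical slip.

```python
import re

# Production [19] as a Python pattern: decimal form or hexadecimal form.
# Assumption: the hex literal is '&#x' with no leading space.
XML_CHAR_REF = re.compile(r"&#[0-9]+;|&#x[0-9a-fA-F]+;")


def is_xml_char_ref(text: str) -> bool:
    """Return True if the whole string has the shape of an XmlCharRef."""
    return XML_CHAR_REF.fullmatch(text) is not None
```

With this sketch, "&#42;" and "&#x2A;" are both accepted, while a bare "*" is not.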
At first, this looks very obvious: it's very convenient to be able
to use numeric character references in regular expressions. But the
more one thinks about it, the more confusing it gets. There are
various questions that the spec doesn't answer clearly:
- Are the regular expressions in App F defined on the XML Infoset or
on XML syntax? Given the text in the section about the pattern
facet, and the overall structure of the schema spec, assuming the
former would be straightforward, but the infoset doesn't contain
any numeric character references.
- If the regular expression grammar is based on the infoset, does
this mean that it's actually possible to write (in the XML source)
something like &amp;#x531; to get
&#x531; (Armenian capital letter ayb, Ա)?
- Why is it possible to use an 'XmlCharRef' as a single character
in a character range, but not as the start or end of an actual
character range?
- Is it forbidden to use numeric character references for syntactically
relevant characters (e.g. metacharacters) in patterns? E.g. would
"a&#x2A;" in the XML source be illegal? If yes, how could a schema
processor built on the infoset check this? Would it mean the same thing
as "a*"? Or would it mean the same thing as "a\*"? Again, how would
a schema processor built on the infoset do that?
- Would it be possible to use "a&amp;#x2A;" in the XML source instead
of "a\*"? In the infoset, this would give "a&#x2A;", which according
to the grammar would be the same as "a\*". Is this necessary?
One can continue to construct more such examples and questions,
but it only gets more and more confusing.
I think that what makes most sense, and was most probably intended,
was that the conversion from the xml source to the infoset takes
care of numeric character references, and that the pattern in the
infoset doesn't contain any numeric character references anymore,
and is just a sequence of Unicode characters. The only characters
still escaped in that sequence are the regexp metacharacters.
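To make the resolution step concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The un-namespaced <pattern value="..."/> element and the helper name pattern_value are simplifications chosen for the example, not the real schema syntax. It only shows what an application built on parsed XML gets to see.

```python
import xml.etree.ElementTree as ET


def pattern_value(xml_source: str) -> str:
    """Return the 'value' attribute as seen after XML parsing."""
    return ET.fromstring(xml_source).get("value")


# The parser resolves the character reference; the application sees '*'.
print(pattern_value('<pattern value="a&#x2A;"/>'))      # a*

# An escaped ampersand survives as literal text; it is not resolved again.
print(pattern_value('<pattern value="a&amp;#x2A;"/>'))  # a&#x2A;
```

In the first case no trace of the reference is left by the time a regular expression parser would run, so a processor working on the parsed data cannot tell "a&#x2A;" from "a*".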
In order to make this clear, an erratum including the following
changes should be issued:
- Remove production [19] from Appendix F.
- In production [17], remove the term 'XmlCharRef' and one or-bar ('|').
- In order to make things clear, a note may be added just after
the definition of 'Single Character Escape', as follows:
Note: The regular expression syntax defined here does not include
syntax for escaping arbitrary Unicode characters. In the XML
representation of a pattern, XML numeric character references
can be used to denote arbitrary Unicode characters. However,
please note that these can be used for all characters in the
regular expression, including metacharacters, because they
are resolved before parsing the regular expression. See the
examples in the following table:
XML representation    Infoset    Meaning
a&#x2A;               a*         arbitrary number of 'a'
a\&#x2A;              a\*        'a', followed by '*'
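The effect described in the note, that references are resolved before the regular expression is parsed, can be checked with a small sketch. Assumptions: Python's re module stands in for the schema regex language (re.fullmatch approximates the implicit anchoring of patterns), and the <pattern value="..."/> wrapper and the helper name matches are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def matches(xml_value: str, candidate: str) -> bool:
    """Check candidate against a pattern given in its XML representation."""
    # Step 1: the XML parser resolves numeric character references.
    pattern = ET.fromstring(f'<pattern value="{xml_value}"/>').get("value")
    # Step 2: the regex engine only ever sees the resolved characters.
    return re.fullmatch(pattern, candidate) is not None
```

Under these assumptions, matches("a&#x2A;", "aaa") holds because the resolved '*' acts as a quantifier, whereas matches("a\\&#x2A;", "a*") holds because the backslash escapes the resolved '*'.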
Regards, Martin.
Received on Monday, 18 June 2001 01:27:10 UTC