Erratum/Request for clarification on Regexp notation from Martin Duerst on 2001-06-18 (www-xml-schema-comments@w3.org from April to June 2001)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 18 Jun 2001 14:26:54 +0900
To: www-xml-schema-comments@w3.org
Cc: Paul.V.Biron@kp.org, ashokma@microsoft.com, cmsmcq@w3.org, ht@cogsci.ed.ac.uk (Henry S. Thompson), dc@w3.org, w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20010615111809.05f37cd0@sh.w3.mag.keio.ac.jp>

[To the I18N IG participants:
Please note that the www-xml-schema-comments@w3.org list is publicly archived.]

Appendix F, Regular Expressions, of Schema part 2
(http://www.w3.org/TR/xmlschema-2/#regexs)
contains the following rule:

[19]    XmlCharRef    ::=    ( '&#' [0-9]+ ';' ) | (' &#x' [0-9a-fA-F]+ ';' )
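
For reference, the strings that production [19] recognizes can be sketched as an ordinary Python regular expression. This is only an illustration, not part of the spec: the names XML_CHAR_REF and is_xml_char_ref are hypothetical, and the sketch assumes that the space inside the second quoted literal (' &#x') is not intended.

```python
import re

# Hypothetical transcription of production [19].
# Assumption: the second alternative starts with '&#x', without a leading space.
XML_CHAR_REF = re.compile(r"&#[0-9]+;|&#x[0-9a-fA-F]+;")


def is_xml_char_ref(text: str) -> bool:
    """Return True if the whole string has the shape of an XmlCharRef."""
    return XML_CHAR_REF.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["&#42;", "&#x2A;", "&#x531;", "&#2A;", "*"]:
        print(candidate, is_xml_char_ref(candidate))
```

Note that these are strings of literal '&', '#', digits and ';' characters as the regular expression grammar would see them, which is exactly where the questions below come from.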


At first, this looks very obvious: it's very convenient to be able
to use numeric character references in regular expressions. But the
more one thinks about it, the more confusing it gets. There are
various questions that the spec doesn't answer clearly:

- Are the regular expressions in App F defined on the XML Infoset or
   on XML syntax? Given the text in the section about the pattern
   facet, and the overall structure of the schema spec, assuming the
   former would be straightforward, but the infoset doesn't contain
   any numeric character references.

- If the regular expression grammar is based on the infoset, does
   this mean that it's actually possible to write (in the XML source)
   something like &amp;&#x23;&#x78;&#x35;&#x33;&#x31;&#x3B; to get
   &#x531; (Armenian capital letter ayb)?

- Why is it possible to use an 'XmlCharRef' as a single character
   in a character range, but not as the start or end of an actual
   character range?

- Is it forbidden to use numeric character references for syntactically
   relevant characters (e.g. metacharacters) in patterns? E.g. would
   "a&#x2A;" in the XML source be illegal? If yes, how could a schema
   processor built on the infoset check this? Would it mean the same thing
   as "a*"? Or would it mean the same thing as "a\*"? Again, how would
   a schema processor built on the infoset do that?

- Would it be possible to use "a&amp;#x2A;" in the XML source instead
   of "a\*"? In the infoset, this would give "a&#x2A;", which according
   to the grammar would be the same as "a\*". Is this necessary?
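
The XML-source spellings used in the questions above can be checked against what an XML parser actually reports. The following is a minimal sketch using Python's xml.etree.ElementTree as a stand-in for an infoset-producing parser; the helper name infoset_value and the bare, un-namespaced 'pattern' element are assumptions made only for this illustration.

```python
import xml.etree.ElementTree as ET


def infoset_value(xml_source_value: str) -> str:
    """Return the attribute value the parser reports for the given source text.

    Hypothetical helper: wraps the text in a bare <pattern value="..."/>
    element and lets the XML parser resolve entity and character references.
    """
    doc = '<pattern value="{}"/>'.format(xml_source_value)
    return ET.fromstring(doc).get("value")


if __name__ == "__main__":
    examples = [
        "&amp;&#x23;&#x78;&#x35;&#x33;&#x31;&#x3B;",  # spelled-out '&#x531;'
        "a&#x2A;",                                    # reference to '*'
        "a&amp;#x2A;",                                # escaped ampersand
    ]
    for source in examples:
        print(source, "->", infoset_value(source))
```

In this sketch the first and third spellings reach the pattern as the literal strings '&#x531;' and 'a&#x2A;', while the second reaches it as 'a*', so only the first and third could ever be seen by production [19].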

One can continue to construct more such examples and questions,
but it only gets more and more confusing.

I think that what makes most sense, and was most probably intended,
is that the conversion from the XML source to the infoset takes
care of numeric character references, and that the pattern in the
infoset doesn't contain any numeric character references anymore,
and is just a sequence of Unicode characters. The only characters
still escaped in that sequence are the regexp metacharacters.

In order to make this clear, an erratum including the following
changes should be issued:

- Remove production [19] from Appendix F.
- In production [17], remove the term 'XmlCharRef' and one '|' (or) bar.
- In order to make things clear, a note may be added just after
   the definition of 'Single Character Escape', as follows:

   Note: The regular expression syntax defined here does not include
         syntax for escaping arbitrary Unicode characters. In the XML
         representation of a pattern, XML numeric character references
         can be used to denote arbitrary Unicode characters. However,
         please note that these can be used for all characters in the
         regular expression, including metacharacters, because they
         are resolved before parsing the regular expression. See the
         examples in the following table:

         XML representation     Infoset        Meaning
         a&#x2A;                a*             arbitrary number of 'a'
         a\&#x2A;               a\*            'a', followed by '*'
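
The two rows of the table can be reproduced with a short sketch. It assumes that Python's xml.etree.ElementTree stands in for the XML parser and that Python's re module behaves like the Schema regular expression language for these two simple patterns; re.fullmatch is used because Schema patterns match the whole value. The helper name compile_from_xml is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def compile_from_xml(xml_source_value: str):
    """Resolve the XML representation first, then compile the result.

    Hypothetical helper: returns the infoset string and a compiled
    Python regex (a stand-in for a Schema regular expression engine).
    """
    doc = '<pattern value="{}"/>'.format(xml_source_value)
    value = ET.fromstring(doc).get("value")
    return value, re.compile(value)


if __name__ == "__main__":
    # Row 1: the reference is resolved to '*', which acts as a metacharacter.
    value, rx = compile_from_xml("a&#x2A;")
    print(value, bool(rx.fullmatch("aaa")), bool(rx.fullmatch("a*")))

    # Row 2: the backslash in the source escapes the resolved '*'.
    value, rx = compile_from_xml("a\\&#x2A;")
    print(value, bool(rx.fullmatch("aaa")), bool(rx.fullmatch("a*")))
```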


Regards,    Martin.

Received on Monday, 18 June 2001 01:27:10 UTC