Regular expressions should support \x{....} escapes

Dear XML Schema Working Group,
Dear XML Query Working Group,
Dear XSL Working Group, 

  http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#regexs apparently
fails to provide Basic Unicode Support as defined in UTS #18 as it does
not meet RL1.1. Meeting the requirements defined for this conformance
level is however a stated goal of the format.

RL1.1 requires provisions to refer to any Unicode code point. XML
Schema, however, relies on external provisions to refer to characters,
which in the case of XML 1.0 means that e.g. U+0001 cannot be referred
to, and in the case of XML 1.1 that e.g. U+FFFE cannot be referred to.
Other formats likely have similar restrictions.
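
As an illustration, the restriction can be checked with a small Python
sketch of the Char productions of XML 1.0 and XML 1.1. The names
XML10_CHAR, XML11_CHAR and is_xml_char are made up for this example; the
ranges are transcribed from the two Recommendations.

```python
# Code point ranges of the Char production in XML 1.0 and XML 1.1.
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

def is_xml_char(cp, ranges):
    """Return True if code point cp falls into one of the given ranges."""
    return any(lo <= cp <= hi for lo, hi in ranges)

# Report, for a few code points, whether each XML version can carry them.
for cp in (0x0001, 0x000C, 0xFFFE):
    print("U+%04X  XML 1.0: %-5s  XML 1.1: %s" % (
        cp, is_xml_char(cp, XML10_CHAR), is_xml_char(cp, XML11_CHAR)))
```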

It is thus not possible to express in an XML 1.0 document e.g. that any
character except the form feed U+000C may be used. In Perl this would be
either [^\x0C] or [\x00-\x0B\x0D-\x{10FFFF}]; since XML 1.0 does not
allow use of U+000C or U+000B, XML 1.1 would have to be used for the
schema.
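
For comparison, here is a sketch of the same two classes in Python,
whose re module has no \x{....} but offers \xhh, \uhhhh and \Uhhhhhhhh
instead. The helper agree() is made up for this example; it only checks
that both spellings give the same verdict on a handful of characters.

```python
import re

# "Any character except form feed", written as a negated class and as
# an explicit pair of ranges.
negated = re.compile(r"[^\x0C]")
explicit = re.compile(r"[\x00-\x0B\x0D-\U0010FFFF]")

def agree(ch):
    """True if both classes give the same verdict for character ch."""
    return bool(negated.fullmatch(ch)) == bool(explicit.fullmatch(ch))

samples = ["\x00", "\x0B", "\x0C", "\x0D", "A", "\uFFFE", "\U0010FFFF"]
print(all(agree(ch) for ch in samples))
```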

A similar problem is allowing any character but U+0000-U+001F. In Perl
this would be [^\x00-\x1F]; for a schema in XML 1.1 this would need to
be rewritten to xml="[^&#x1;-&#x1F;]" and for a schema in XML 1.0 to
xml="[^&#x9;&#xA;&#xD;]", which when applied to an XML 1.1 document
would produce incorrect results.
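
The discrepancy can be made visible with a short Python sketch. It
compares the intended class with the form an XML 1.0 schema is limited
to, which can only name the three control characters XML 1.0 permits
(tab, line feed, carriage return). The variable names are made up for
this example.

```python
import re

intended = re.compile(r"[^\x00-\x1F]")        # what the author means
xml10_form = re.compile(r"[^\x09\x0A\x0D]")   # what XML 1.0 can spell

# U+0001..U+001F are all legal in XML 1.1 documents (most of them only
# as character references). Collect those on which the classes disagree.
disagree = [cp for cp in range(0x01, 0x20)
            if bool(intended.fullmatch(chr(cp)))
            != bool(xml10_form.fullmatch(chr(cp)))]

print(["U+%04X" % cp for cp in disagree])
```

Every control character other than tab, line feed and carriage return
ends up in the list: the XML 1.0 form accepts it although the intended
class rejects it.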

To exclude e.g. code points designated for private use, in Perl one
would write [^\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}].
To express this with the regular expression format in XML Schema 1.0 one
would have to either use private use code points, which one should not
per W3C's character model, or negate the class first to [\x00-...] and
then subtract the characters disallowed in the version of XML in use;
and if that is not XML 1.1, one runs into the same problem as above.
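
As a sketch of what escape-based notation buys, the private use
exclusion can be written in Python without a single private use
character appearing in the source. The ranges are the BMP private use
area U+E000-U+F8FF and the supplementary areas U+F0000-U+FFFFD and
U+100000-U+10FFFD; the names not_private_use and is_allowed are made up
for this example.

```python
import re

# Negated class over the three private use ranges, written purely with
# escapes (\uhhhh for the BMP, \Uhhhhhhhh for planes 15 and 16).
not_private_use = re.compile(
    r"[^\uE000-\uF8FF\U000F0000-\U000FFFFD\U00100000-\U0010FFFD]")

def is_allowed(ch):
    """True if the single character ch is not a private use code point."""
    return not_private_use.fullmatch(ch) is not None

for ch in ("A", "\uE000", "\U000F0000", "\U00100000", "\U00010000"):
    print("U+%04X" % ord(ch), is_allowed(ch))
```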

The effect is that this design discourages sharing regular expressions:
developers have to be aware of these subtle problems and convert between
them by adding and subtracting character ranges, which is not unlikely
either to introduce errors or to persuade schema authors to use
incorrect expressions so as not to depend on XML 1.1 support in schema
validators (or query processors, or whatever invokes the engine).

It is important to note that a regular expression as in
xml="[^&#x1;-&#x1F;]" is not actually allowed by XML Schema 1.0, as the
definition excludes the characters regardless of whether the XML version
in use allows them or not. With XML 1.1 support being suboptimal, and
support for XML 1.1 (or the relevant productions, where XML support is
not relevant) being optional, the most likely result is that schemas,
queries, etc. are authored that do not properly represent constraints
for XML 1.1 documents.

It's important to realize that depending on use of Unicode code points
to refer to Unicode code points is a major design flaw; as pointed out
above, this implies a requirement to either use code points one does not
intend to use or to resort to awkward workarounds to refer to them by
not mentioning them! The requirement RL1.1 cited above is wisely chosen
so as to avoid precisely these problems.

http://www.w3.org/TR/2005/CR-xpath-functions-20051103/ and XML Schema
1.1 http://www.w3.org/TR/2005/WD-xmlschema11-2-20050224/ should be
changed to conform to RL1.1 by introducing a new way to refer to Unicode
code points wherever SingleCharEsc is allowed. The \x{....} syntax I
used here would be sufficient; allowing \x.. as well would be a plus.

Ideally there would be a stand-alone Technical Report that defines this
regex format including all relevant extensions, with the various
specifications simply referring to it and pointing out the subset in use
rather than extending that format.

This syntax will also make it considerably simpler to avoid conversion
problems when converting regular expressions from other formats for use
in XML Schema or other formats that use a similar expression language,
and help to protect against unexpected behavior of engines, e.g. when
the regular expression is included in an attribute value and thus
subject to attribute value normalization.
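
The normalization hazard can be shown with a short Python sketch using
the standard library's ElementTree parser. The <pattern> element and its
"xml" attribute are hypothetical and chosen only for this example; the
\x{9} form is the proposed syntax, which the XML parser passes through
as ordinary text.

```python
import xml.etree.ElementTree as ET

# The same class written three ways inside an attribute value:
# with a literal tab, with a character reference, and with \x{9}.
literal = ET.fromstring('<pattern xml="[^\t]"/>').get("xml")
by_ref = ET.fromstring('<pattern xml="[^&#x9;]"/>').get("xml")
escaped = ET.fromstring('<pattern xml="[^\\x{9}]"/>').get("xml")

print(repr(literal))   # the literal tab has been normalized to a space
print(repr(by_ref))    # the character reference survives as a tab
print(repr(escaped))   # the escape reaches the regex engine unchanged
```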

Extending a regular expression parser as proposed above is considerably
simpler for implementers than converting regular expressions to avoid
the problems I cited here for developers. Converting special characters
to use \x{....} syntax can be done using a simple regular expression,
while finding special characters and problematic character classes would
require more extensive parsing of the expression and implementation of
set addition and subtraction.
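
A minimal sketch of such a conversion in Python, assuming that every
character outside printable ASCII counts as "special" (that choice, and
the name escape_special, are mine for this example):

```python
import re

def escape_special(pattern):
    """Replace every character outside printable ASCII with \\x{HEX}."""
    return re.sub(r"[^\x20-\x7E]",
                  lambda m: "\\x{%X}" % ord(m.group()),
                  pattern)

# The input contains the literal characters U+0001 and U+001F.
print(escape_special("[^\x01-\x1F]"))
```

The single substitution needs no knowledge of the structure of the
expression, which is the point made above.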

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Tuesday, 10 January 2006 23:00:49 UTC