- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Wed, 11 Jan 2006 00:01:07 +0100
- To: www-xml-schema-comments@w3.org, public-qt-comments@w3.org
- Cc: www-international@w3.org
Dear XML Schema Working Group, Dear XML Query Working Group, Dear XSL Working Group,

http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#regexs apparently fails to provide Basic Unicode Support as defined in UTS #18, as it does not meet RL1.1. Meeting the requirements defined for this conformance level is, however, a stated goal of the format.

RL1.1 requires provisions to refer to any Unicode code point. XML Schema, however, relies on external provisions to refer to characters, which in the case of XML 1.0 means that e.g. U+0001 cannot be referred to, and in the case of XML 1.1 that e.g. U+FFFE cannot be referred to. Other formats likely have similar restrictions.

It is thus not possible to express in an XML 1.0 document e.g. that any character except the form feed U+000C may be used. In Perl this would be either [^\x0C] or [\x00-\x0B\x0D-\x{10FFFF}]; since XML 1.0 does not allow use of U+000C or U+000B, XML 1.1 would have to be used for the schema.

A similar problem is allowing any character but U+0000-U+001F. In Perl this would be [^\x00-\x1F]; for a schema in XML 1.1 this would need to be rewritten to xml="[^&#x1;-&#x1F;]" and for a schema in XML 1.0 to xml="[^&#x9;&#xA;&#xD;]", which when applied to an XML 1.1 document would produce incorrect results.

To exclude e.g. code points designated for private use, in Perl one would write [^\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}]. To express this with the regular expression format in XML Schema 1.0 one would have to use private use code points, which one should not per W3C's character model, or negate the class first to [\x00-...] and then subtract the characters disallowed in the version of XML in use, and if that's not XML 1.1 you run into the same problem as above.

The effect is that this design discourages sharing regular expressions. Developers have to be aware of these subtle problems and convert between them by adding and subtracting character ranges, which is not unlikely to either introduce errors or persuade schema authors to use incorrect expressions so as to not depend on XML 1.1 support of schema validators (or query processors, or whatever invokes the engine).

It is important to note that a regular expression as in xml="[^&#x1;-&#x1F;]" is not actually allowed by XML Schema 1.0, as the definition excludes the characters regardless of whether the XML version in use allows them or not. With XML 1.1 support being suboptimal, and support for XML 1.1 or the relevant productions being optional where XML support is not relevant, the most likely result is that schemas, queries, etc. are authored that do not properly represent constraints for XML 1.1 documents.

It's important to realize that depending on the use of Unicode code points to refer to Unicode code points is a major design flaw; as pointed out above, this implies a requirement to either use code points one does not intend to use or to resort to awkward workarounds to refer to them by not mentioning them! The requirement RL1.1 cited above is wisely chosen so as to avoid precisely these problems.

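To make the negate-and-subtract workaround concrete, here is a rough sketch in Python of what a schema author would have to do by hand for the private use example when the schema is written in XML 1.0. The names XML10_CHAR, PRIVATE_USE, subtract and to_char_class are made up for this illustration; the first table is the Char production of XML 1.0, the second the three private use areas (the one in plane 16 starts at U+100000).

```python
# Code points permitted by the Char production of XML 1.0.
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# Private use areas that the expression is supposed to exclude.
PRIVATE_USE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def subtract(ranges, holes):
    """Return ranges minus holes.

    Both arguments are sorted, non-overlapping lists of inclusive
    (low, high) code point pairs.
    """
    result = []
    for lo, hi in ranges:
        for hole_lo, hole_hi in holes:
            if hole_hi < lo or hole_lo > hi:
                continue  # this hole does not touch what is left of the range
            if hole_lo > lo:
                result.append((lo, hole_lo - 1))  # keep the part before the hole
            lo = hole_hi + 1  # continue after the hole
            if lo > hi:
                break
        if lo <= hi:
            result.append((lo, hi))
    return result


def to_char_class(ranges):
    """Write the ranges as a positive character class using XML character references."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append(f"&#x{lo:X};")
        else:
            parts.append(f"&#x{lo:X};-&#x{hi:X};")
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    print(to_char_class(subtract(XML10_CHAR, PRIVATE_USE)))
```

The resulting class never mentions a private use code point, but it is only correct for XML 1.0 documents; for XML 1.1 the first table differs and the whole computation has to be redone.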
http://www.w3.org/TR/2005/CR-xpath-functions-20051103/ and XML Schema 1.1 http://www.w3.org/TR/2005/WD-xmlschema11-2-20050224/ should be changed to conform to RL1.1 by introducing a new way to refer to Unicode code points wherever SingleCharEsc is allowed. The \x{....} syntax I used here would be sufficient; allowing \x.. would be a plus. Ideally there would be a stand-alone Technical Report that defines this regex format including all relevant extensions, and the various specifications would simply refer to that, pointing out the subset in use rather than extending that format.

This syntax will also make it considerably simpler to avoid conversion problems when converting regular expressions in other formats for use in XML Schema or other formats that use a similar expression language, and help to protect against unexpected behavior of engines, e.g. when the regular expression is included in an attribute value and thus subject to attribute value normalization.

Extending a regular expression parser as proposed above is considerably simpler for implementers than converting regular expressions to avoid the problems I cited here is for developers. Converting special characters to use the \x{....} syntax can be done using a simple regular expression, while finding special characters and problematic character classes would require more extensive parsing of the expression and an implementation of set addition and subtraction.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
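
As an illustration of how small the conversion to the proposed syntax is, the following Python sketch replaces every character outside printable ASCII with a \x{....} escape using a single regular expression. The function name escape_code_points is made up for this example, and the output is only meaningful for an engine that accepts the \x{....} syntax, which is what this message proposes and not something the cited specifications define today.

```python
import re


def escape_code_points(pattern: str) -> str:
    """Replace each character outside U+0020-U+007E with a \\x{...} escape.

    Hypothetical helper: it assumes the target engine understands the
    \\x{....} notation proposed in this message.
    """
    return re.sub(
        r"[^\x20-\x7E]",
        lambda match: "\\x{%X}" % ord(match.group()),
        pattern,
    )


if __name__ == "__main__":
    # A class holding literal control characters, as it would appear after
    # an XML 1.1 parser has resolved the character references.
    print(escape_code_points("[^\x01-\x1f]"))
```

No parsing of the expression and no set arithmetic is needed in this direction; the characters are rewritten one by one wherever they occur.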
Received on Tuesday, 10 January 2006 23:00:39 UTC