Re: Is every XML Schema validator guaranteed to support the same set of Unicode characters?

From: Michael Kay <mike@saxonica.com>
Date: Mon, 16 May 2011 22:07:55 +0100
Message-ID: <4DD1922B.80508@saxonica.com>
To: xmlschema-dev@w3.org
On 16/05/2011 19:54, Costello, Roger L. wrote:
> Hi Folks,
>
> 1. Is every XML Schema validator guaranteed to support the same set of Unicode characters?
Firstly, let's assume we are talking about conformant XSD processors. 
There are many that aren't conformant, and regex support is a notorious 
black spot for this.

As far as conformant processors are concerned, the spec offers 
implementors freedom to choose which version of Unicode they will 
support. So if the definitions of character groups like Nd change from 
one Unicode version to the next, this may be reflected in differences 
between schema processors. In practice this is only likely to affect you 
if you are on the bleeding edge of the Unicode repertoire.
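To make the version dependence concrete, here is a minimal sketch in Python (not an XSD processor; it only inspects the Unicode database bundled with the interpreter). The helper names is_nd and nd_count are made up for illustration. It reports which Unicode version is in use and how many characters that version places in category Nd; two runtimes built against different Unicode versions may report different counts, which is the kind of difference described above.

```python
import unicodedata


def is_nd(ch: str) -> bool:
    """Return True if the local Unicode database classifies ch as Nd (decimal digit)."""
    return unicodedata.category(ch) == "Nd"


def nd_count() -> int:
    """Count the code points that the local Unicode database places in category Nd."""
    return sum(1 for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Nd")


if __name__ == "__main__":
    # The version string and the count depend on the Python build, so no
    # particular values are assumed here.
    print("Unicode database version:", unicodedata.unidata_version)
    print("Characters in Nd for this version:", nd_count())
    print("U+0663 ARABIC-INDIC DIGIT THREE is Nd:", is_nd("\u0663"))
```

A character added to Nd in a newer Unicode release would be reported as a digit by one runtime and not by an older one, while long-established digits behave the same everywhere.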
>
>
> 2. Is every version of XML Schema guaranteed to support the same set of Unicode characters as all other versions?
See above.
>
> 3. Does XML determine the set of characters supported by XML Schema? That is, does XML Schema support the set of Unicode characters specified in the XML specification?
Yes - but XML itself allows new characters when Unicode adds them.
>
> 4. If I use this regex in my XML Schema:
>
>        [^0-9]*
>
> Is there a risk that:
>
> a. The set of strings described by the regex may vary, depending on the XML Schema validator (or an XML Schema application)?
>
> b. With different versions of XML Schema (e.g., XML Schema 1.0, XML Schema 1.1) the regex may describe different sets of strings?
>
No, for a simple regex like this you'll get the same results with every 
processor, even non-conformant ones, unless they're pathological.
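One way to see why: [^0-9] names an explicit range of code points (U+0030 to U+0039), so no Unicode category table is consulted. The sketch below uses Python's re module as a stand-in, which is an assumption: it is not an XSD regex engine, but it treats this particular character class the same way. The helper name matches is hypothetical. XSD patterns are implicitly anchored, so fullmatch is the closest analogue.

```python
import re

# Explicit code point range; no Unicode category lookup is involved.
PATTERN = re.compile(r"[^0-9]*")


def matches(s: str) -> bool:
    """Return True if the whole string matches [^0-9]* (XSD patterns are anchored)."""
    return PATTERN.fullmatch(s) is not None


if __name__ == "__main__":
    for sample in ["", "abc", "a1b", "\u0663"]:
        print(repr(sample), matches(sample))
```

Note that U+0663 (ARABIC-INDIC DIGIT THREE) is accepted: it is a digit in the Nd sense but lies outside the range 0-9, so the result does not depend on which Unicode version a processor supports.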

Michael Kay
Saxonica
Received on Monday, 16 May 2011 21:08:19 GMT