- From: <bugzilla@wiggum.w3.org>
- Date: Thu, 14 Jan 2010 12:38:59 +0000
- To: www-xml-schema-comments@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=8744
Summary: Regex character classes C, L, M, etc.
Product: XML Schema
Version: 1.0/1.1 both
Platform: PC
OS/Version: Windows NT
Status: NEW
Severity: normal
Priority: P2
Component: Datatypes: XSD Part 2
AssignedTo: David_E3@VERIFONE.com
ReportedBy: mike@saxonica.com
QAContact: www-xml-schema-comments@w3.org
CC: cmsmcq@blackmesatech.com
The specification states:
<quote>
[Definition:] [Unicode Database] specifies a number of possible values for the
"General Category" property and provides mappings from code points to specific
character properties. The set containing all characters that have property X,
can be identified with a category escape \p{X} . The complement of this set
is specified with the category escape \P{X} . ( [\P{X}] = [^\p{X}] ).
</quote>
It then gives a table purporting to show the values of "General Category" that
occur in Unicode 5.1. This includes single-character categories such as "C",
"L", and "M". As far as I can see, however, Unicode only defines the
two-character categories such as Ll, Lu, Mc and so on. The single-character
categories are an invention of the regex language, and therefore need to be
described in our specification, rather than by reference to Unicode.
There are two possible definitions of these categories, which give different
results.
At least one XML Schema implementation has interpreted the single-character
category X as the union of all two-character categories starting with X; for
example, C is the union of Cc, Cf, Co, and Cn. Another interpretation (the one
used by the Java regex library) is that X is the set of all characters listed
in the Unicode database as belonging to a category starting with that letter.
The two differ for category C, since Cn is the set of characters that are not
listed in the relevant section of the Unicode database.
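
To make the difference concrete, here is a minimal Python sketch of the two
interpretations. It does not use any real regex engine or the real Unicode
database: the LISTED table is a hypothetical four-entry stand-in, and the
function names in_union and in_listed are invented for illustration only.

```python
# Hypothetical miniature "Unicode database": code point -> General Category.
# Only explicitly listed code points appear here. This table is an assumption
# made for illustration; it is not real Unicode data beyond these four entries.
LISTED = {
    0x0009: "Cc",  # CHARACTER TABULATION (control)
    0x0041: "Lu",  # LATIN CAPITAL LETTER A
    0x00AD: "Cf",  # SOFT HYPHEN (format)
    0xE000: "Co",  # a private-use code point
}


def general_category(cp):
    """Return the category of cp, treating anything unlisted as Cn."""
    return LISTED.get(cp, "Cn")


def in_union(cp, letter):
    """Interpretation 1: X is the union of all two-character categories
    starting with X, so an unlisted code point (Cn) belongs to C."""
    return general_category(cp).startswith(letter)


def in_listed(cp, letter):
    """Interpretation 2: X contains only code points that the database
    explicitly lists with a category starting with X."""
    cat = LISTED.get(cp)
    return cat is not None and cat.startswith(letter)


if __name__ == "__main__":
    missing = 0x0378  # absent from the toy table above
    for cp in (0x0009, 0x0041, missing):
        print(hex(cp), in_union(cp, "C"), in_listed(cp, "C"))
```

Under these assumptions the two functions agree on every listed code point
and disagree only on code points absent from the table, which is the Cn case
described above.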
Received on Thursday, 14 January 2010 12:39:00 UTC