- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Thu, 24 Jun 2010 08:11:35 -0600
- To: oliver@cbcl.co.uk
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, www-xml-schema-comments@w3.org
On 24 Jun 2010, at 05:47, bugzilla@jessica.w3.org wrote:

> http://www.w3.org/Bugs/Public/show_bug.cgi?id=10008
>
>            Summary: Use of Unicode blocks that no longer exist in
>                     regular expressions.
>            Product: XML Schema
>            Version: 1.0/1.1 both
>           Platform: PC
>         OS/Version: Windows NT
>             Status: NEW
>           Severity: normal
>           Priority: P2
>          Component: Datatypes: XSD Part 2
>         AssignedTo: David_E3@VERIFONE.com
>         ReportedBy: oliver@cbcl.co.uk
>          QAContact: www-xml-schema-comments@w3.org
>                 CC: cmsmcq@blackmesatech.com
>
>
> Section F.1.1 states the following:
>
> Note: [Unicode Database] is subject to future revision. For example,
> the grouping of code points into blocks might be updated. All
> ·minimally conforming· processors ·must· support the blocks defined
> in the version of [Unicode Database] that is current at the time
> this specification became a W3C Recommendation. However,
> implementors are encouraged to support the blocks defined in any
> future version of the Unicode Standard.
>
> Unfortunately some of these blocks no longer exist in the current
> Unicode specification! I believe the changes are limited to the
> following:
>
> CombiningMarksforSymbols is now CombiningDiacriticalMarksforSymbols
>
> Greek is now GreekandCoptic
>
> PrivateUse has been split into three groups (we think):
> PrivateUseArea, SupplementaryPrivateUseAreaA and
> SupplementaryPrivateUseAreaB.
>
> The behaviour for these old group names is left a bit vague. I
> suggest that the correct behaviour should be one of the following,
> but this is not specified anywhere:
>
> 1) The old block names should no longer be valid. This is a direct
> contradiction with the specification and would cause compatibility
> problems.
>
> 2) The old names should refer to groups in an older version of the
> Unicode specification that did have them. In particular I suggest
> that this should be the version used in the Schema specification.
>
> 3) The old names should map to the equivalent groups in the newer
> version of the specification. I can't find this mapping specified
> anywhere, but I believe it to be as described above (at least for
> the current version).

For XSD 1.1, I think the answer is given partly by the sentences
quoted above and partly by the following words in section G.1.1
Character Class Escapes, which follow them (and were added in 1.1):

    When the implementation supports multiple versions of the
    Unicode database, and they differ in salient respects (e.g.
    different properties are assigned to the same character in
    different versions of the database), then it is
    ·implementation-defined· which set of property definitions is
    used for any given assessment episode.

XSD 1.0 does not use the explicit terms 'implementation-defined'
and 'implementation-dependent'; these were introduced by XSD 1.1
(thanks to the example of the QT specs, and SQL). But I think it
would be reasonable to take the added words in 1.1 as a
clarification, not a change to the design, of 1.0, and infer:

- You can support the normatively referenced version of the
  Unicode database.

- You are encouraged to support later versions of the database
  as well. [N.B. 'as well', not 'instead'. Some have suggested
  that no implementers with space constraints and a brain will
  be willing to support the old db as well as the current db
  supported by their underlying Unicode libraries. Some have
  suggested the spec is foolish to tell them they must. And
  some have suggested that if implementors ignore foolish rules
  in the spec, they won't be the first to do so. Enough said.]

- For any given regex, it seems clear that you have to interpret
  it in a way consistent with some one version of the Unicode
  database.

- How you choose which version should guide your interpretation
  is up to you. You could offer a run-time option to the user.
  You could decide that if they use the block name 'Greek' they
  must mean you to use a version of the database that has a block
  named 'Greek'. You could flip a coin.

XSD 1.1 requires you to document how you determine which version
of the database to use in interpreting block names. It does not,
as far as I can see, require anything further. (It does not, for
example, appear to require that you always use the same version
within a given validation, though as a user I think I'd rather
that you did.)

XSD 1.0 does not require you to document anything. (Although as
a user I'd rather that you did.)

Personally, I think the behaviors you label 1 and 2 are both
consistent with the 1.0 and 1.1 specs. I think behavior 3 may be
a little suspect (at least, it is not what I as a reader think
the spec has in mind when it talks about support for multiple
versions of the Unicode database), but I also do not think it
would be easy to prove conclusively that behavior 3 is forbidden
by either 1.0 or 1.1.

As a user, I personally would be happiest with some reasonable
default (always new, always old, use the one that makes the regex
valid, or perhaps something else) and a way to force a particular
behavior. But then, as a user I always like to be able to control
what happens.

I hope this helps.

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************
Received on Thursday, 24 June 2010 14:12:21 UTC