Re: [Bug 10008] New: Use of Unicode blocks that no longer exist in regular expressions. from C. M. Sperberg-McQueen on 2010-06-24 (www-xml-schema-comments@w3.org from April to June 2010)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Thu, 24 Jun 2010 08:11:35 -0600
To: oliver@cbcl.co.uk
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, www-xml-schema-comments@w3.org
Message-Id: <A32EA3B8-D5A3-4853-8C1A-4EB8275960DE@blackmesatech.com>
On 24 Jun 2010, at 05:47 , bugzilla@jessica.w3.org wrote:

> http://www.w3.org/Bugs/Public/show_bug.cgi?id=10008
>
>           Summary: Use of Unicode blocks that no longer exist in regular
>                    expressions.
>           Product: XML Schema
>           Version: 1.0/1.1 both
>          Platform: PC
>        OS/Version: Windows NT
>            Status: NEW
>          Severity: normal
>          Priority: P2
>         Component: Datatypes: XSD Part 2
>        AssignedTo: David_E3@VERIFONE.com
>        ReportedBy: oliver@cbcl.co.uk
>         QAContact: www-xml-schema-comments@w3.org
>                CC: cmsmcq@blackmesatech.com
>
>
> Section F.1.1 states the following:
>
> Note:  [Unicode Database] is subject to future revision. For example, the
> grouping of code points into blocks might be updated. All ·minimally
> conforming· processors ·must· support the blocks defined in the version of
> [Unicode Database] that is current at the time this specification became a W3C
> Recommendation. However, implementors are encouraged to support the blocks
> defined in any future version of the Unicode Standard.
>
> Unfortunately some of these blocks no longer exist in the current Unicode
> specification!  I believe the changes are limited to the following:
>
> CombiningMarksforSymbols is now CombiningDiacriticalMarksforSymbols
>
> Greek is now GreekandCoptic
>
> PrivateUse has been split into three groups (we think):
> PrivateUseArea, SupplementaryPrivateUseAreaA and SupplementaryPrivateUseAreaB.
>
> The behaviour for these old group names is left a bit vague.  I suggest that
> the correct behaviour should be one of the following, but this is not specified
> anywhere:
>
> 1) The old block names should no longer be valid.  This is a direct
> contradiction with the specification and would cause compatibility problems.
>
> 2) The old names should refer to groups in an older version of the Unicode
> specification that did have them.  In particular I suggest that this should be
> the version used in the Schema specification.
>
> 3) The old names should map to the equivalent groups in the newer version of
> the specification.  I can't find this mapping specified anywhere, but I believe
> it to be as described above (at least for the current version).

For XSD 1.1, I think the answer is given partly by the sentences quoted
above and partly by the following words in section G.1.1 Character Class
Escapes, which follow them (and were added in 1.1):

     When the implementation supports multiple versions of the Unicode
     database, and they differ in salient respects (e.g. different
     properties are assigned to the same character in different versions
     of the database), then it is ·implementation-defined· which set of
     property definitions is used for any given assessment episode.

XSD 1.0 does not use the explicit terms 'implementation-defined' and
'implementation-dependent'; these were introduced by XSD 1.1 (thanks
to the example of the QT specs, and SQL).  But I think it would be
reasonable to take the added words in 1.1 as a clarification, not
a change to the design, of 1.0, and infer:

- You must support the normatively referenced version of the Unicode
database.

- You are encouraged to support later versions of the database as well.

   [N.B. 'as well', not 'instead'.  Some have suggested that no
   implementers with space constraints and a brain will be willing to
   support the old db as well as the current db supported by their
   underlying Unicode libraries.  Some have suggested the spec is foolish
   to tell them they must.  And some have suggested that if implementors
   ignore foolish rules in the spec, they won't be the first to do so.
   Enough said.]

- For any given regex, it seems clear that you have to interpret it
   in a way consistent with some one version of the Unicode database.

- How you choose which version should guide your interpretation is
up to you.  You could offer a run-time option to the user.  You could
decide that if they use the block name 'Greek' they must mean you to
use a version of the database that has a block named 'Greek'.  You
could flip a coin.
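
To make the 'Greek' option above concrete, here is a minimal sketch in
Python of picking a database version from the block names a regex uses.
It is an illustration only, not anything the spec prescribes: the table
BLOCKS_BY_VERSION, the labels "old" and "new", and both function names
are hypothetical, and the tables list only the block names mentioned in
the bug report rather than a complete Unicode block list.

```python
import re

# Hypothetical tables: only the block names from the bug report.
# A real processor would load full block lists from the Unicode database,
# so any other block name (e.g. BasicLatin) is unknown to this sketch.
BLOCKS_BY_VERSION = {
    "old": {"Greek", "CombiningMarksforSymbols", "PrivateUse"},
    "new": {
        "GreekandCoptic",
        "CombiningDiacriticalMarksforSymbols",
        "PrivateUseArea",
        "SupplementaryPrivateUseAreaA",
        "SupplementaryPrivateUseAreaB",
    },
}

# Matches block escapes such as \p{IsGreek} or \P{IsGreek}.
# Simplification: an escaped backslash before 'p' is not treated specially.
BLOCK_ESCAPE = re.compile(r"\\[pP]\{Is([A-Za-z0-9-]+)\}")


def block_names(regex):
    """Return the set of block names used in block escapes of the regex."""
    return set(BLOCK_ESCAPE.findall(regex))


def choose_version(regex, preference=("new", "old")):
    """Return the first preferred version that defines every block name used.

    Returns None when no single version covers all the names, so the whole
    regex is always read against one version of the database.
    """
    names = block_names(regex)
    for version in preference:
        if names <= BLOCKS_BY_VERSION[version]:
            return version
    return None
```

With these tables, a regex that uses \p{IsGreek} selects "old", one that
uses \p{IsGreekandCoptic} selects "new", and one that mixes the two
selects neither, which a processor might reasonably report as an error.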

XSD 1.1 requires you to document how you determine which version of
the database to use in interpreting block names.  It does not, as far
as I can see, require anything further.  (It does not, for example,
appear to require that you always use the same version within a given
validation, though as a user I think I'd rather that you did.)

XSD 1.0 does not require you to document anything.  (Although as
a user I'd rather that you did.)

Personally, I think the behaviors you label 1 and 2 are both consistent
with the 1.0 and 1.1 specs.  I think behavior 3 may be a little
suspect (at least, it is not what I as a reader think the spec has
in mind when it talks about support for multiple versions of the
Unicode database), but I also do not think it would be easy to prove
conclusively that behavior 3 is forbidden by either 1.0 or 1.1.
As a user, I personally would be happiest with some reasonable
default (always new, always old, use the one that makes the regex
valid, or perhaps something else) and a way to force a particular
behavior.  But then, as a user I always like to be able to control
what happens.
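
For what it is worth, the three behaviors from the bug report can be
sketched as a single lookup, again in Python. The names RENAMED and
resolve_old_name are hypothetical, and the mapping is the one proposed
in the report (including its tentative three-way split of PrivateUse),
not one defined by either spec.

```python
# Old block name -> new block name(s), as proposed in the bug report.
# This mapping is the reporter's suggestion, not part of XSD 1.0 or 1.1.
RENAMED = {
    "CombiningMarksforSymbols": ["CombiningDiacriticalMarksforSymbols"],
    "Greek": ["GreekandCoptic"],
    "PrivateUse": [
        "PrivateUseArea",
        "SupplementaryPrivateUseAreaA",
        "SupplementaryPrivateUseAreaB",
    ],
}


def resolve_old_name(name, behavior):
    """Resolve an old block name under behavior 1, 2, or 3 of the report.

    Returns a (version, block_names) pair, where version is "old" or "new".
    """
    if name not in RENAMED:
        raise KeyError(f"not one of the renamed blocks: {name}")
    if behavior == 1:
        # Behavior 1: the old name is simply no longer valid.
        raise ValueError(f"unknown block name: {name}")
    if behavior == 2:
        # Behavior 2: interpret the name against the older database.
        return ("old", [name])
    if behavior == 3:
        # Behavior 3: map the name to its equivalents in the newer database.
        return ("new", list(RENAMED[name]))
    raise ValueError(f"unknown behavior: {behavior}")
```

A user-facing option that forces a particular behavior would then only
need to choose the second argument, with the processor supplying its
documented default otherwise.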

I hope this helps.

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************
Received on Thursday, 24 June 2010 14:12:21 UTC