[Bug 1850] [F&O] how do ranges work in case-insensitive mode?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850





------- Additional Comments From holstege@mathling.com  2005-09-14 16:25 -------

First, I'd like to thank Michael for this proposal. It is certainly clear, and
while there are behaviours that are perhaps unexpected, I think that is
inevitable in this area. 

Acknowledging Michael's comments about the overflowing trashbin (and
contributing a few crumpled sheets there myself), I nevertheless find myself
unhappy with talking about "expanding" the regular expression and 
would prefer to shift to speaking about case-folding as applying to how the
input string is matched. 

From an implementation point of view, expanding regular expressions has
to be done on a case-by-case basis (no pun intended!). While it doesn't make it
impossible to cache regular expressions (i.e. pre-analyze and parse them), 
it does make it trickier and less useful to do so, as the regular expression
itself is no longer a sufficient key to what the analyzed regular expression
is. 

A consequence of this shift would be that case-folding would apply uniformly,
so that, for example: 

    fn:matches( "d", "\p{Lu}", "i" ) = fn:matches( "d", "[A-Z]", "i" )

which is not the case under Michael's proposal. I would go on to argue that
it would be good if both of these were true.  One reason for making this so
is that Datatypes says that "\P{Lu}" == "[^\p{Lu}]" and therefore you get some
odd inconsistencies if you don't apply the case-folding to the category
escapes as well. 

All of which sums up to putting an obligation on me to come up with a 
counter-proposal. 

My general tack on this is to tweak two statements in XML Schema Datatypes
that define what set of strings a character denotes and what set of strings a
character class denotes.  But I think Michael's case by case exposition is
most excellent and clear, and so I continue with that, tweaking the verbiage to
avoid the "expands" phrasing, treating it as clarification because those two 
rules are sufficient, and adding the additional cases that Michael's proposal 
doesn't touch.

COUNTER-PROPOSAL:

The detailed rules for the effect of the "i" flag are as follows. In these
rules, one character is considered to be a *case-variant* of another character
if there is a default case mapping between the two characters as defined in
section 3.13 of [The Unicode Standard]. Note that the case-variants of a
character under this definition are always single characters.

The rules for regular expressions in [XML Schema Part 2: Datatypes Second
Edition] are modified under the influence of the "i" flag in the following way:

1. A normal character c denotes a set of strings that contains one
   single-character string "x" for each character x that is either c or a
   case-variant of c.  

2. A character class C denotes a set of strings that contains one
   single-character string "x" for each character x that is either in the class
   or is a case-variant of some character in the class. 
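
To make the two rules concrete, here is a minimal Python sketch. It is only
an illustration: the helper names (case_variants, denote_char, denote_class)
are hypothetical, and Python's str.lower()/str.upper() are used as a rough
stand-in for the default case mappings of the Unicode Standard. Results longer
than one character are discarded, because case-variants are single characters
by definition.

```python
def case_variants(c):
    """Approximate the single-character case-variants of c.

    Assumption: str.lower()/str.upper() stand in for the Unicode default
    case mappings; only mappings *from* c are considered here.
    """
    candidates = {c.lower(), c.upper()}
    return {v for v in candidates if len(v) == 1 and v != c}


def denote_char(c):
    """Rule 1: a normal character denotes itself plus its case-variants."""
    return {c} | case_variants(c)


def denote_class(members):
    """Rule 2: a class denotes its members plus all their case-variants."""
    denoted = set()
    for x in members:
        denoted |= denote_char(x)
    return denoted


z_set = denote_char("z")          # same characters as "[zZ]"
abc_set = denote_class("ABC")     # same characters as "[ABCabc]"
```

Because both helpers return plain sets of characters, the pattern itself is
never rewritten; the "i" flag only changes which characters an atom accepts.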

Specifically, the application of these rules means:
* When a normal character (Char) is used as an atom, it represents the set
  containing that character and all its case-variants. For example, the regular
  expression "z" matches the same set of characters as "[zZ]".

* A character range (charRange) is a character class, and therefore represents
  the set containing all the characters that it would match in the absence of
  the "i" flag, together with their case-variants. For example, "[A-Z]"
  matches the same set of characters as "[A-Za-z]". 

* A character range used in character class subtraction (charClassSub)
  also represents the set containing all the characters that it would match in
  the absence of the "i" flag, together with their case-variants. For example,
  "[A-Z-[IO]]" matches the same set of characters as "[A-Za-z-[IOio]]".

* A negative character group (negCharGroup) is also a character class and
  the same rule applies. For example, "[^Q]" matches the same set of
  characters as "[^Qq]". 

* A category escape (catEsc) is also a character class and the same rule 
  applies.  For example, "\p{Lu}" matches all the upper case letters and their
  case-variants, and thus the string "d" would match "\p{Lu}".

* A complement category escape (complEsc) is also a character class and the
  same rule applies.  For example, "\P{Lu}" matches all letters that are
  neither upper case nor case-variants of an upper case letter. Therefore
  "d" would not match "\P{Lu}".

* The same rule applies to single-character (SingleCharEsc) and multi-character
  (MultiCharEsc) escapes, although in practice this will have no effect.

* A back-reference is compared using case-blind comparison: that is, each
  character must either be the same as the corresponding character of the
  previously matched string, or must be a case-variant of that character. For
  example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular
  expression "([md])[aeiou]\1" when the "i" flag is used.
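
The examples in the list above can be checked against a small set model,
sketched here in Python. Several things are assumptions made for the
illustration and are not part of the proposal text: the universe is limited to
ASCII, the fold helper approximates case-variants with str.lower() and
str.upper(), and negation and complement are applied after the positive class
has been folded, which is what the "[^Q]" and "\P{Lu}" examples imply.
The last lines use Python's own re engine, which is not the XPath regex
dialect, only to show the back-reference behaviour.

```python
import re
import unicodedata

# Assumption: a small ASCII universe keeps the sets easy to inspect.
UNIVERSE = {chr(i) for i in range(128)}


def fold(chars):
    """Add the single-character upper/lower variants of every member."""
    out = set(chars)
    for c in chars:
        for v in (c.lower(), c.upper()):
            if len(v) == 1:
                out.add(v)
    return out


def char_range(lo, hi):
    """All characters from lo to hi inclusive, by code point."""
    return {chr(i) for i in range(ord(lo), ord(hi) + 1)}


letters = char_range("A", "Z") | char_range("a", "z")

# charRange: "[A-Z]" behaves like "[A-Za-z]".
range_az = fold(char_range("A", "Z"))

# charClassSub: "[A-Z-[IO]]" behaves like "[A-Za-z-[IOio]]".
subtraction = fold(char_range("A", "Z")) - fold(set("IO"))

# negCharGroup: "[^Q]" behaves like "[^Qq]" (fold first, then negate).
not_q = UNIVERSE - fold({"Q"})

# catEsc / complEsc: fold \p{Lu}, then complement the result for \P{Lu}.
p_lu = fold({c for c in UNIVERSE if unicodedata.category(c) == "Lu"})
big_p_lu = UNIVERSE - p_lu

# Back-reference: with re.IGNORECASE, Python also compares \1 case-blind.
backref = re.compile(r"([md])[aeiou]\1", re.IGNORECASE)
samples = ["Mum", "mom", "Dad", "DUD"]
backref_ok = all(backref.fullmatch(s) for s in samples)
```

In this model "d" is a member of p_lu and absent from big_p_lu, matching the
catEsc and complEsc bullets.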

Received on Wednesday, 14 September 2005 16:26:02 UTC