[Bug 1850] [F&O] how do ranges work in case-insensitive mode?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850





------- Additional Comments From mike@saxonica.com  2005-09-15 09:00 -------
Let's now address David's concern about how we define case-variants. I suggest
that, rather than appealing directly to Unicode, we define the term by means of
our own lower-case() and upper-case() functions (which are themselves defined in
terms of Unicode). This gives a better chance of keeping the regex rules and
those functions consistent with each other.

The rule that seems to work is:

For characters C1 and C2, considered as strings of length one, C1 is a
case-variant of C2 if (fn:lower-case(C1) eq fn:lower-case(C2) or
fn:upper-case(C1) eq fn:upper-case(C2)) when compared using the Unicode
codepoint collation.

Under this rule, x212A (Kelvin sign) is a case-variant of "k" and also of "K".
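
As a rough illustration only (not part of the proposal), the rule can be
sketched in Python. This assumes that Python's str.lower() and str.upper()
behave like fn:lower-case and fn:upper-case; both follow the Unicode case
mappings, but they may be based on different Unicode versions, so results for
unusual characters could differ. The function name is_case_variant is invented
for the example.

```python
def is_case_variant(c1: str, c2: str) -> bool:
    """Return True if c1 and c2 are case-variants under the rule above.

    Assumption: str.lower()/str.upper() stand in for fn:lower-case and
    fn:upper-case. Python compares strings code point by code point, which
    plays the role of the Unicode codepoint collation.
    """
    if len(c1) != 1 or len(c2) != 1:
        raise ValueError("both arguments must be strings of length one")
    return c1.lower() == c2.lower() or c1.upper() == c2.upper()


KELVIN_SIGN = "\u212A"

print(is_case_variant(KELVIN_SIGN, "k"))  # True: both lower-case to "k"
print(is_case_variant(KELVIN_SIGN, "K"))  # True: both lower-case to "k"
print(is_case_variant("k", "j"))          # False
```

Note that the Kelvin sign qualifies through the lower-case branch alone, which
is why the rule tests both mappings rather than just one of them.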

This leads to the following revised proposal:

PROPOSAL v2

The detailed rules for the effect of the "i" flag are as follows. In these
rules, one character C2 is considered to be a *case-variant* of another
character C1 if the following XPath expression returns true, when the two
characters are considered as strings of length one, and the Unicode codepoint
collation is used:

fn:lower-case(C1) eq fn:lower-case(C2) 
  or 
fn:upper-case(C1) eq fn:upper-case(C2)

Note that the case-variants of a character under this definition are always
single characters.

1. When a normal character (Char) is used as an atom, it represents the set
containing that character and all its case-variants. For example, the regular
expression "z" will match both "z" and "Z".

2. A character range (charRange) represents the set containing all the
characters that it would match in the absence of the "i" flag, together with
their case-variants. For example, the regular expression "[A-Z]" will match all
the letters A-Z and all the letters a-z. It will also match certain other
characters such as x212A (KELVIN SIGN), since fn:lower-case("&#x212A;") is "k".

This rule applies also to a character range used in a character class
subtraction (charClassSub): thus [A-Z-[IO]] will match characters such as "A",
"B", "a", and "b", but will not match "I", "O", "i", or "o". 

The rule also applies to a character range used as part of a negative character
group: thus [^Q] will match every character except "Q" and "q" (these being the
only case-variants of "Q" in Unicode).

3. A back-reference is compared using case-blind comparison: that is, each
character must either be the same as the corresponding character of the
previously matched string, or must be a case-variant of that character. For
example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular
expression "([md])[aeiou]\1" when the "i" flag is used.

4. All other constructs are unaffected by the "i" flag. For example, "\p{Lu}"
continues to match upper-case letters only.
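
The four rules above can be sketched in Python as a set of membership tests.
This is only an illustration under stated assumptions, not a regex engine:
str.lower() and str.upper() stand in for fn:lower-case and fn:upper-case, all
function names are invented for the example, and the back-reference example is
checked as a whole-string match on three-character strings for simplicity.

```python
import unicodedata


def is_case_variant(c1: str, c2: str) -> bool:
    # Assumption: Python's case mappings approximate fn:lower-case/fn:upper-case.
    return c1.lower() == c2.lower() or c1.upper() == c2.upper()


def atom_matches(pattern_char: str, ch: str) -> bool:
    # Rule 1: a normal character stands for itself and all its case-variants.
    return ch == pattern_char or is_case_variant(pattern_char, ch)


def range_matches(lo: str, hi: str, ch: str) -> bool:
    # Rule 2: the range matches its own characters plus their case-variants.
    return any(atom_matches(chr(cp), ch) for cp in range(ord(lo), ord(hi) + 1))


def subtraction_matches(ch: str) -> bool:
    # [A-Z-[IO]]: both the range and the subtracted group are expanded.
    in_base = range_matches("A", "Z", ch)
    in_subtracted = any(atom_matches(x, ch) for x in "IO")
    return in_base and not in_subtracted


def negated_matches(excluded: str, ch: str) -> bool:
    # [^Q]: everything except the listed characters and their case-variants.
    return not any(atom_matches(x, ch) for x in excluded)


def backref_matches(captured: str, candidate: str) -> bool:
    # Rule 3: case-blind, character-by-character comparison.
    return len(captured) == len(candidate) and all(
        atom_matches(a, b) for a, b in zip(captured, candidate)
    )


def matches_example(s: str) -> bool:
    # "([md])[aeiou]\1" with the "i" flag, applied to a three-character string.
    return (
        len(s) == 3
        and any(atom_matches(c, s[0]) for c in "md")
        and any(atom_matches(c, s[1]) for c in "aeiou")
        and backref_matches(s[0], s[2])
    )


def is_upper_letter(ch: str) -> bool:
    # Rule 4: \p{Lu} ignores the "i" flag and tests the general category only.
    return unicodedata.category(ch) == "Lu"


print(range_matches("A", "Z", "\u212A"))                          # True
print([subtraction_matches(c) for c in "AaIo"])                    # [True, True, False, False]
print([matches_example(s) for s in ("Mum", "mom", "Dad", "DUD")])  # all True
print(is_upper_letter("a"))                                        # False
```

Testing membership against each character of the range avoids having to
enumerate every case-variant in Unicode up front; a real implementation would
more likely precompute the expanded sets.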

  
Michael Kay

Received on Thursday, 15 September 2005 09:00:26 UTC