[Bug 1850] [F&O] how do ranges work in case-insensitive mode?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850





------- Additional Comments From mike@saxonica.com  2005-09-14 12:56 -------
(This is a short proposal, but it's the result of a lot of work - the waste bin
is full of my failed attempts. It's packed with meaning and needs to be read
very carefully, with a close eye on the syntax in Schema Part 2.)


PROPOSAL

The detailed rules for the effect of the "i" flag are as follows. In these
rules, one character is considered to be a *case-variant* of another character
if there is a default case mapping between the two characters as defined in
section 3.13 of [The Unicode Standard]. Note that the case-variants of a
character under this definition are always single characters.
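
As a rough illustration of this definition, the following Python sketch builds
a symmetric table of single-character case-variants. It is an approximation and
not part of the proposal: Python's str.lower() and str.upper() apply full case
mappings from whatever Unicode version the interpreter ships with, so
multi-character results are simply discarded here. The names
build_case_variants and case_variants are invented for this example.

```python
import sys
from collections import defaultdict


def build_case_variants() -> dict[str, set[str]]:
    """Map each character to the characters it is directly case-mapped to or from.

    Assumption: str.lower()/str.upper() stand in for the Unicode default case
    mappings; any mapping that yields more than one character is ignored.
    """
    table: dict[str, set[str]] = defaultdict(set)
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        for mapped in (ch.lower(), ch.upper()):
            if len(mapped) == 1 and mapped != ch:
                # Record the pair in both directions so the relation is symmetric.
                table[ch].add(mapped)
                table[mapped].add(ch)
    return dict(table)


CASE_VARIANTS = build_case_variants()


def case_variants(ch: str) -> set[str]:
    """Return the case-variants of ch, excluding ch itself."""
    return CASE_VARIANTS.get(ch, set())


print(sorted(case_variants("z")))  # ['Z']
print(sorted(case_variants("1")))  # []
```

Because pairs are recorded in both directions, a character such as U+212A
(KELVIN SIGN), which lower-cases to "k", is reported as a case-variant of "k"
even though "k" does not upper-case to it.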

1. When a normal character (Char) is used as an atom, it represents the set
containing that character and all its case-variants. For example, the regular
expression "z" expands to "[zZ]".

2. A character range (charRange) represents the set containing all the
characters that it would match in the absence of the "i" flag, together with
their case-variants. For example, "[A-Z]" expands to "[A-Za-z]". This rule
also applies to a character range used in a character class subtraction
(charClassSub): thus "[A-Z-[IO]]" expands to "[A-Za-z-[IOio]]". It also
applies to a character range used as part of a negative character group: thus
"[^Q]" expands to "[^Qq]".

3. A back-reference is compared using case-blind comparison: that is, each
character must either be the same as the corresponding character of the
previously matched string, or must be a case-variant of that character. For
example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular
expression "([md])[aeiou]\1" when the "i" flag is used.

4. All other constructs are unaffected by the "i" flag. For example, "\p{Lu}"
continues to match upper-case letters only.
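
The examples in rules 1 to 3 can be checked informally against an existing
engine. The sketch below uses Python's re module with re.IGNORECASE as a
stand-in for the "i" flag; that is an assumption, since Python's regex dialect
is not the Schema Part 2 dialect. It has no character class subtraction and no
\p{Lu}, so those examples are omitted, and the comparison is restricted to
ASCII letters and digits. The helper name same_matches is invented for this
example.

```python
import re
import string

# (expression used with the "i" flag, expansion claimed by the rules)
EXPANSIONS = [
    ("z", "[zZ]"),          # rule 1: a normal character used as an atom
    ("[A-Z]", "[A-Za-z]"),  # rule 2: a character range
    ("[^Q]", "[^Qq]"),      # rule 2: a negative character group
]


def same_matches(flagged: str, expanded: str, alphabet: str) -> bool:
    """True if flagged (case-insensitive) and expanded (case-sensitive)
    accept exactly the same single characters from alphabet."""
    for ch in alphabet:
        with_flag = re.fullmatch(flagged, ch, re.IGNORECASE) is not None
        without_flag = re.fullmatch(expanded, ch) is not None
        if with_flag != without_flag:
            return False
    return True


alphabet = string.ascii_letters + string.digits
for flagged, expanded in EXPANSIONS:
    print(flagged, "vs", expanded, same_matches(flagged, expanded, alphabet))

# Rule 3: the back-reference is compared case-blind.
backref = re.compile(r"([md])[aeiou]\1", re.IGNORECASE)
for candidate in ("Mum", "mom", "Dad", "DUD"):
    print(candidate, backref.fullmatch(candidate) is not None)
```

Agreement on this small alphabet only shows that the worked examples are
consistent with one engine's behaviour; it says nothing about characters
outside ASCII, where engines are more likely to differ.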

  
Michael Kay

Received on Wednesday, 14 September 2005 12:58:16 UTC