- From: <bugzilla@wiggum.w3.org>
- Date: Thu, 15 Sep 2005 09:00:17 +0000
- To: public-qt-comments@w3.org
- Cc:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850

------- Additional Comments From mike@saxonica.com 2005-09-15 09:00 -------

Let's now address David's concern about how we define case-variants. I suggest that rather than appealing directly to Unicode, we instead define the term using our own lower-case() and upper-case() functions (which are themselves defined in terms of Unicode). This seems to give a better chance of keeping the two consistent.

The rule that seems to work is:

For characters C1 and C2, considered as strings of length one, C1 is a case-variant of C2 if

  (fn:lower-case(C1) eq fn:lower-case(C2) or fn:upper-case(C1) eq fn:upper-case(C2))

when compared using the Unicode codepoint collation.

Under this rule, x212A (KELVIN SIGN) is a case-variant of "k" and also of "K".

So this leads to the revised proposal as follows:

PROPOSAL v2

The detailed rules for the effect of the "i" flag are as follows. In these rules, one character C2 is considered to be a *case-variant* of another character C1 if the following XPath expression returns true, when the two characters are considered as strings of length one, and the Unicode codepoint collation is used:

  fn:lower-case(C1) eq fn:lower-case(C2) or fn:upper-case(C1) eq fn:upper-case(C2)

Note that the case-variants of a character under this definition are always single characters.

1. When a normal character (Char) is used as an atom, it represents the set containing that character and all its case-variants. For example, the regular expression "z" will match both "z" and "Z".

2. A character range (charRange) represents the set containing all the characters that it would match in the absence of the "i" flag, together with their case-variants. For example, the regular expression "[A-Z]" will match all the letters A-Z and all the letters a-z. It will also match certain other characters such as x212A (KELVIN SIGN), since fn:lower-case of x212A is "k", which is also the value of fn:lower-case("K").
This rule applies also to a character range used in a character class subtraction (charClassSub): thus [A-Z-[IO]] will match characters such as "A", "B", "a", and "b", but will not match "I", "O", "i", or "o".

The rule also applies to a character range used as part of a negative character group: thus [^Q] will match every character except "Q" and "q" (these being the only case-variants of "Q" in Unicode).

3. A back-reference is compared using case-blind comparison: that is, each character must either be the same as the corresponding character of the previously matched string, or must be a case-variant of that character. For example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular expression "([md])[aeiou]\1" when the "i" flag is used.

4. All other constructs are unaffected by the "i" flag. For example, "\p{Lu}" continues to match upper-case letters only.

Michael Kay
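As an illustration only, the case-variant test and the case-blind comparison of rule 3 can be sketched in Python. This is not part of the proposal: it assumes that Python's str.lower() and str.upper() behave closely enough to fn:lower-case() and fn:upper-case() for the characters tried, and that Python's == on strings stands in for comparison under the Unicode codepoint collation. The function names is_case_variant, case_variants, and case_blind_equal are hypothetical.

```python
import sys


def is_case_variant(c1: str, c2: str) -> bool:
    """True if c1 and c2 are case-variants under the proposed rule."""
    if len(c1) != 1 or len(c2) != 1:
        raise ValueError("both arguments must be strings of length one")
    # Assumption: str.lower()/str.upper() stand in for
    # fn:lower-case()/fn:upper-case(); == compares code points.
    return c1.lower() == c2.lower() or c1.upper() == c2.upper()


def case_variants(c: str) -> set[str]:
    """Brute-force scan of all code points for the case-variants of c.

    The result includes c itself. Slow (about a million iterations),
    but enough to check small examples.
    """
    if len(c) != 1:
        raise ValueError("argument must be a string of length one")
    lower, upper = c.lower(), c.upper()
    return {
        ch
        for ch in map(chr, range(sys.maxunicode + 1))
        if ch.lower() == lower or ch.upper() == upper
    }


def case_blind_equal(s1: str, s2: str) -> bool:
    """Rule 3: character-by-character comparison allowing case-variants."""
    return len(s1) == len(s2) and all(
        a == b or is_case_variant(a, b) for a, b in zip(s1, s2)
    )


if __name__ == "__main__":
    kelvin = "\u212A"  # KELVIN SIGN
    print(is_case_variant(kelvin, "k"), is_case_variant(kelvin, "K"))
    print(sorted(hex(ord(ch)) for ch in case_variants("K")))
    print(case_blind_equal("M", "m"), case_blind_equal("D", "m"))
```

A real regular expression engine would presumably precompute the variants for each atom or range rather than scanning every code point; the scan is here only to make the definition concrete.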
Received on Thursday, 15 September 2005 09:00:26 UTC