- From: <bugzilla@wiggum.w3.org>
- Date: Wed, 14 Sep 2005 16:25:49 +0000
- To: public-qt-comments@w3.org
- Cc:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850

------- Additional Comments From holstege@mathling.com 2005-09-14 16:25 -------

First, I'd like to thank Michael for this proposal. It is certainly clear, and while some of the resulting behaviours are perhaps unexpected, I think that is inevitable in this area.

Acknowledging Michael's comments about the overflowing trashbin (and contributing a few crumpled sheets there myself), I nevertheless find myself unhappy with talking about "expanding" the regular expression, and would prefer to speak of case-folding as applying to how the input string is matched.

From an implementation point of view, expanding regular expressions has to be done on a case-by-case basis (no pun intended!). While that doesn't make it impossible to cache regular expressions (i.e. pre-analyze and parse them), it does make it trickier and less useful to do so, as the regular expression itself is no longer a sufficient key to what the analyzed regular expression is.

A consequence of this shift would be that case-folding would apply uniformly, so that, for example:

    fn:matches( "d", "\p{Lu}", "i" ) = fn:matches( "d", "[A-Z]", "i" )

which is not the case under Michael's proposal. I would go on to argue that it would be good if both of these were true. One reason for making this so is that Datatypes says that "\P{Lu}" == [^\p{Lu}], and therefore you get some odd inconsistencies if you don't apply the case-folding to the category escapes as well.

All of which sums up to putting an obligation on me to come up with a counter-proposal. My general tack on this is to tweak two statements in XML Schema Datatypes: the one that defines what set of strings a character denotes, and the one that defines what set of strings a character class denotes.
But I think Michael's case-by-case exposition is most excellent and clear, and so I continue with that, tweaking the verbiage to avoid the "expands" phrasing, treating it as clarification because those two rules are sufficient, and adding the additional cases that Michael's proposal doesn't touch.

COUNTER-PROPOSAL:

The detailed rules for the effect of the "i" flag are as follows. In these rules, one character is considered to be a *case-variant* of another character if there is a default case mapping between the two characters as defined in section 3.13 of [The Unicode Standard]. Note that the case-variants of a character under this definition are always single characters.

The rules for regular expressions in [XML Schema Part 2: Datatypes Second Edition] are modified under the influence of the "i" flag in the following way:

1. A normal character c denotes a set of strings that contains one single-character string "x" for each character x that is either c or a case-variant of c.

2. A character class C denotes a set of strings that contains one single-character string "x" for each character x that is either in the class or is a case-variant of some character in the class.

Specifically, the application of these rules means:

* When a normal character (Char) is used as an atom, it represents the set containing that character and all its case-variants. For example, the regular expression "z" matches the same set of characters as "[zZ]".

* A character range (charRange) is a character class, and therefore represents the set containing all the characters that it would match in the absence of the "i" flag, together with their case-variants. For example, "[A-Z]" matches the same set of characters as "[A-Za-z]".

* A character range used in character class subtraction (charClassSub) also represents the set containing all the characters that it would match in the absence of the "i" flag, together with their case-variants.
  For example, "[A-Z-[IO]]" matches the same set of characters as "[A-Za-z-[IOio]]".

* A negative character group (negCharGroup) is also a character class and the same rule applies. For example, "[^Q]" matches the same set of characters as "[^Qq]".

* A category escape (catEsc) is also a character class and the same rule applies. For example, "\p{Lu}" matches all the upper case letters and their case-variants, and thus the string "d" would match "\p{Lu}".

* A complement category escape (complEsc) is also a character class and the same rule applies. For example, "\P{Lu}" matches all letters that are neither upper case nor one of those characters' case-variants. Therefore "d" would not match "\P{Lu}".

* The same rule applies to single-character (SingleCharEsc) and multi-character (MultiCharEsc) escapes, although in practice this will have no effect.

* A back-reference is compared using case-blind comparison: that is, each character must either be the same as the corresponding character of the previously matched string, or must be a case-variant of that character. For example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular expression "([md])[aeiou]\1" when the "i" flag is used.
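
To make the two rules and the examples above concrete, here is a minimal sketch in Python. It is an illustration only, not part of the proposed spec text. All function names are hypothetical. Python's str.upper(), str.lower() and str.title() stand in for the Unicode default case mappings, and only mappings that start from the character being tested are considered, so this is an approximation of the case-variant definition. The complement case follows the reading given in the complEsc bullet (the complement of the case-folded positive class).

```python
import unicodedata


def case_variants(c):
    """Approximate single-character case-variants of c.

    Assumption: Python's built-in case conversions stand in for the
    Unicode default case mappings; multi-character results are dropped.
    """
    candidates = {c.upper(), c.lower(), c.title()}
    return {v for v in candidates if len(v) == 1 and v != c}


def char_matches(c, x):
    """Rule 1: under "i", normal character c matches x if x is c or a case-variant."""
    return x == c or x in case_variants(c) or c in case_variants(x)


def class_matches(in_class, x):
    """Rule 2: x matches if it is in the class or is a case-variant of a member."""
    return in_class(x) or any(in_class(v) for v in case_variants(x))


def complement_matches(in_class, x):
    """Complement escape, read as the complement of the case-folded class."""
    return not class_matches(in_class, x)


def backref_equal(previous, current):
    """Case-blind comparison of a back-reference against the captured string."""
    return len(previous) == len(current) and all(
        char_matches(p, q) for p, q in zip(previous, current)
    )


def is_upper(ch):
    """Stand-in for the category escape \\p{Lu}."""
    return unicodedata.category(ch) == "Lu"


def in_a_to_z(ch):
    """Stand-in for the character range [A-Z]."""
    return "A" <= ch <= "Z"


def matches_example(s):
    """Hand-coded matcher for "([md])[aeiou]\\1" with the "i" flag."""
    return (
        len(s) == 3
        and class_matches(lambda ch: ch in "md", s[0])
        and class_matches(lambda ch: ch in "aeiou", s[1])
        and backref_equal(s[0], s[2])
    )


if __name__ == "__main__":
    print(class_matches(is_upper, "d"), class_matches(in_a_to_z, "d"))
    print(complement_matches(is_upper, "d"))
    print([matches_example(s) for s in ("Mum", "mom", "Dad", "DUD", "Mud")])
```

Because the folding is applied when testing the input character rather than by rewriting the pattern, the class predicates themselves do not depend on the flag, which is the property argued for above with respect to caching.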
Received on Wednesday, 14 September 2005 16:26:02 UTC