- From: <bugzilla@wiggum.w3.org>
- Date: Wed, 31 Aug 2005 21:34:01 +0000
- To: public-qt-comments@w3.org
- Cc:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850

------- Additional Comments From mike@saxonica.com 2005-08-31 21:34 -------

Mary, I'm having trouble understanding exactly what you mean by:

> Likewise for negative character ranges and so on. That is, you don't mess
> with the pattern, you check the input string with case folding against the
> pattern as written.

I was originally going to propose a spec which might be what you're suggesting:

Under the "i" flag, a string S matches a regex R if there is some case-variant S' of S such that S' matches R in the absence of the "i" flag. A string S' is a case-variant of S if the two strings are the same length and there is a default case mapping between each pair of corresponding characters in the two strings, as defined in section 3.13 of [The Unicode Standard].

This rule seems nice and simple, but it doesn't appear to be the same as Java or Perl, and one must ask whether it is (a) usable, and (b) implementable. It certainly has some surprises, for example "D" matches "[^D]" (because "d" matches "[^D]").

I think I will go back to proposing that the tricky cases should be errors. The rule I propose is: when the "i" flag is used, the regex must not include any of the following:

* a negative character group
* a character class subtraction
* a category escape (catEsc, complEsc, or charProp)
* any of the multi-character escapes \c, \i, \C, \I
* a back-reference

If any of these is present when the "i" flag is used, error FORGNNNN is raised.

The semantics of the "i" flag is then: a string S matches the regex R under the "i" flag if there exists a string S' that is a case-variant of S such that S' matches R in the absence of the "i" flag, with "case-variant" defined as above. In cases where it is necessary to know which characters matched (for example when $n appears in the replacement string of fn:replace()), the characters that matched are those from the original string S, not from S'.

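As a rough illustration of the case-variant rule, here is a minimal Python sketch. It rests on several assumptions that are not part of the proposal: Python's re module stands in for the XML Schema regex dialect, str.lower() and str.upper() approximate the default case mappings of the Unicode Standard (only single-character results are kept, so the length is preserved), and "matches" is read as matching the whole string. The names char_variants, case_variants and matches_i are hypothetical.

```python
import itertools
import re


def char_variants(ch):
    # Approximation of the default case mappings: the character itself plus
    # its lower- and upper-case forms, dropping any mapping that would
    # change the string length (e.g. one character mapping to two).
    candidates = {ch, ch.lower(), ch.upper()}
    return sorted(c for c in candidates if len(c) == 1)


def case_variants(s):
    # Every same-length string S' whose characters are each a case mapping
    # of the corresponding character of S.
    for combo in itertools.product(*(char_variants(c) for c in s)):
        yield "".join(combo)


def matches_i(s, pattern):
    # S matches R under "i" if some case-variant S' matches R without "i".
    return any(re.fullmatch(pattern, v) for v in case_variants(s))


print(matches_i("d", "D"))      # True
print(matches_i("D", "[^D]"))   # True: the surprise, via the variant "d"
print(matches_i("e", "D"))      # False
```

The number of case-variants grows exponentially with the length of S, so this brute-force enumeration only illustrates the definition; it says nothing about how an implementation would realistically do it.
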
The definition of fn:replace() contains the rule: "If two alternatives within the pattern both match at the same position in the $input, then the match that is chosen is the one matched by the first alternative". I think it would be prudent to relax this rule so that when the "i" flag is used, it is implementation-dependent which match is chosen. That is, if the input string is "a" and the regex is "A|a", it's undefined whether the "A" or the "a" is matched.

Michael Kay
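To make the "A|a" example concrete, the following Python sketch shows which alternative each case-variant of the input "a" reaches. It assumes Python's re module as a stand-in for the XPath regex dialect; the group names "first" and "second" and the helper matching_alternatives are hypothetical and exist only to label the two alternatives.

```python
import re

# The regex "A|a", with each alternative wrapped in a named group so the
# alternative that matched can be reported.
PATTERN = re.compile(r"(?P<first>A)|(?P<second>a)")


def matching_alternatives(variants):
    # For each case-variant, record which alternative matches it when the
    # pattern is applied without any case-insensitive flag.
    hits = {}
    for variant in variants:
        m = PATTERN.fullmatch(variant)
        if m:
            hits[variant] = m.lastgroup
    return hits


# The two case-variants of the input string "a".
print(matching_alternatives(["A", "a"]))
# {'A': 'first', 'a': 'second'}
```

Both alternatives are reachable through some case-variant of the same input, so the "first alternative wins" rule gives no single natural answer here.
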
Received on Wednesday, 31 August 2005 21:34:06 UTC