[Bug 1850] [F&O] how do ranges work in case-insensitive mode?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850

------- Additional Comments From mike@saxonica.com  2005-08-31 21:34 -------
Mary, I'm having trouble understanding exactly what you mean by:

> Likewise for negative character ranges and so on.
>
> That is, you don't mess with the pattern, you check the input string with case
> folding against the pattern as written.

I was originally going to propose a spec which might be what you're suggesting:
Under the "i" flag, a string S matches a regex R if there is some case-variant
S' of S such that S' matches R in the absence of the "i" flag. A string S' is a
case-variant of S if the two strings are the same length and there is a default
case mapping between each pair of corresponding characters in the two strings,
as defined in section 3.13 of [The Unicode Standard].

This rule seems nice and simple, but it doesn't appear to match what Java or
Perl do, and one must ask whether it is (a) usable, and (b) implementable. It
certainly has some surprises: for example, "D" matches "[^D]" (because "d"
matches "[^D]").
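
To make the surprise concrete, here is a rough Python sketch of the case-variant
rule. It is an illustration only, not proposed spec text: the names
case_variants and matches_i are made up for this example, Python's re syntax
stands in for the F&O regex dialect, str.lower()/str.upper() approximate the
Unicode default case mappings, and "matches" is taken to mean a match of the
whole string.

```python
import re
from itertools import product

def case_variants(s):
    """Yield every same-length string obtained by case-mapping characters of s.

    Assumption: lower()/upper() stand in for the Unicode default case mapping;
    mappings that change the length (e.g. "ß" -> "SS") are ignored.
    """
    choices = [
        sorted({v for v in (c, c.lower(), c.upper()) if len(v) == 1})
        for c in s
    ]
    for combo in product(*choices):
        yield "".join(combo)

def matches_i(regex, s):
    """True if some case-variant of s matches regex without the "i" flag."""
    return any(re.fullmatch(regex, v) for v in case_variants(s))

print(matches_i("[^D]", "D"))   # the surprising case from the text
print(matches_i("[a-c]", "B"))  # an ordinary range
```

The enumeration is exponential in the length of the string, so this is only a
way of reading the definition, not a suggestion for how to implement it.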

I think I will go back to proposing that the tricky cases should be errors. The
rule I propose is: when the "i" flag is used, the regex must not include any of
the following:

* a negative character group
* a character class subtraction
* a category escape (catEsc, complEsc, or charProp)
* any of the multi-character escapes \c, \i, \C, \I
* a back-reference

If any of these is present when the "i" flag is used, error FORGNNNN is raised.
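
As a rough idea of what the check might involve, here is a Python sketch that
scans a regex for the constructs listed above. The function name
forbidden_constructs is hypothetical, the scan is deliberately simplistic (it
does not validate the regex, and it does not distinguish an escaped hyphen
before "[" from a real subtraction), and it only reports what it finds; a real
processor would raise the error instead.

```python
def forbidden_constructs(regex):
    """Return labels for constructs that the proposal disallows under "i".

    Simplified scanner: assumes the regex is otherwise well-formed.
    """
    found = []
    depth = 0  # nesting depth of character-class brackets
    i = 0
    while i < len(regex):
        ch = regex[i]
        if ch == "\\" and i + 1 < len(regex):
            nxt = regex[i + 1]
            if nxt in "pP":
                found.append("category escape")
            elif nxt in "cCiI":
                found.append("multi-character escape")
            elif nxt in "123456789" and depth == 0:
                found.append("back-reference")
            i += 2  # skip the escaped character
            continue
        if ch == "[":
            if depth > 0 and regex[i - 1] == "-":
                found.append("character class subtraction")
            depth += 1
            if regex[i + 1:i + 2] == "^":
                found.append("negative character group")
                i += 1  # skip the "^"
        elif ch == "]" and depth > 0:
            depth -= 1
        i += 1
    return found

for r in ["[^D]", "[a-z-[aeiou]]", r"\p{Lu}", r"\i\c*", r"(a)\1", "abc"]:
    print(r, forbidden_constructs(r))
```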

The semantics of the "i" flag are then: a string S matches the regex R under the
"i" flag if there exists a string S' that is a case-variant of S such that S'
matches R in the absence of the "i" flag; with "case-variant" defined as above.

In cases where it is necessary to know which characters matched (for example
when $n appears in the replacement string of fn:replace()), the characters that
matched are those from the original string S, not from S'.
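
Because S and S' have the same length, the positions matched in S' can be used
directly to pick the characters out of S. A small Python sketch of that idea
follows; the helper names are invented for illustration, Python's re module
stands in for the F&O regex dialect, lower()/upper() approximate the default
case mappings, and the match is anchored to the whole string for simplicity.

```python
import re
from itertools import product

def case_variants(s):
    # Same-length variants only; lower()/upper() approximate the
    # Unicode default case mapping (an assumption of this sketch).
    choices = [
        sorted({v for v in (c, c.lower(), c.upper()) if len(v) == 1})
        for c in s
    ]
    return ("".join(combo) for combo in product(*choices))

def group_text_i(regex, s, n):
    """Text of group n, taken from the original s rather than the variant."""
    for variant in case_variants(s):
        m = re.fullmatch(regex, variant)
        if m:
            start, end = m.span(n)   # offsets found in the variant S'
            return s[start:end]      # characters reported from S
    return None

print(group_text_i("a(b+)c", "ABbC", 1))
```

Here only the variant "abbc" matches, but the text reported for group 1 keeps
the casing of the input.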

The definition of fn:replace() contains the rule: "If two alternatives within
the pattern both match at the same position in the $input, then the match that
is chosen is the one matched by the first alternative". I think it would be
prudent to relax this rule so that when the "i" flag is used, it is
implementation-dependent which match is chosen. That is, if the input string is
"a" and the regex is "A|a", it's undefined whether the "A" or the "a" is matched.
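
The reason the choice becomes arbitrary can be seen by trying each case-variant
in turn: which alternative matches depends on which variant is picked. The
sketch below is illustrative only; alternatives_hit is a made-up helper, the
alternatives are wrapped in groups purely so that Python's re module can report
which one matched, and lower()/upper() approximate the default case mappings.

```python
import re
from itertools import product

def case_variants(s):
    # Same-length variants only (an assumption of this sketch).
    choices = [
        sorted({v for v in (c, c.lower(), c.upper()) if len(v) == 1})
        for c in s
    ]
    return ("".join(combo) for combo in product(*choices))

def alternatives_hit(regex, s):
    """Set of group numbers that match for at least one case-variant of s."""
    hit = set()
    for variant in case_variants(s):
        m = re.fullmatch(regex, variant)
        if m:
            hit.add(m.lastindex)  # number of the group that took part
    return hit

# "A|a" written with one group per alternative.
print(alternatives_hit("(A)|(a)", "a"))
```

Both alternatives are reachable for the single input "a", so a "first
alternative wins" rule would have to say something about the order in which
variants are considered.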

Michael Kay

Received on Wednesday, 31 August 2005 21:34:06 UTC