[Bug 1850] [F&O] how do ranges work in case-insensitive mode?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850

------- Additional Comments From mike@saxonica.com  2005-09-14 19:12 -------
Use of the word "expand" was perhaps a bit careless. I only used it in examples,
and by saying "A expands to B" I was merely trying to find a shorter way of
saying "A with the i flag set matches the same set of strings as B without the i
flag set". It wasn't intended to describe an algorithm, let alone an
implementation (though I probably had one at the back of my mind).
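
The "A expands to B" shorthand can be illustrated with a small check. This is
only a sketch: it uses Python's re module as a stand-in for the F&O regex
dialect, restricts itself to single ASCII characters, and the helper name
matched_chars is made up for the illustration.

```python
import re

# Restrict the check to single ASCII characters to keep it small.
ASCII = [chr(i) for i in range(128)]

def matched_chars(pattern, flags=0):
    """Return the set of single ASCII characters the pattern fully matches."""
    return {ch for ch in ASCII if re.fullmatch(pattern, ch, flags)}

# "A with the i flag set" ...
with_i = matched_chars(r"[a-c]", re.IGNORECASE)
# ... "matches the same set of strings as B without the i flag set".
without_i = matched_chars(r"[a-cA-C]")

print(sorted(with_i))
print(with_i == without_i)
```

Nothing here says how an implementation gets from A to B; it only compares the
two sets of matched strings.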

I appreciate what you're trying to achieve, which I think I can paraphrase as
"if matches(S, P, "") is true, then matches(V(S), P, "i") is true if and only if
V(S) is a case-variant of S." However, I don't think your proposal achieves
this, and in fact I don't think it's a good idea anyway.

The proposal has some problems. It's not true that a character range
(charRange) is a character class (charClass), and it's not true that a negative
character group is a character class. It is true that "[^Q]" is a charClass,
but if we accept your rule 2, then I think the consequence is that with the "i"
flag [^Q] matches every character: without the flag it matches "q", so with the
flag it must also match the case-variant "Q". I think the meaning [^qQ] is more
intuitive, and that's why I decided to move the rule down to the level of a
charRange.
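
The difference between the two readings of "[^Q]" can be modelled with plain
sets. This is a sketch under stated assumptions: the alphabet is limited to
ASCII letters, a case-variant is taken to be the result of lower() or upper(),
and the names widen, class_level and range_level are invented for the example.

```python
import string

# Toy alphabet: ASCII letters only.
ALPHABET = set(string.ascii_letters)

def widen(chars):
    """Add the lower- and upper-case variants of every character in chars."""
    out = set()
    for ch in chars:
        out |= {ch, ch.lower(), ch.upper()}
    return out

# "[^Q]" without the i flag: everything except "Q".
plain_not_q = ALPHABET - {"Q"}

# Rule applied at the charClass level: widen what the negated group matches.
# "q" is matched, so its case-variant "Q" gets pulled back in.
class_level = widen(plain_not_q)

# Rule applied at the charRange level: widen "Q" first, then negate.
range_level = ALPHABET - widen({"Q"})

print(class_level == ALPHABET)
print(sorted(ALPHABET - range_level))
```

Under the first reading the negation is cancelled out entirely; under the
second the group behaves like [^qQ].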

It would be possible to define that a charClassEsc (such as \p{Lu}) matches
case-variants of its "normal" set of strings. The reason I didn't do this was
again to do with complements and subtraction. If you widen \p{Lu} to include
case-variants of its usual characters, do you retain the meaning that \P{Lu} is
the complement of \p{Lu} (in which case it matches a smaller set of characters
than it did before), or do you retain the meaning that it matches all the
characters it would normally match plus their case-variants (a larger set than
before)? I felt it was best to cop out here and say its meaning is unchanged. In
practice, I don't think this is a big problem, because most of the character
blocks and categories already include case-variants of their characters, and
those that don't, like Lu and Ll, exclude them very deliberately.
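
The two candidate meanings for \P{Lu} pull in opposite directions, which a
small set model makes visible. Again this is only a sketch: the universe is
limited to ASCII, case-variants come from lower() and upper(), and the variable
names are invented for the illustration.

```python
import unicodedata

# Toy universe: the 128 ASCII characters.
UNIVERSE = {chr(i) for i in range(128)}

def widen(chars):
    """Add the lower- and upper-case variants of every character in chars."""
    out = set()
    for ch in chars:
        out |= {ch, ch.lower(), ch.upper()}
    return out

# \p{Lu} and \P{Lu} with their usual meaning.
lu = {ch for ch in UNIVERSE if unicodedata.category(ch) == "Lu"}
not_lu = UNIVERSE - lu

# Option 1: \P{Lu} stays the complement of the widened \p{Lu}.
complement_of_widened = UNIVERSE - widen(lu)

# Option 2: \P{Lu} matches its usual characters plus their case-variants.
widened_complement = widen(not_lu)

print(len(not_lu), len(complement_of_widened), len(widened_complement))
```

Option 1 shrinks the set (it loses "a" to "z"), while option 2 grows it until
it covers the whole toy universe.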

Michael Kay

Received on Wednesday, 14 September 2005 19:12:15 UTC