[Bug 1850] [F&O] how do ranges work in case-insensitive mode?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850

------- Additional Comments From holstege@mathling.com  2005-09-14 19:41 -------
If we rephrase "expands", I'm happier with your proposal even if we touch
nothing else. I'd still prefer to state a general rule rather than take it
case by case, but I could live without one.
 
> I think there are some problems with your proposal. It's not true that a
> character range (charRange) is a character class (charClass), and it's not
> true that a negative character group is a character class.

Uh, yes it is. XML Schema Part 2 says:
[11]  charClass      ::=  charClassEsc | charClassExpr | WildcardEsc
[12]  charClassExpr  ::=  '[' charGroup ']'
[13]  charGroup      ::=  posCharGroup | negCharGroup | charClassSub
[23]  charClassEsc   ::=  ( SingleCharEsc | MultiCharEsc | catEsc | complEsc )

I can fill in the posCharGroup and negCharGroup and so on, but I think you
get the idea. Everything is a charClass.

I see your point with \p{Lu} and \P{Lu}; let's think about that a bit out loud
to see where we get:

Let's just say, for brevity's sake, that normally \p{Lu} denotes the set
{"A","B"}.  \P{Lu} = [^\p{Lu}], so sayeth Datatypes, so it denotes a set of
lots and lots of single-character strings, including "a" and "b".
If instead of using the handy abbreviation \p{Lu} we had spelled it out as
[AB], that would denote the set {"A","B"}, and the complement [^AB] would
denote a set containing lots and lots of single-character strings, including
"a" and "b". So this is all consistent.

Under the rules of the "i" flag, if we say \p{Lu} means what it means with
other character classes, it denotes the set {"A", "B", "a", "b"}. Following
the equation from Datatypes we get that \P{Lu} denotes a set with lots and
lots of characters but not "a" or "b".  If we had written out \p{Lu} as [AB]
that would also have denoted the set {"A","B","a","b"} and the complement
[^AB] would have also denoted the set with lots and lots of characters but not 
"a" or "b".  So again, this is entirely consistent.
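
As a rough illustration of this uniform rule, here is a toy model in Python that treats each character class as the set of single-character strings it matches. The five-character universe, the stand-in for \p{Lu}, and the helper name case_variants are all assumptions made up for the example; none of this is an algorithm taken from Datatypes or F&O.

```python
# Toy model: a character class is the set of single-character strings it matches.
UNIVERSE = frozenset({"A", "B", "a", "b", "1"})  # assumed tiny alphabet
LU = frozenset({"A", "B"})                        # stand-in for \p{Lu}


def case_variants(chars):
    """Return chars plus the upper- and lower-case variant of each member."""
    return frozenset(v for c in chars for v in (c, c.upper(), c.lower()))


# "i" flag applied uniformly to every character class.
p_lu_i = case_variants(LU)                      # \p{Lu} under "i"
big_p_lu_i = UNIVERSE - p_lu_i                  # \P{Lu} = [^\p{Lu}]
bracket_ab_i = case_variants(frozenset("AB"))   # [AB] under "i"
neg_bracket_ab_i = UNIVERSE - bracket_ab_i      # [^AB] under "i"

# The abbreviation and the spelled-out form denote the same sets.
print(p_lu_i == bracket_ab_i, big_p_lu_i == neg_bracket_ab_i)
```

In this model the escape and the bracket expression agree, and neither complement contains "a" or "b".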

Suppose, however, that under the rules of the "i" flag, we leave \p{Lu} and
\P{Lu} alone. Then \p{Lu} denotes the set {"A","B"}, and \P{Lu} denotes the
set with lots and lots of single-character strings including "a" and "b".
If, not knowing this handy abbreviation, I had written out \p{Lu} as [AB],
I would denote a different set under the "i" flag: {"A","B","a","b"}. Likewise
[^AB] would denote a set that does not include "a" and "b".
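
The mismatch can be seen in a toy set model, sketched here in Python. The small universe, the stand-in for \p{Lu}, and the helper name add_case_variants are hypothetical, chosen only to mirror the {"A","B"} abbreviation used in this message.

```python
# Toy model: a character class is the set of single-character strings it matches.
ALPHABET = frozenset({"A", "B", "a", "b", "1"})  # assumed tiny alphabet
UPPER = frozenset({"A", "B"})                     # stand-in for \p{Lu}


def add_case_variants(chars):
    """Return chars plus the upper- and lower-case variant of each member."""
    return frozenset(v for c in chars for v in (c, c.upper(), c.lower()))


# Category escapes left alone under "i"; bracket expressions still widened.
p_lu_plain = UPPER                                # \p{Lu}, untouched
big_p_lu_plain = ALPHABET - p_lu_plain            # \P{Lu}, untouched
bracket_ab = add_case_variants(frozenset("AB"))   # [AB] under "i"
neg_bracket_ab = ALPHABET - bracket_ab            # [^AB] under "i"

# Same intended class, two different denotations.
print(sorted(p_lu_plain), sorted(bracket_ab))
print(sorted(big_p_lu_plain), sorted(neg_bracket_ab))
```

Here "a" lands in \P{Lu} but not in [^AB], even though the two were written to mean the same thing.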

I find this inconsistency pretty baffling to explain, and having to
special-case it makes implementation harder.  So I think we should apply the
rule consistently across all character classes.

Received on Wednesday, 14 September 2005 19:41:29 UTC