[Bug 1850] [F&O] how do ranges work in case-insensitive mode?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850





------- Additional Comments From mike@saxonica.com  2005-09-14 22:07 -------
Response to Mary:

I said:

* it's not true that a negative character group is a character class. 

You said:
Uh, yes it is. It does say in XML Schema Part 2:
[11]   	charClass	   ::=   	charClassEsc | charClassExpr | WildcardEsc
[12]   	charClassExpr	   ::=   	'[' charGroup ']'
[13]   	charGroup	   ::=   	posCharGroup | negCharGroup | charClassSub
[23]   	charClassEsc	   ::=   	( SingleCharEsc | MultiCharEsc | catEsc | complEsc )

I can fill in the posCharGroup and negCharGroup and so on, but I think you
get the idea. Everything is a charClass.

I say: oh no it isn't!

A negative character group is a charGroup, and a charGroup *enclosed in square
brackets* is a charClass. But a negative character group on its own, without the
square brackets, is not a charClass.

As regards \P{Lu}, you can maintain either one of these two invariants:

(a) \P{Lu} == [^\p{Lu}]

(b) if matches("X", P, "") then matches("x", P, "i") for any regex P

but you can't maintain both.

I think your logic is flawed here:

"If we had written out \p{Lu} as [AB]
that would also have denoted the set {"A","B","a","b"} and the complement
[^AB] would have also denoted the set with lots and lots of characters but not 
"a" or "b".  So again, this is entirely consistent."

You're relying here on [^AB] meaning [^ABab]. But under your proposal that's not
what it means. Under your proposal [^AB] matches every character. [^AB] is a
charClass, therefore rule 2 applies, which says

A character class C denotes a set of strings that contains one
   single-character string "x" for each character x that is either in the class
   or is a case-variant of some character in the class. 

If I'm reading that correctly (perhaps I'm not?) you're saying "a" is in the
class [^AB], therefore "A" is also in the class [^AB].
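To make that reading concrete, here is a minimal Python sketch that models character classes as plain sets. It is only an illustration under stated assumptions: the alphabet is restricted to ASCII letters and digits, case variants are taken from str.lower()/str.upper(), and the names ALPHABET and case_closure are hypothetical helpers, not anything defined in F&O or XML Schema.

```python
import string

# Assumption: a tiny stand-in alphabet instead of all of Unicode.
ALPHABET = set(string.ascii_letters + string.digits)


def case_closure(chars):
    """Return chars plus every case variant of each character (rule 2 as quoted)."""
    closed = set()
    for ch in chars:
        closed.add(ch.lower())
        closed.add(ch.upper())
    return closed


# [^AB] with no flags: every character except "A" and "B".
neg_ab = ALPHABET - {"A", "B"}

# Rule 2 applied to the charClass [^AB]: add the case variants of its members.
# "a" and "b" are members, so "A" and "B" come back in.
neg_ab_rule2 = case_closure(neg_ab)

print("A" in neg_ab)             # False
print("A" in neg_ab_rule2)       # True
print(neg_ab_rule2 == ALPHABET)  # True
```

Under this model the closure of [^AB] is the whole alphabet, which is the "matches every character" outcome described above.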

In my proposal I'm breaking invariant (b): I'm saying that [^AB] is a *smaller*
set of characters under the "i" flag than in the absence of the "i" flag. I
think that's the right thing to do. Having already broken that invariant, I'm
then retaining invariant (a) with my proposed treatment of charClassEsc.
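The same kind of set model can sketch my proposal. Again this is only an illustration, not normative text: it assumes an ASCII-only alphabet, uses the uppercase ASCII letters as a stand-in for \p{Lu}, and the names negated_group and case_closure are hypothetical. Invariant (a) holds here by construction, because \P{Lu} is modelled as the negated group over \p{Lu}.

```python
import string

# Assumption: ASCII letters and digits stand in for all of Unicode.
ALPHABET = set(string.ascii_letters + string.digits)
UPPER = set(string.ascii_uppercase)  # stand-in for \p{Lu}


def case_closure(chars):
    """Return chars plus every case variant of each character."""
    return {v for ch in chars for v in (ch.lower(), ch.upper())}


def negated_group(chars, ignore_case=False):
    """Model [^...]: close the listed characters under case first, then complement."""
    base = case_closure(chars) if ignore_case else set(chars)
    return ALPHABET - base


plain = negated_group({"A", "B"})                      # [^AB] with no flags
folded = negated_group({"A", "B"}, ignore_case=True)   # [^AB] with the "i" flag

# Invariant (b) is broken: "a" matches [^AB] without "i" but not with it.
print("a" in plain, "a" in folded)   # True False
print(folded < plain)                # True: a strictly smaller set under "i"

# Invariant (a) is kept: \P{Lu} is modelled as [^\p{Lu}] in both modes.
not_lu_folded = negated_group(UPPER, ignore_case=True)
print(not_lu_folded == set(string.digits))  # True: no letters remain under "i"
```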

Michael Kay

Received on Wednesday, 14 September 2005 22:07:38 UTC