Re: XML Chunk Equality from Elliotte Harold on 2004-10-27 (www-tag@w3.org from October 2004)

From: Elliotte Harold <elharo@metalab.unc.edu>
Date: Wed, 27 Oct 2004 06:40:38 -0400
To: Chris Lilley <chris@w3.org>
CC: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
Message-ID: <417F7B26.1050600@metalab.unc.edu>

Chris Lilley wrote:

> EH> Case conversion is fundamentally a locale sensitive operation. The
> EH> question of which characters are uppercase variants of which
> EH> characters depends on language.
> 
> For natural language processing, yes, which this is not.

All case conversion is natural language processing. You can no more 
convert case without reference to a language than you can spell check. 
There is no divinely ordained case mapping that applies irrespective of 
natural language.

> Or not introduce the locale into the processing model in the first
> place, thus giving even better consistency.

Impossible. The question of case mapping is meaningless without 
reference to specific languages and character repertoires.

> EH>  There is no fundamental rule that says I is the
> EH> uppercase form of i.
> 
> I believe I pointed to one, and explained its relevance to the current
> specific situation. Again, you deleted the reference and didn't discuss
> it.

I'm not sure what you're referring to here. I went back and read your 
messages and I still see no fundamental rule that says I is the
uppercase form of i.

> That would be before the Unicode case folding tables, then.

The Unicode case folding tables are at least consistent across ASCII. 
They get trickier in some of the other character blocks. However, the 
finding does not reference these tables. If that is its intent, then it 
needs to do so explicitly, and say something like "Languages are 
compared after case folding according to the Unicode 4.0 case mapping 
tables". My wording isn't good enough, but you get the idea.
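
To illustrate "consistent across ASCII, trickier elsewhere", here is a 
small sketch using Python's built-in str.casefold(). It assumes the 
Unicode data bundled with the Python interpreter, which is not 
necessarily Unicode 4.0, and the sample characters are chosen only for 
illustration:

```python
# Sample characters: one ASCII letter and three from other blocks.
samples = {
    "ascii_I": "I",            # U+0049 LATIN CAPITAL LETTER I
    "sharp_s": "\u00df",       # U+00DF LATIN SMALL LETTER SHARP S
    "dotted_I": "\u0130",      # U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
    "kelvin": "\u212a",        # U+212A KELVIN SIGN
}

# Apply locale-independent Unicode full case folding to each sample.
folded = {name: ch.casefold() for name, ch in samples.items()}

for name, result in folded.items():
    # Print the code points so multi-character results are visible.
    print(name, [f"U+{ord(c):04X}" for c in result])
```

The ASCII letter folds to a single ASCII letter. The other samples 
either expand to more than one code point or fold into the ASCII range, 
which is the kind of behavior a hand-rolled ASCII-only routine would 
not reproduce.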

> EH> In this context, I think English rules make sense,
> 
> These are not 'the English rules'. Unless English somehow acquired
> Deseret, Greek, Cyrillic and Armenian while I was not looking. These are
> the Universal Character set rules, which are entirely appropriate for
> syntactic items like URIs, language tags, and so forth.
> 

By "English rules" I mean convert ASCII according to English and don't 
convert anything else. Asking every implementer of chunk equality to 
carry around the Unicode case mapping tables or some framework like 
IBM's ICU is a non-starter. It's way too heavyweight for many 
environments, especially when most non-perverse cases will only use 
ASCII. Implementers won't follow the spec. They'll design a simple 
algorithm that works for ASCII and, if they're lucky, doesn't actively 
corrupt data in other character blocks.

Rather than hoping that every XML developer in the world is going to 
become an expert in internationalization and Unicode arcana overnight, I 
suggest the finding simply state that ASCII letters are case folded in 
the normal English way (or you can call it "the way prescribed by the 
Unicode case mapping tables" if you prefer, even though that's really 
just English with a more politically acceptable name) before comparison, 
and all other characters are compared by code point with no case folding.
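
A minimal sketch of the rule suggested above, assuming the two values 
are already available as decoded strings. The function names 
ascii_fold and equal_ascii_case_insensitive are hypothetical and do not 
come from the finding:

```python
def ascii_fold(text: str) -> str:
    """Fold only A-Z to a-z; leave every other code point untouched."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in text
    )


def equal_ascii_case_insensitive(a: str, b: str) -> bool:
    """Compare after ASCII-only folding; non-ASCII compares by code point."""
    return ascii_fold(a) == ascii_fold(b)


if __name__ == "__main__":
    print(equal_ascii_case_insensitive("en-US", "EN-us"))
    print(equal_ascii_case_insensitive("\u00c9", "\u00e9"))
```

The first comparison differs only in ASCII case and so matches. The 
second compares U+00C9 with U+00E9, which are left unfolded and 
therefore differ. No case mapping tables are needed.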

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim

Received on Wednesday, 27 October 2004 10:40:41 UTC