- From: Elliotte Harold <elharo@metalab.unc.edu>
- Date: Wed, 27 Oct 2004 06:40:38 -0400
- To: Chris Lilley <chris@w3.org>
- CC: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
Chris Lilley wrote:

> EH> Case conversion is fundamentally a locale sensitive operation. The
> EH> question of which characters are uppercase variants of which
> EH> characters depends on language.
>
> For natural language processing, yes, which this is not.

All case conversion is natural language processing. You can no more convert case without reference to a language than you can spell check. There is no divinely ordained case mapping that applies irrespective of natural language.

> Or not introduce the locale into the processing model in the first
> place, thus giving even better consistency.

Impossible. The question of case mapping is meaningless without reference to specific languages and character repertoires.

> EH> There is no fundamental rule that says I is the
> EH> uppercase form of i.
>
> I believe I pointed to one, and explained its relevance to the current
> specific situation. Again, you deleted the reference and didn't discuss
> it.

I'm not sure what you're referring to here. I went back and read your messages and I still see no fundamental rule that says I is the uppercase form of i.

> That would be before the Unicode case folding tables, then.

The Unicode case folding tables are at least consistent across ASCII. They get trickier in some of the other character blocks. However, the finding does not reference these tables. If that is its intent, then it needs to do so explicitly, and say something like "Languages are compared after case folding according to the Unicode 4.0 case mapping tables". My wording isn't good enough, but you get the idea.

> EH> In this context, I think English rules make sense,
>
> These are not 'the English rules'. Unless English somehow acquired
> Deseret, Greek, Cyrillic and Armenian while I was not looking. These are
> the Universal Character set rules, which are entirely appropriate for
> syntactic items like URIs, language tags, and so forth.

By "English rules" I mean convert ASCII according to English and don't convert anything else. Asking every implementer of chunk equality to carry around the Unicode case mapping tables or some framework like IBM's ICU is a non-starter. It's way too heavyweight for many environments, especially when most non-perverse cases will only use ASCII.

Implementers won't follow the spec. They'll design a simple algorithm that works for ASCII and, if they're lucky, doesn't actively corrupt data in other character blocks.

Rather than hoping that every XML developer in the world is going to become an expert in internationalization and Unicode arcana overnight, I suggest the finding simply state that ASCII letters are case folded in the normal English way (or you can call it "the way prescribed by the Unicode case mapping tables" if you prefer, even though that's really just English with a more politically acceptable name) before comparison, and all other characters are compared by code point with no case folding.

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim
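For what it's worth, the comparison suggested above (fold ASCII letters, compare everything else by code point) might look roughly like the following. This is only an illustrative sketch in Python, not wording or code from the finding; the names `ascii_fold` and `tags_equal` are made up for the example.

```python
def ascii_fold(s: str) -> str:
    """Lowercase the ASCII letters A-Z only; every other code point is left untouched."""
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c  # 'A'..'Z' -> 'a'..'z'
        for c in s
    )


def tags_equal(a: str, b: str) -> bool:
    """Compare two strings after ASCII-only folding (hypothetical helper).

    Non-ASCII characters are compared by code point with no case folding.
    """
    return ascii_fold(a) == ascii_fold(b)


if __name__ == "__main__":
    print(tags_equal("en-US", "EN-us"))      # ASCII letters differ only by case
    print(tags_equal("\u00c9", "\u00e9"))    # E-acute vs e-acute: not folded
```

The sketch deliberately avoids Python's built-in `str.lower()`, which also applies Unicode case mappings to non-ASCII characters and so would not match the "ASCII only" rule proposed here.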
Received on Wednesday, 27 October 2004 10:40:41 UTC