Re: Internationalized CLASS attributes
On Oct 17, 9:21am, Jonathan Rosenne wrote:
> Bert Bos wrote:
> > However, there is a problem: a conflict between case-insensitivity and
> > allowing non-ASCII characters.
> I don't believe there is added value in case-insensitivity this day and
> age. [...] I suggest that the class names should be defined as case
> But there is another problem with internationalized names: UCS defines a
> non-unique coding. Some composite characters have at least two valid
> representations, the composed character and the base character followed
> by diacritics.
True, and there are also multiple representations because of the
compatibility zone. So, for example, U+0627 is the Arabic letter Alef,
but so are U+FE8D (Alef isolated form) and U+FE8E (Alef final form), the
latter two being in the compatibility zone which has all (two or four)
Now I concede that the contextual forms are glyph identifiers, not
characters, and have no real business being in a coded character set
standard in the first place, but there we are.
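
To make the two kinds of duplicate representation concrete, here is a
minimal sketch in Python using the standard unicodedata module. It
assumes Unicode normalization forms (NFC and NFKC) as the way to
reconcile the encodings; the thread itself does not propose them, and
the helper same_name and the sample string "café" are hypothetical,
chosen only for illustration.

```python
import unicodedata

def same_name(a: str, b: str, form: str) -> bool:
    """Compare two names after applying the given normalization form.

    Hypothetical helper: `form` is "NFC" (canonical) or "NFKC"
    (canonical plus compatibility).
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Composed character vs. base character followed by a diacritic.
composed = "caf\u00e9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Alef and two of its contextual forms from the compatibility zone.
alef = "\u0627"
alef_isolated = "\ufe8d"
alef_final = "\ufe8e"

print(composed == decomposed)                   # raw code point comparison
print(same_name(composed, decomposed, "NFC"))   # canonical equivalents
print(same_name(alef, alef_isolated, "NFC"))    # NFC keeps contextual forms
print(same_name(alef, alef_isolated, "NFKC"))   # NFKC maps them to U+0627
print(same_name(alef, alef_final, "NFKC"))
```

Canonical normalization alone is enough for the composed/decomposed
case, but the contextual Alef forms only collapse under the
compatibility form, so a matching rule for class names would have to
say which of the two it means.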
Another wild gotcha which I discovered while flipping through the
Unicode books: U+10A0 to U+10C5 is the archaic Georgian uppercase
alphabet, while U+10D0 to U+10F0 serves as both the archaic Georgian
lowercase alphabet and the modern Georgian alphabet, which is
unicameral (has no case).
The Unicode case table says "Note: the modern Georgian alphabet is
effectively caseless. Georgian SMALL LETTERs should not be upper
cased to CAPITAL LETTERs."
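
One way to honour that note is sketched below in Python. It assumes
the U+10D0 to U+10F0 range exactly as quoted in this message; the
function upper_except_georgian and the constant GEORGIAN_MODERN are
hypothetical names, not part of any standard or library.

```python
# Range for the modern Georgian letters, as quoted in this message
# (an assumption: not checked against the current Unicode tables).
GEORGIAN_MODERN = range(0x10D0, 0x10F0 + 1)

def upper_except_georgian(name: str) -> str:
    """Uppercase a name but leave modern Georgian letters untouched."""
    return "".join(
        ch if ord(ch) in GEORGIAN_MODERN else ch.upper()
        for ch in name
    )

# Latin letters are uppercased; the Georgian letter passes through as-is.
sample = "abc\u10d0"
print(upper_except_georgian(sample) == "ABC\u10d0")
```

Special-casing a range like this is fragile, which is one argument for
avoiding uppercase mapping altogether when comparing names.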
Another relevant quote from the Unicode standard, on the subject of case
"Because there are many more lowercase forms than there are upper, it is
recommended that the lowercase be used for normalisation rather than the
uppercase, such as when strings are case-folded for loose comparison or
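
A small Python sketch of why folding to lowercase loses less than
folding to uppercase. The helper names loose_equal_lower and
loose_equal_upper are hypothetical, and the dotless i (U+0131) is
simply one convenient example of two lowercase letters that share a
single uppercase form.

```python
def loose_equal_lower(a: str, b: str) -> bool:
    """Loose comparison by folding both names to lowercase."""
    return a.lower() == b.lower()

def loose_equal_upper(a: str, b: str) -> bool:
    """Loose comparison by folding both names to uppercase (for contrast)."""
    return a.upper() == b.upper()

dotted = "i"            # U+0069 LATIN SMALL LETTER I
dotless = "\u0131"      # U+0131 LATIN SMALL LETTER DOTLESS I

# Ordinary mixed-case names match under lowercase folding.
print(loose_equal_lower("NavBar", "navbar"))

# Both letters uppercase to "I", so uppercase folding conflates them,
# while lowercase folding keeps them distinct.
print(loose_equal_upper(dotted, dotless))
print(loose_equal_lower(dotted, dotless))
```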
Chris Lilley, W3C [ http://www.w3.org/ ]
Graphics and Fonts Guy The World Wide Web Consortium
http://www.w3.org/people/chris/ INRIA, Projet W3C
firstname.lastname@example.org 2004 Rt des Lucioles / BP 93
+33 93 65 79 87 06902 Sophia Antipolis Cedex, France