- From: Elliotte Harold <elharo@metalab.unc.edu>
- Date: Tue, 26 Oct 2004 21:24:26 -0400
- To: Chris Lilley <chris@w3.org>
- CC: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
Chris Lilley wrote:

> Which the foregoing shows we must, if we conform to the XML
> specification

The XML specification does *not* require that the value of an xml:lang attribute be ASCII. Well-formed XML documents with meaningful infosets can have xml:lang="Français". There is no reason to restrict chunk equality to documents that use RFC 3066 language tags.

However, even sticking to ASCII we still have a problem because the conversion of i and I is locale dependent in theory and in practice. Other characters are problematic as well. In some locales, mappings are not 1-1. For instance, in French sometimes the lower case form of E is é and sometimes it's e (details vary by country and context).

> That is a good point, and something to beware of for implementors.
> However, it is only a problem if locale settings are used in places
> where they are unwarranted. (I have seen similar problems where the
> floating point number 3.5 was converted to the string "3,5" in a French
> locale, thus causing javascript to misbehave. The problem there is the
> use of a locale for a conversion that should not, in this case, be
> locale dependent.)

Case conversion is fundamentally a locale sensitive operation. The question of which characters are uppercase variants of which characters depends on language. For consistent behavior a specification that allows case insensitivity must define the locale according to which case mappings are calculated. There is no fundamental rule that says I is the uppercase form of i. That is the case mapping used by English, but not by all languages.

As currently written, this finding is underspecified. It needs to identify the rules by which different characters are compared.

(The difficulty of converting case across multiple locales, languages, and character sets was the primary reason XML was made case sensitive. The earliest draft specs were case insensitive but this proved problematic for localization and internationalization.)

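To make the locale dependence concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not anything the finding specifies: `LOWER_OVERRIDES`, `lower_in_locale`, and `equal_ignoring_case` are hypothetical names, and only the Turkish dotted/dotless i rule is modeled, because Python's built-in `str.lower()` applies the locale-independent Unicode default mappings.

```python
# Hypothetical per-locale lowercasing overrides (an assumption for this
# sketch). Only the Turkish i rule is modeled; real locale data is larger.
LOWER_OVERRIDES = {
    "en": {},
    "tr": {"I": "\u0131", "\u0130": "i"},  # I -> dotless i, dotted capital I -> i
}


def lower_in_locale(text: str, locale: str) -> str:
    """Lowercase text, applying the locale's overrides before the default mapping."""
    overrides = LOWER_OVERRIDES[locale]
    return "".join(overrides.get(ch, ch.lower()) for ch in text)


def equal_ignoring_case(a: str, b: str, locale: str) -> bool:
    """Compare two strings case-insensitively under the given locale's rules."""
    return lower_in_locale(a, locale) == lower_in_locale(b, locale)


# The same two ASCII-only xml:lang values compare differently by locale.
print(equal_ignoring_case("FI", "fi", "en"))  # True
print(equal_ignoring_case("FI", "fi", "tr"))  # False

# Default Unicode mappings are not always one-to-one either.
print("\u00df".upper())  # SS
```

The point of the sketch is only that "case-insensitive" has no single answer until the mapping rules are named: the comparison of "FI" and "fi" succeeds under the English rule and fails under the Turkish one.
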
In this context, I think English rules make sense, but the finding needs to say that's what it means, not just assume it. Furthermore, it should probably state that non-ASCII characters are compared code point by code point, and are to that extent case sensitive. Yes, I know there shouldn't be such characters in an xml:lang attribute, but the spec doesn't forbid it, so the finding needs to handle it.

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim

Received on Wednesday, 27 October 2004 01:24:29 UTC