Re: XML Chunk Equality from Elliotte Harold on 2004-10-27 (www-tag@w3.org from October 2004)

From: Elliotte Harold <elharo@metalab.unc.edu>
Date: Tue, 26 Oct 2004 21:24:26 -0400
To: Chris Lilley <chris@w3.org>
CC: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
Message-ID: <417EF8CA.5020709@metalab.unc.edu>

Chris Lilley wrote:

> Which the foregoing shows we must, if we conform to the XML
> specification

The XML specification does *not* require that the value of an xml:lang 
attribute be ASCII. Well-formed XML documents with meaningful infosets 
can have xml:lang="Français". There is no reason to restrict chunk 
equality to documents that use RFC 3066 language tags.

However, even sticking to ASCII we still have a problem, because the
conversion between i and I is locale dependent in theory and in practice.
Other characters are problematic as well. In some locales, mappings are
not 1-1. For instance, in French the lowercase form of E is sometimes é
and sometimes e (details vary by country and context).
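
To illustrate the point about mappings that are not 1-1, here is a
minimal sketch. Python is just a convenient choice for illustration; its
str.upper() and str.lower() apply the locale-independent Unicode default
mappings, and even those do not round-trip:

```python
# Python's str methods use the Unicode default case mappings, not the locale.
sharp_s = "\u00df"    # LATIN SMALL LETTER SHARP S
dotted_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

upper_sharp_s = sharp_s.upper()     # one character becomes two: "SS"
lower_dotted_I = dotted_I.lower()   # "i" followed by U+0307 COMBINING DOT ABOVE

# Upper-casing and then lower-casing does not return the original string.
round_trip = sharp_s.upper().lower()

print(len(upper_sharp_s), len(lower_dotted_I), round_trip == sharp_s)
# prints: 2 2 False
```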

> That is a good point, and something to beware of for implementors.
> However, it is only a problem if locale settings are used in places
> where they are unwarranted. (I have seen similar problems where the
> floating point number 3.5 was converted to the string "3,5" in a French
> locale, thus causing javascript to misbehave. The problem there is the
> use of a locale for a conversion that should not, in this case, be
> locale dependent.)

Case conversion is fundamentally a locale-sensitive operation. The
question of which characters are uppercase variants of which characters
depends on language. For consistent behavior, a specification that
allows case insensitivity must define the locale according to which case
mappings are calculated. There is no fundamental rule that says I is the
uppercase form of i. That is the case mapping used by English, but not
by all languages. As currently written, this finding is underspecified.
It needs to identify the rules by which different characters are compared.
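
As a rough sketch of why the locale has to be named, the comparison
below gives different answers for the same two strings depending on
which rules are chosen. The UPPER_OVERRIDES table, the locale keys "en"
and "tr", and both function names are hypothetical and deliberately
tiny; real locale data is much larger.

```python
# Hypothetical per-locale exceptions to the Unicode default uppercase mapping.
UPPER_OVERRIDES = {
    "en": {},                              # English: i -> I (the default)
    "tr": {"i": "\u0130", "\u0131": "I"},  # Turkish: i -> dotted I, dotless i -> I
}

def to_upper(text: str, locale: str) -> str:
    """Uppercase text, applying the named locale's overrides first."""
    overrides = UPPER_OVERRIDES[locale]
    return "".join(overrides.get(ch, ch.upper()) for ch in text)

def equal_ignoring_case(a: str, b: str, locale: str) -> bool:
    """Case-insensitive equality under an explicitly named locale."""
    return to_upper(a, locale) == to_upper(b, locale)

print(equal_ignoring_case("it", "IT", "en"))  # True
print(equal_ignoring_case("it", "IT", "tr"))  # False
```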

(The difficulty of converting case across multiple locales, languages,
and character sets was the primary reason XML was made case sensitive.
The earliest draft specs were case insensitive, but this proved
problematic for localization and internationalization.)

In this context, I think English rules make sense, but the finding needs
to say that's what it means, not just assume it. It should probably also
state that non-ASCII characters are compared code point by code point,
and are to that extent case sensitive. Yes, I know there shouldn't be
such characters in an xml:lang attribute, but the spec doesn't forbid
them, so the finding needs to handle them.
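
One possible reading of that rule, sketched in Python under stated
assumptions: only A-Z and a-z are folded, every other code point must
match exactly, and no Unicode normalization is applied. The names
ascii_lower and lang_equal are made up for this example and do not come
from the finding.

```python
def ascii_lower(ch: str) -> str:
    """Lowercase A-Z only; every other code point is returned unchanged."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 32)
    return ch

def lang_equal(a: str, b: str) -> bool:
    """Compare two xml:lang values: ASCII letters case-insensitively,
    all other characters code point by code point."""
    if len(a) != len(b):
        return False
    return all(ascii_lower(x) == ascii_lower(y) for x, y in zip(a, b))

print(lang_equal("en-US", "EN-us"))        # True
print(lang_equal("Français", "français"))  # True: only the ASCII F differs in case
print(lang_equal("Français", "FRANÇAIS"))  # False: the two cedilla letters differ
```

Because nothing here consults the platform locale, the result is the
same on every system, which is the consistency the finding would need.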

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim

Received on Wednesday, 27 October 2004 01:24:29 UTC