Re: XML Chunk Equality from Chris Lilley on 2004-10-26 (www-tag@w3.org from October 2004)

From: Chris Lilley <chris@w3.org>
Date: Wed, 27 Oct 2004 01:12:28 +0200
To: Elliotte Harold <elharo@metalab.unc.edu>
Cc: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
Message-ID: <142374996.20041027011228@w3.org>

On Tuesday, October 26, 2004, 11:36:28 PM, Elliotte wrote:

EH> Chris Lilley wrote:

>> ERH> What's probably intended here is that languages are compared case
>> ERH> insensitively within the ASCII range using English case mappings.
>> 
>> No; what is intended here is that *language tags* are compared case
>> insensitively. xml:lang="en" and xml:lang="EN" denote the same language.
>> Since the intent has clearly been misunderstood, the finding should be
>> clarified to say 'language tags are ...'

EH> I'm sorry. This is relevant.

Okay, let me show why I don't think it is.

EH> First of all, language tags should but do
EH> not have to be ISO 639 language tags.

I agree that they don't have to be 639-1 or 639-2 tags. The first token
often is; the second one typically isn't, and any subsequent tokens are
unlikely to be.

EH> Although some early parsers were
EH> confused about this, xml:lang="Français" is well-formed.

But not conformant to RFC 3066 or to XML.

>> The values of the attribute are language identifiers as defined by
>> [IETF RFC 3066], Tags for the Identification of Languages, or its
>> successor; in addition, the empty string MAY be specified.
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-lang-tag

From RFC 3066:

    Language-Tag = Primary-subtag *( "-" Subtag )

    Primary-subtag = 1*8ALPHA

    Subtag = 1*8(ALPHA / DIGIT)

We are very securely in ASCII-only territory here.
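
To make the ASCII-only point concrete, here is a minimal sketch in Java of a check against the three productions quoted above. The class and method names are hypothetical, chosen only for illustration. The sketch tests syntax only, not whether the subtags are registered, and it does not handle the empty string that XML additionally permits.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of RFC 3066 or of XML.
final class Rfc3066Syntax {
    // Language-Tag   = Primary-subtag *( "-" Subtag )
    // Primary-subtag = 1*8ALPHA
    // Subtag         = 1*8(ALPHA / DIGIT)
    // Explicit ASCII ranges are spelled out so that no Unicode-aware
    // character class can let non-ASCII letters through.
    private static final Pattern LANGUAGE_TAG =
            Pattern.compile("[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*");

    // True if the whole value matches the Language-Tag production.
    static boolean matchesGrammar(String value) {
        return LANGUAGE_TAG.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesGrammar("en"));             // true
        System.out.println(matchesGrammar("en-US"));          // true
        System.out.println(matchesGrammar("Fran\u00E7ais"));  // false: U+00E7 is not ALPHA
    }
}
```

Under this check xml:lang="Français" is rejected because of the single non-ASCII letter, even though the attribute value is perfectly well-formed XML.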

EH> Secondly, even if we stick to ASCII

Which the foregoing shows we must, if we conform to the XML
specification

EH> this is an issue. Consider xml:lang="it". This is the same as
EH> xml:lang="IT" when compared in an English locale but not when
EH> compared in a Turkish locale. In Java, "it".equalsIgnoreCase("IT")
EH> is *false* in Turkey.

That is a good point, and something for implementors to beware of.
However, it is only a problem if locale settings are used in places
where they are unwarranted. (I have seen similar problems where the
floating point number 3.5 was converted to the string "3,5" in a French
locale, thus causing JavaScript to misbehave. The problem there is the
use of a locale for a conversion that should not, in this case, be
locale dependent.)
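
A small Java sketch of both pitfalls, for illustration only; the class and method names are hypothetical. One caveat: as far as I know, String.equalsIgnoreCase itself compares character by character without consulting the default locale, so the sketch uses toUpperCase(Locale), which does apply the Turkish dotted/dotless i rules, to make the effect visible.

```java
import java.util.Locale;

// Hypothetical demo class; the names are illustrative only.
final class LocaleCaseDemo {
    // Compares two strings after upper-casing both under the given locale.
    static boolean sameUpperCase(String a, String b, Locale locale) {
        return a.toUpperCase(locale).equals(b.toUpperCase(locale));
    }

    // Formats a number with one decimal place under the given locale.
    static String formatNumber(double value, Locale locale) {
        return String.format(locale, "%.1f", value);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // In Turkish, 'i' upper-cases to U+0130 (capital I with dot above),
        // so "it" becomes "\u0130T", which is not equal to "IT".
        System.out.println(sameUpperCase("it", "IT", turkish));      // false
        System.out.println(sameUpperCase("it", "IT", Locale.ROOT));  // true

        // The same kind of mistake with numbers: a locale-dependent
        // conversion used where a locale-neutral one was wanted.
        System.out.println(formatNumber(3.5, Locale.FRANCE));        // 3,5
        System.out.println(formatNumber(3.5, Locale.ROOT));          // 3.5
    }
}
```

In both cases the data is fine; the unexpected result comes from passing a user-facing locale (or relying on the default one) to an operation on machine-readable text.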

Since XML does not have an SGML declaration (that can be changed by the
user) there is a single document character set - UCS. So xml:lang="it"
consists of two characters from the UCS, LATIN SMALL LETTER I and LATIN
SMALL LETTER T. The Unicode case tables show how to convert these to
upper case or to convert LATIN CAPITAL LETTER I to lower case, as well as
how to convert them to title case and how to do case folding.

See http://www.unicode.org/charts/case/
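
Since language tags are restricted to ASCII, one way to stay clear of locale settings altogether is to fold only the letters A-Z by hand. The following Java sketch assumes exactly that; the class and method names are hypothetical, and it is one possible approach rather than anything the XML or RFC 3066 text prescribes.

```java
import java.util.Locale;

// Hypothetical helper; illustrative only.
final class LanguageTagCompare {
    // Lower-cases A-Z and leaves every other character untouched.
    // No locale is consulted.
    static char asciiLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    // Case-insensitive comparison of two language tags, ASCII range only.
    static boolean sameTag(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (asciiLower(a.charAt(i)) != asciiLower(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Even with a Turkish default locale the comparison is unaffected,
        // because asciiLower never looks at the locale.
        Locale.setDefault(Locale.forLanguageTag("tr-TR"));
        System.out.println(sameTag("it", "IT"));        // true
        System.out.println(sameTag("en-US", "EN-us"));  // true
        System.out.println(sameTag("it", "\u0130T"));   // false: U+0130 is not ASCII
    }
}
```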

-- 
 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 Member, W3C Technical Architecture Group

Received on Tuesday, 26 October 2004 23:12:28 UTC