- From: Chris Lilley <chris@w3.org>
- Date: Wed, 27 Oct 2004 01:12:28 +0200
- To: Elliotte Harold <elharo@metalab.unc.edu>
- Cc: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
On Tuesday, October 26, 2004, 11:36:28 PM, Elliotte wrote:
EH> Chris Lilley wrote:
>> ERH> What's probably intended here is that languages are compared case
>> ERH> insensitively within the ASCII range using English case mappings.
>>
>> No; what is intended here is that *language tags* are compared case
>> insensitively. xml:lang="en" and xml:lang="EN" denote the same language.
>> Since the intent has clearly been misunderstood, the finding should be
>> clarified to say 'language tags are ...'
EH> I'm sorry. This is relevant.
Okay, let me show why I don't think it is.
EH> First of all, language tags should but do
EH> not have to be ISO 639 language tags.
I agree that they don't have to be 639-1 or 639-2 tags. The first token
often is; the second one typically isn't, and any subsequent tokens are
unlikely to be.
EH> Although some early parsers were
EH> confused about this, xml:lang="Français" is well-formed.
But not conformant to RFC 3066, and so not to XML, which says:
>> The values of the attribute are language identifiers as defined by
>> [IETF RFC 3066], Tags for the Identification of Languages, or its
>> successor; in addition, the empty string MAY be specified.
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-lang-tag
From RFC 3066:
Language-Tag = Primary-subtag *( "-" Subtag )
Primary-subtag = 1*8ALPHA
Subtag = 1*8(ALPHA / DIGIT)
We are very securely in ASCII-only territory here.
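To make the grammar concrete, here is a minimal Java sketch that checks an
xml:lang value against the three productions quoted above. The class and
method names are made up for illustration, and it checks syntax only; it
does not look anything up in a registry.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from any specification or library.
final class LangTagSyntax {
    // Language-Tag   = Primary-subtag *( "-" Subtag )
    // Primary-subtag = 1*8ALPHA
    // Subtag         = 1*8(ALPHA / DIGIT)
    // ALPHA and DIGIT are ASCII only, so plain ASCII character classes suffice.
    private static final Pattern RFC3066 =
        Pattern.compile("[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*");

    // True if the value fits the RFC 3066 grammar, or is the empty
    // string, which the XML Recommendation additionally allows.
    static boolean matchesRfc3066(String value) {
        return value.isEmpty() || RFC3066.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesRfc3066("en-US"));
        // "Français", with the c-cedilla written as an escape
        System.out.println(matchesRfc3066("Fran\u00e7ais"));
    }
}
```

The second call is rejected because U+00E7 is not in ALPHA, which is the
point about being in ASCII-only territory.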
EH> Secondly, even if we stick to ASCII
Which the foregoing shows we must, if we conform to the XML
specification.
EH> this is an issue. Consider xml:lang="it". This is the same as
EH> xml:lang="IT" when compared in an English locale but not when
EH> compared in a Turkish locale. In Java. "it".equalsIgnoreCase("IT")
EH> is *false* in Turkey.
That is a good point, and something to beware of for implementors.
However, it is only a problem if locale settings are used in places
where they are unwarranted. (I have seen similar problems where the
floating point number 3.5 was converted to the string "3,5" in a French
locale, thus causing JavaScript to misbehave. The problem there is the
use of a locale for a conversion that should not, in this case, be
locale dependent.)
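A small Java sketch of the difference, assuming the tags have already been
checked to be ASCII as RFC 3066 requires. The class and the sameTag helper
are hypothetical names. The Turkish locale is passed explicitly so the
result does not depend on the machine's default locale.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
final class LangTagCompare {
    // Compare two language tags case-insensitively without consulting
    // the default locale. Locale.ROOT selects the locale-neutral mappings.
    static boolean sameTag(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Locale-sensitive conversion: under Turkish rules "i" upper-cases
        // to U+0130 (capital I with dot above), so this does not equal "IT".
        System.out.println("it".toUpperCase(turkish).equals("IT")); // false

        // Locale-neutral comparison treats the two tags as the same.
        System.out.println(sameTag("it", "IT")); // true
    }
}
```

As far as I know, the calls to watch in Java are the no-argument
toUpperCase() and toLowerCase(), which use the default locale;
equalsIgnoreCase is documented as not taking the locale into account.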
Since XML does not have an SGML declaration (that can be changed by the
user) there is a single document character set - UCS. So xml:lang="it"
consists of two characters from the UCS, LATIN SMALL LETTER I and LATIN
SMALL LETTER T. The Unicode case tables show how to convert these to
upper case and how to convert LATIN CAPITAL LETTER I to lower case, as
well as how to convert them to title case and how to do case folding.
See http://www.unicode.org/charts/case/
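For what it is worth, the per-character methods on java.lang.Character
apply the default Unicode mappings and take no locale argument. A minimal
sketch (the class and the describe method are invented names, and case
folding is left out because I am not aware of a direct JDK call for it):

```java
// Hypothetical helper for illustration only.
final class DefaultCaseMappings {
    // Report the Unicode name of a character and its default upper,
    // lower and title case mappings, none of which depend on a locale.
    static String describe(char c) {
        return Character.getName(c)
            + ": upper=" + Character.toUpperCase(c)
            + " lower=" + Character.toLowerCase(c)
            + " title=" + Character.toTitleCase(c);
    }

    public static void main(String[] args) {
        System.out.println(describe('i'));
        System.out.println(describe('t'));
        System.out.println(describe('I'));
    }
}
```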
--
Chris Lilley mailto:chris@w3.org
Chair, W3C SVG Working Group
Member, W3C Technical Architecture Group
Received on Tuesday, 26 October 2004 23:12:28 UTC