- From: Chris Lilley <chris@w3.org>
- Date: Wed, 27 Oct 2004 01:12:28 +0200
- To: Elliotte Harold <elharo@metalab.unc.edu>
- Cc: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
On Tuesday, October 26, 2004, 11:36:28 PM, Elliotte wrote:

EH> Chris Lilley wrote:

>> ERH> What's probably intended here is that languages are compared case
>> ERH> insensitively within the ASCII range using English case mappings.
>>
>> No; what is intended here is that *language tags* are compared case
>> insensitively. xml:lang="en" and xml:lang="EN" denote the same language.
>> Since the intent has clearly been misunderstood, the finding should be
>> clarified to say 'language tags are ...'

EH> I'm sorry. This is relevant.

Okay, let me show why I don't think it is.

EH> First of all, language tags should but do
EH> not have to be ISO 639 language tags.

I agree that they don't have to be 639-1 or 639-2 tags. The first token
often is; the second one typically isn't, and any subsequent tokens are
unlikely to be.

EH> Although some early parsers were
EH> confused about this, xml:lang="Français" is well-formed.

But not conformant to RFC 3066 or to XML.

>> The values of the attribute are language identifiers as defined by
>> [IETF RFC 3066], Tags for the Identification of Languages, or its
>> successor; in addition, the empty string MAY be specified.

http://www.w3.org/TR/2004/REC-xml-20040204/#sec-lang-tag

From RFC 3066

   Language-Tag = Primary-subtag *( "-" Subtag )
   Primary-subtag = 1*8ALPHA
   Subtag = 1*8(ALPHA / DIGIT)

We are very securely in ASCII-only territory here.

EH> Secondly, even if we stick to ASCII

Which the foregoing shows we must, if we conform to the XML specification

EH> this is an issue. Consider xml:lang="it". This is the same as
EH> xml:lang="IT" when compared in an English locale but not when
EH> compared in a Turkish locale. In Java. "it".equalsIgnoreCase("IT")
EH> is *false* in Turkey.

That is a good point, and something to beware of for implementors.
However, it is only a problem if locale settings are used in places
where they are unwarranted.
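To make the pitfall concrete, here is a minimal Java sketch; it is an illustration, not code from anyone's implementation. The class and method names (LangTagCompare, asciiLower, sameTag, sameTagInLocale) are hypothetical. It assumes the tags are ASCII-only, as the RFC 3066 grammar above requires, and folds only A-Z by hand so that no locale is consulted. For contrast it also compares via String.toUpperCase(Locale), which is the locale-sensitive call in Java; as far as I know, String.equalsIgnoreCase itself is documented as not taking the locale into account, so the sketch uses toUpperCase to show the Turkish behaviour.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
final class LangTagCompare {

    // Fold ASCII A-Z to a-z and leave every other character untouched.
    // No locale is consulted, so the result is the same on every system.
    static String asciiLower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + ('a' - 'A')) : c);
        }
        return sb.toString();
    }

    // Locale-independent comparison of two language tags.
    static boolean sameTag(String a, String b) {
        return asciiLower(a).equals(asciiLower(b));
    }

    // Comparison that (unwisely) routes through a locale-sensitive case mapping.
    static boolean sameTagInLocale(String a, String b, Locale locale) {
        return a.toUpperCase(locale).equals(b.toUpperCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // ASCII-only fold: "it" and "IT" match.
        System.out.println(sameTag("it", "IT"));                      // true

        // Turkish mapping turns 'i' into U+0130 (dotted capital I), so no match.
        System.out.println(sameTagInLocale("it", "IT", turkish));     // false

        // The root locale applies the locale-neutral mapping, so they match again.
        System.out.println(sameTagInLocale("it", "IT", Locale.ROOT)); // true
    }
}
```

The design choice is simply to keep the locale out of the comparison: either fold the ASCII letters explicitly, as sameTag does, or pass a fixed locale such as Locale.ROOT rather than relying on the platform default.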
(I have seen similar problems where the floating point number 3.5 was
converted to the string "3,5" in a French locale, thus causing
javascript to misbehave. The problem there is the use of a locale for a
conversion that should not, in this case, be locale dependent.)

Since XML does not have an SGML declaration (that can be changed by the
user) there is a single document character set - UCS. So xml:lang="it"
consists of two characters from the UCS, LATIN SMALL LETTER I and LATIN
SMALL LETTER T. The Unicode case tables show how to convert these to
upper case or to convert LATIN CAPITAL LETTER I to lower case, also how
to convert them to title case and how to do case folding.

See http://www.unicode.org/charts/case/

-- 
Chris Lilley mailto:chris@w3.org
Chair, W3C SVG Working Group
Member, W3C Technical Architecture Group
Received on Tuesday, 26 October 2004 23:12:28 UTC