- From: Chris Lilley <chris@w3.org>
- Date: Wed, 27 Oct 2004 05:30:11 +0200
- To: Elliotte Harold <elharo@metalab.unc.edu>
- Cc: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
On Wednesday, October 27, 2004, 3:24:26 AM, Elliotte wrote:

EH> Chris Lilley wrote:
>> Which the foregoing shows we must, if we conform to the XML
>> specification

EH> The XML specification does *not* require that the value of an xml:lang
EH> attribute be ASCII. Well-formed XML documents with meaningful infosets
EH> can have xml:lang="Français".

You will need to demonstrate that, with reference to the productions of
XML, before I can accept it. Currently I consider it an erroneous
assertion. I understand that you would like XML to be that way, but you
need to demonstrate that it is.

EH> There is no reason to restrict chunk
EH> equality to documents that use RFC 3066 language tags.

You deleted my quotation from the XML spec that said exactly that -
xml:lang takes an RFC 3066 language tag, or "". I don't find ignoring
the quote to be convincing argument.

EH> However, even sticking to ASCII we still have a problem because the
EH> conversion of i and I is locale dependent in theory and in practice.
EH> Other characters are problematic as well. In some locales, mappings are
EH> not 1-1. For instance, in French sometimes the lower case form of E is é
EH> and sometimes it's e (details vary by country and context).

>> That is a good point, and something to beware of for implementors.
>> However, it is only a problem if locale settings are used in places
>> where they are unwarranted. (I have seen similar problems where the
>> floating point number 3.5 was converted to the string "3,5" in a French
>> locale, thus causing javascript to misbehave. The problem there is the
>> use of a locale for a conversion that should not, in this case, be
>> locale dependent.)

EH> Case conversion is fundamentally a locale sensitive operation. The
EH> question of which characters are uppercase variants of which
EH> characters depends on language.

For natural language processing, yes, which this is not.
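
To make the distinction concrete, here is a minimal sketch of a comparison that never consults a locale. It assumes the RFC 3066 grammar (Primary-subtag = 1*8ALPHA, Subtag = 1*8(ALPHA / DIGIT)) and folds only the letters A-Z. The function names are hypothetical and chosen for illustration; the empty string that xml:lang also permits is not handled here.

```python
import re

# RFC 3066: Language-Tag = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
# ALPHA is ASCII A-Z / a-z only, so no re.IGNORECASE or Unicode classes.
_LANG_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_rfc3066_tag(value: str) -> bool:
    """Return True if value matches the RFC 3066 language tag syntax."""
    return _LANG_TAG.fullmatch(value) is not None


def ascii_lower(value: str) -> str:
    """Lower-case A-Z only; every other character is left untouched.

    No locale is involved, so 'I' always maps to 'i'.
    """
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in value
    )


def same_language_tag(a: str, b: str) -> bool:
    """Compare two tags case-insensitively, independent of any locale."""
    return ascii_lower(a) == ascii_lower(b)
```

Under these assumptions, "en-GB" and "EN-gb" compare equal, while "Français" is rejected by the syntax check before any case question arises.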
EH> For consistent behavior a specification that
EH> allows case insensitivity must define the locale according to which case
EH> mappings are calculated.

Or not introduce the locale into the processing model in the first
place, thus giving even better consistency.

EH> There is no fundamental rule that says I is the
EH> uppercase form of i.

I believe I pointed to one, and explained its relevance to the current
specific situation. Again, you deleted the reference and didn't discuss
it.

EH> That is the case mapping used by English, but not
EH> by all languages. As currently written, this finding is underspecified.
EH> It needs to identify the rules by which different characters are compared.

EH> (The difficulty of converting case across multiple locales, languages,
EH> and character sets was the primary reason XML was made case sensitive.
EH> The earliest draft specs were case insensitive but this proved
EH> problematic for localization and internationalization.)

That would be before the Unicode case folding tables, then.

EH> In this context, I think English rules make sense,

These are not 'the English rules'. Unless English somehow acquired
Deseret, Greek, Cyrillic and Armenian while I was not looking. These are
the Universal Character set rules, which are entirely appropriate for
syntactic items like URIs, language tags, and so forth.

EH> but the finding needs
EH> to say that's what it means; not just assume it. Furthermore it should
EH> probably state further that non-ASCII characters are compared code point
EH> by code point, and are to that extent case sensitive. Yes, I know there
EH> shouldn't be such characters in an xml:lang attribute but the spec
EH> doesn't forbid it, so the finding needs to handle it.

The spec does appear to forbid it. RFC 3066 forbids it.

-- 
 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 Member, W3C Technical Architecture Group
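
As an illustration of locale-independent case folding across scripts, the sketch below uses Python's built-in str.casefold(), which applies Unicode case folding and takes no locale argument. The helper name and the sample pairs are assumptions chosen for illustration; they are not taken from the finding under discussion.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using Unicode case folding.

    str.casefold() does not consult the process locale, so the result
    is the same on every system.
    """
    return a.casefold() == b.casefold()


# One upper/lower pair from each script mentioned above, plus Latin.
SAMPLE_PAIRS = [
    ("I", "i"),                     # Latin
    ("\u03A3", "\u03C3"),           # Greek capital / small sigma
    ("\u0414", "\u0434"),           # Cyrillic capital / small de
    ("\u0531", "\u0561"),           # Armenian capital / small ayb
    ("\U00010400", "\U00010428"),   # Deseret capital / small long i
]

if __name__ == "__main__":
    for upper, lower in SAMPLE_PAIRS:
        print(f"U+{ord(upper):04X} ~ U+{ord(lower):04X}:",
              caseless_equal(upper, lower))
```

Each pair compares equal without any language or locale being named, which is the sense in which these are character set rules and not the rules of one language.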
Received on Wednesday, 27 October 2004 03:30:12 UTC