Re: XML Chunk Equality

On Wednesday, October 27, 2004, 3:24:26 AM, Elliotte wrote:

EH> Chris Lilley wrote:


>> Which the foregoing shows we must, if we conform to the XML
>> specification

EH> The XML specification does *not* require that the value of an xml:lang
EH> attribute be ASCII. Well-formed XML documents with meaningful infosets
EH> can have xml:lang="Français".

You will need to demonstrate that, with reference to the productions of
XML, before I can accept it. Currently I consider it an erroneous
assertion. I understand that you would like XML to be that way, but you
need to demonstrate that it is.

EH>  There is no reason to restrict chunk
EH> equality to documents that use RFC 3066 language tags.

You deleted my quotation from the XML spec that said exactly that -
xml:lang takes an RFC 3066 language tag, or "". I don't find ignoring
the quote to be a convincing argument.

EH> However, even sticking to ASCII we still have a problem because the 
EH> conversion of i and I is locale dependent in theory and in practice.
EH> Other characters are problematic as well. In some locales, mappings are
EH> not 1-1. For instance, in French sometimes the lower case form of E is é
EH> and sometimes it's e (details vary by country and context).

>> That is a good point, and something to beware of for implementors.
>> However, it is only a problem if locale settings are used in places
>> where they are unwarranted. (I have seen similar problems where the
>> floating point number 3.5 was converted to the string "3,5" in a French
>> locale, thus causing javascript to misbehave. The problem there is the
>> use of a locale for a conversion that should not, in this case, be
>> locale dependent.)
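
A minimal Python sketch of the distinction drawn above, separating the
string meant for a parser from the string meant for a reader. The helper
names to_wire and to_display are hypothetical, and which locales are
installed depends on the machine, so no particular locale is assumed:

```python
import locale


def to_wire(x: float) -> str:
    """Locale-independent form, for data that another program will parse."""
    # repr() never consults the locale: the decimal separator is always ".".
    return repr(x)


def to_display(x: float) -> str:
    """Locale-dependent form, appropriate only for text shown to a person."""
    # Uses whatever LC_NUMERIC is currently set to. Under a French locale
    # the decimal separator becomes ",", which a script parser rejects.
    return locale.format_string("%g", x)
```

The bug described arises when the second kind of conversion is used where
the first was required.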

EH> Case conversion is fundamentally a locale sensitive operation. The
EH> question of which characters are uppercase variants of which
EH> characters depends on language.

For natural language processing, yes, which this is not.

EH>  For consistent behavior a specification that
EH> allows case insensitivity must define the locale according to which case
EH> mappings are calculated.

Or not introduce the locale into the processing model in the first
place, thus giving even better consistency.
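
As a rough illustration of keeping the locale out of the processing
model, here is a Python sketch that folds only the ASCII letters A-Z and
compares everything else code point by code point. The function names
ascii_lower and lang_tags_equal are mine, not taken from the finding:

```python
def ascii_lower(s: str) -> str:
    """Lowercase A-Z only; every other code point passes through unchanged."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def lang_tags_equal(a: str, b: str) -> bool:
    """Compare two language tags, ignoring ASCII case and nothing else."""
    return ascii_lower(a) == ascii_lower(b)
```

No locale is consulted at any point, so "I" and "i" compare equal on
every system, and the Turkish dotless i (U+0131) is never treated as a
case variant of either.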

EH>  There is no fundamental rule that says I is the
EH> uppercase form of i.

I believe I pointed to one, and explained its relevance to the current
specific situation. Again, you deleted the reference and didn't discuss
it.

EH>  That is the case mapping used by English, but not
EH> by all languages. As currently written, this finding is underspecified.
EH> It needs to identify the rules by which different characters are compared.

EH> (The difficulty of converting case across multiple locales, languages,
EH> and character sets was the primary reason XML was made case sensitive.
EH> The earliest draft specs were case insensitive but this proved 
EH> problematic for localization and internationalization.)

That would be before the Unicode case folding tables, then.

EH> In this context, I think English rules make sense,

These are not 'the English rules', unless English somehow acquired
Deseret, Greek, Cyrillic and Armenian while I was not looking. These are
the Universal Character Set rules, which are entirely appropriate for
syntactic items like URIs, language tags, and so forth.
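
To make the point concrete, a short Python sketch using str.casefold(),
which applies the Unicode case folding mappings without reference to any
locale. The sample pairs and the name fold_equal are only illustrative:

```python
def fold_equal(a: str, b: str) -> bool:
    """Compare two strings after locale-independent Unicode case folding."""
    return a.casefold() == b.casefold()


# One capital/small pair per script, written as escapes to avoid
# any dependence on the encoding of this message.
SAMPLES = {
    "Latin": ("I", "i"),
    "Greek": ("\u0394", "\u03b4"),             # DELTA
    "Cyrillic": ("\u0414", "\u0434"),          # DE
    "Armenian": ("\u0531", "\u0561"),          # AYB
    "Deseret": ("\U00010400", "\U00010428"),   # LONG I
}

results = {script: fold_equal(upper, lower)
           for script, (upper, lower) in SAMPLES.items()}
```

Every pair folds to the same string, whatever locale the process happens
to be running under.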

EH>  but the finding needs
EH> to say that's what it means; not just assume it. Furthermore it should
EH> probably state further that non-ASCII characters are compared code point
EH> by code point, and are to that extent case sensitive. Yes, I know there
EH> shouldn't be such characters in an xml:lang attribute but the spec 
EH> doesn't forbid it, so the finding needs to handle it.

The spec does appear to forbid it. RFC 3066 forbids it.
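
For reference, the RFC 3066 grammar is a primary subtag of 1 to 8 ASCII
letters followed by zero or more hyphen-separated subtags of 1 to 8 ASCII
letters or digits. A Python sketch of that syntax check follows; the
names LANGUAGE_TAG and is_rfc3066_syntax are hypothetical, and the check
covers syntax only, not whether a tag is registered:

```python
import re

# Language-Tag   = Primary-subtag *( "-" Subtag )
# Primary-subtag = 1*8ALPHA
# Subtag         = 1*8(ALPHA / DIGIT)
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_rfc3066_syntax(value: str) -> bool:
    """True if value matches the RFC 3066 Language-Tag production."""
    return LANGUAGE_TAG.fullmatch(value) is not None
```

Under this check "fr" and "fr-CA" pass, while "Français" fails because
the c with cedilla is not an ASCII letter. The empty string also fails
here; XML permits it separately from the RFC 3066 grammar.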





-- 
 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 Member, W3C Technical Architecture Group

Received on Wednesday, 27 October 2004 03:30:12 UTC