Re: XML Chunk Equality from Martin Duerst on 2004-11-02 (www-tag@w3.org from November 2004)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 02 Nov 2004 15:28:19 +0900
To: Chris Lilley <chris@w3.org>, Elliotte Harold <elharo@metalab.unc.edu>
Cc: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
Message-Id: <6.0.0.20.2.20041102150925.055d2960@localhost>

At 12:30 04/10/27, Chris Lilley wrote:

 >That would be before the Unicode case folding tables, then.
 >
 >EH> In this context, I think English rules make sense,
 >
 >These are not 'the English rules'. Unless English somehow acquired
 >Deseret, Greek, Cyrillic and Armenian while I was not looking. These are
 >the Universal Character set rules, which are entirely appropriate for
 >syntactic items like URIs, language tags, and so forth.

The Unicode case folding table(s) are appropriate for some
cases, but not for others. In particular, they are mainly
defined for searching, so they may collapse more than necessary.
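
To make "collapse more than necessary" concrete, here is a minimal sketch
using Python's built-in str.casefold(), which applies full Unicode case
folding. The choice of Python and the helper name folds_equal are
assumptions for illustration only; nothing in this thread prescribes them.

```python
# Full Unicode case folding, as exposed by Python's built-in str.casefold().
# Each sample is a string that folding maps onto plain ASCII letters.
samples = [
    "Stra\u00dfe",  # U+00DF LATIN SMALL LETTER SHARP S folds to "ss"
    "\u212a",       # U+212A KELVIN SIGN folds to ASCII "k"
    "\u017f",       # U+017F LATIN SMALL LETTER LONG S folds to ASCII "s"
]

for original in samples:
    print(ascii(original), "->", ascii(original.casefold()))


def folds_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings under full case folding."""
    return a.casefold() == b.casefold()
```

Under this comparison a string containing the Kelvin sign matches one
containing an ordinary "k", which is useful for searching but is probably
more than a syntactic token such as a language tag wants.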

In the current case, rather than invoking English or Unicode,
I think it's best to say that these tags are case-insensitive
as defined by RFC 3066 or its successor. I have sent a mail
to the authors of
http://www.ietf.org/internet-drafts/draft-phillips-langtags-07.txt
and to the relevant mailing list, copying the people involved
in this thread, so that this can be clarified in the next draft.

As for RFC 3066, the most reasonable thing to do is to assume that
by context, when it says "case insensitive", it means "case
insensitive as usually used for US-ASCII only" or whatever
exact wording you prefer. There is absolutely no doubt at
all that every participant in the creation and discussion
of RFC 3066 was always assuming this, to the extent that none
of them thought about writing it down.
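
A minimal sketch of that US-ASCII-only reading, in Python: only the letters
A-Z are mapped to a-z before comparing, and every other character has to
match exactly. The names ascii_lower and tags_equal are hypothetical, and
the code illustrates the interpretation argued here, not wording taken
from RFC 3066.

```python
import string

# Translation table covering only the 26 US-ASCII uppercase letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(tag: str) -> str:
    """Lowercase A-Z only; all other characters are left untouched."""
    return tag.translate(_ASCII_LOWER)


def tags_equal(a: str, b: str) -> bool:
    """Hypothetical helper: ASCII-only case-insensitive tag comparison."""
    return ascii_lower(a) == ascii_lower(b)


print(tags_equal("en-US", "EN-us"))  # differs only in ASCII letter case
print(tags_equal("\u212a", "k"))     # KELVIN SIGN is not an ASCII letter
```

With this comparison "en-US" and "EN-us" are the same tag, while a
non-ASCII look-alike such as the Kelvin sign is not silently treated as
"k".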

As for English, English isn't just US-ASCII. If you look at
a good dictionary, you'll see that it occasionally includes
words with diacritics.

Regards,     Martin.

Received on Tuesday, 2 November 2004 06:30:37 UTC