Re: XML Chunk Equality from Martin Duerst on 2004-11-03 (www-tag@w3.org from November 2004)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 03 Nov 2004 19:26:50 +0900
To: Elliotte Harold <elharo@metalab.unc.edu>
Cc: Chris Lilley <chris@w3.org>, Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
Message-Id: <6.0.0.20.2.20041103190800.060a7220@localhost>

At 18:43 04/11/02, Elliotte Harold wrote:
 >
 >Martin Duerst wrote:
 >
 >> I think that first and foremost, the finding should be written in
 >> a way that does not give the impression that perverse stuff is actually
 >> normal. In my opinion, any discussion about case equivalence outside
 >> US-ASCII would easily give the impression that xml:lang values outside
 >> US-ASCII make sense, and we very much agree that they don't.
 >
 >Perhaps we agree here. Perhaps we don't. It's not quite clear. I think
 >the finding should state that characters are compared by mapping a-z to
 >A-Z and all other characters are compared by code point. But at least to
 >that extent it has to be discussed. We don't need to explicitly
 >address characters outside the ASCII range. That can just fall under
 >characters that aren't in the range a-z or A-Z. Would that be acceptable
 >to you?

Well, if you say "other characters are compared codepoint-by-codepoint",
that may immediately lead to the following line of thinking:
"well, of course '-' isn't the upper case version of '_', so this
language may have been added to deal with non-ASCII characters".

So there is still a danger that this may suggest to people that
non-ASCII characters are an option.

So I still think that the best thing to do is to just say that values
of xml:lang are case-insensitive as defined by RFC 3066 or its
successor.
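
To make the two positions concrete, here is a minimal sketch in Python of
the comparison Elliotte describes (fold a-z to A-Z, compare everything else
by code point). The function names ascii_fold and lang_equal are hypothetical
and chosen only for illustration; they come from neither the finding nor any
spec. For values that are valid RFC 3066 tags (ASCII letters, digits and
hyphens only), this should give the same answer as plain case-insensitive
matching.

```python
def ascii_fold(value: str) -> str:
    """Map only a-z to A-Z; leave every other code point untouched."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in value
    )


def lang_equal(a: str, b: str) -> bool:
    """Compare two xml:lang values, ignoring ASCII case only (illustrative)."""
    return ascii_fold(a) == ascii_fold(b)


if __name__ == "__main__":
    print(lang_equal("en-US", "EN-us"))   # differs only in ASCII case
    print(lang_equal("en-US", "en_US"))   # '-' and '_' are distinct code points
    print(lang_equal("\u00e9", "\u00c9"))  # non-ASCII letters are not folded
```

The fold is written out by hand rather than with str.upper(), because
str.upper() also changes non-ASCII characters, which is exactly the
behaviour this discussion wants to keep out of the picture.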

 >> If this is really that serious an issue, I propose that we ask
 >> the XML Core WG to issue an erratum that changes the sentence in
 >> question:
 >>     The values of the attribute are language identifiers as defined
 >>     by [IETF RFC 3066], Tags for the Identification of Languages,
 >>     or its successor; in addition, the empty string MAY be specified.
 >> to something like:
 >>     The values of the attribute MUST be language identifiers as defined
 >>     by [IETF RFC 3066], Tags for the Identification of Languages,
 >>     or its successor; in addition, the empty string MAY be specified.
 >>
 >
 >That would be a substantive change in the spec, not a clarification or
 >an error correction. Such a change requires a new version of XML, not a
 >mere erratum.

We seemed to agree that 'are' is not very good language.
So we have to figure out what exactly was meant by it, and clarify
it. Based on spec history and other circumstances, it seems clear to
me that something like the proposed text above was actually meant.
That would mean that the change above is a clarification.

If you want to claim that it was the intent of the writers of the XML
spec to explicitly allow non-ASCII language tags, you can do so, but
you would have to bring up some good evidence to convince me and others.

 >> Due to the way 'error' is defined
 >> (see http://www.w3.org/TR/REC-xml/#dt-error), parsers MAY
 >> detect and report this, although probably they won't, because
 >> it's difficult to check and impossible to make the check
 >> future-proof.
 >
 >Errors that are neither well-formedness errors nor validity errors are a
 >real pain and a major source of interoperability problems between
 >different parsers. I certainly don't want to add to this list.

I agree. But please note that your insistence on dealing with what
you yourself call a perverse case doesn't really help us avoid
tightening that language.

Maybe a solution would be to make ASCII-only for xml:lang a
well-formedness constraint, while not requiring that a parser
check the syntax details.
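
As a rough sketch of how cheap such a check could be, the following Python
function tests only that a value stays within US-ASCII and deliberately
does not look at RFC 3066 syntax. The name is_ascii_only is hypothetical,
and nothing here is taken from an existing parser.

```python
def is_ascii_only(value: str) -> bool:
    """Return True if every character is in the US-ASCII range (U+0000..U+007F).

    This checks the character range only; it does not validate the
    subtag structure of a language tag.
    """
    return all(ord(ch) < 128 for ch in value)


if __name__ == "__main__":
    for candidate in ["en-US", "", "fran\u00e7ais"]:
        print(repr(candidate), is_ascii_only(candidate))
```

The empty string passes, which matches the sentence in the XML spec that
allows it as an xml:lang value.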

Regards,    Martin.

Received on Thursday, 4 November 2004 00:26:36 UTC