Re: XML Chunk Equality from Martin Duerst on 2004-11-02 (www-tag@w3.org from November 2004)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 02 Nov 2004 15:07:19 +0900
To: Elliotte Harold <elharo@metalab.unc.edu>, Chris Lilley <chris@w3.org>
Cc: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
Message-Id: <6.0.0.20.2.20041102140759.06519658@localhost>

At 18:58 04/10/27, Elliotte Harold wrote:

 >OK, I really hoped I wasn't going to have to say this, but I will climb 
down into the muck. In Clintonesque fashion, it depends on the meaning of 
the word "are". The relevant quote in the spec is,
 >
 >"The values of the attribute are language identifiers as defined by [IETF 
RFC 3066], Tags for the Identification of Languages, or its successor; in 
addition, the empty string MAY be specified."
 >
 >Note what this is not:
 >
 >1. It is not a BNF production
 >2. It is not a well-formedness constraint.
 >3. It is not a validity constraint.
 >4. It is not a compatibility constraint
 >5. It is not any other sort of error.

Just a bit of history here. The first edition of XML, at
http://www.w3.org/TR/1998/REC-xml-19980210.html#sec-lang-tag,
had explicit productions for xml:lang, based on RFC 1766.

This was found to be overly restrictive, because it only allowed
two-letter language codes, limiting the use of xml:lang (if not
XML as such) to a small subset of existing languages. Therefore,
in preparation for the update to RFC 3066 (and potentially later),
the productions [33]-[38] were removed. The second edition of
XML, at http://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag,
still referred to RFC 1766, but also to a potential successor from
the IETF.

All this was done in careful collaboration between the I18N WG
and the XML WG (whatever it was called at that time).

 >The XML spec is very careful to explain exactly what is and is not 
required of XML documents. It carefully defines and uses terms like MUST, 
MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, 
and OPTIONAL and indicates that these, "when EMPHASIZED, are to be 
interpreted as described in [IETF RFC 2119]". The word "are" is not in this 
list. There are no grounds in the spec for interpreting this as any sort of 
constraint on the content of legal XML documents. It's simply mildly sloppy 
writing.

There are indeed no grounds for interpreting the 'are' as a constraint
on parsers, because there is no such constraint. The reason for this is
not that it wouldn't be desirable to have such a constraint, but that
it turned out that languages are living things, and language identifiers
were and still are not understood and developed fully enough to be able
to nail things down on the level of an XML parser.

On the other hand, the word 'are' is a constraint on the content of XML
documents. Whether you want to call that a 'legal' constraint (the XML
REC is not a law, so this looks a bit out of place to me) or what, it
is very clear by language, history, and usage, that this is intended
as a constraint. The fact that it is not enforced by the parser doesn't
change the fact that it is a constraint.

If you use something like xml:lang="Français" [sorry, that won't render
correctly on your side, due to some limitations in my email software],
it's crap, and nothing but crap, whether or not the XML parser tells
you so. The 'are' may be sloppy language, but it doesn't change
the fact that this example is crap.


 >Historically, there were BNF productions in the first edition of the XML 
1.0 spec that seemed to suggest that this was a well-formedness issue. 
However, those BNF productions were not actually reachable from any other 
productions so they had no effect.

This is just a detail, but the fact that they were not reached is not
necessarily relevant. I agree that the spec could have been clearer
in indicating that this was intended to be a well-formedness issue.
It could just have said something like "in order for the value of
xml:lang to be well-formed, it must conform to the following grammar:".
Specs using XML make restrictions similar to this all the time.
That xml:lang is defined in the XML spec doesn't change the fact
that it's very similar to any other attribute.


 >Furthermore, they were deliberately and intentionally removed from the 
2nd edition of the XML 1.0 specification to make it really clear that this 
was not a well-formedness issue.

Yes, but definitely not because anybody wanted to get any crap like
xml:lang="Français" in there. See my explanations above.


 >Bottom line: any well-formed attribute value is a legal value for an 
xml:lang attribute. It's not wise to use such a value, but a finding such 
as this must cover all legal XML infosets, not merely the non-perverse ones.

Glad to see that you are calling xml:lang="Français" 'perverse'; this
is probably even a bit tougher than the word I used, 'crap'.

I think that first and foremost, the finding should be written in
a way that does not give the impression that perverse stuff is actually
normal. In my opinion, any discussion about case equivalence outside
US-ASCII would easily give the impression that xml:lang values outside
US-ASCII make sense, and we very much agree that they don't.
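
As a rough illustration of what a comparison restricted to US-ASCII
would look like, here is a minimal sketch in Python. The function names
ascii_lower and lang_tags_equal are made up for this example and do not
come from any spec; the sketch assumes that only A-Z are folded and that
everything else is compared code point by code point.

```python
def ascii_lower(s: str) -> str:
    """Fold only A-Z to a-z; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def lang_tags_equal(a: str, b: str) -> bool:
    """Compare two xml:lang values, ignoring case within US-ASCII only."""
    return ascii_lower(a) == ascii_lower(b)


print(lang_tags_equal("en-US", "EN-us"))  # True
print(lang_tags_equal("\u212A", "k"))     # False: KELVIN SIGN is not folded
```

The second call shows why the restriction matters: a full Unicode
lowercasing maps U+212A KELVIN SIGN to "k", whereas the ASCII-only
folding above leaves it alone, so no equivalence outside US-ASCII is
implied.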

If this is really that serious an issue, I propose that we ask
the XML Core WG to issue an erratum that changes the sentence in
question:
     The values of the attribute are language identifiers as defined
     by [IETF RFC 3066], Tags for the Identification of Languages,
     or its successor; in addition, the empty string MAY be specified.
to something like:
     The values of the attribute MUST be language identifiers as defined
     by [IETF RFC 3066], Tags for the Identification of Languages,
     or its successor; in addition, the empty string MAY be specified.

Due to the way 'error' is defined
(see http://www.w3.org/TR/REC-xml/#dt-error), parsers MAY
detect and report this, although probably they won't, because
it's difficult to check and impossible to make the check
future-proof.
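
To make the difficulty a bit more concrete: about the only thing that
can be checked cheaply is the syntax of RFC 3066 (a primary subtag of
1 to 8 letters, followed by hyphen-separated subtags of 1 to 8 letters
or digits), not whether the codes are actually registered. A minimal
sketch in Python follows. The name looks_like_rfc3066 is hypothetical,
and the sketch assumes that a successor RFC keeps this syntax, which is
exactly the part that cannot be guaranteed.

```python
import re

# Syntax from RFC 3066: 1*8ALPHA *( "-" 1*8(ALPHA / DIGIT) )
_RFC3066_SYNTAX = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def looks_like_rfc3066(value: str) -> bool:
    """Syntactic check only; does not consult ISO 639 or the IANA registry."""
    if value == "":
        # The XML spec additionally allows the empty string.
        return True
    return _RFC3066_SYNTAX.fullmatch(value) is not None
```

A check like this rejects xml:lang="Français", but it accepts any
well-shaped value whether or not its codes are registered, so it says
nothing about whether the value is a meaningful language identifier.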

Regards,     Martin.
Received on Tuesday, 2 November 2004 06:30:34 UTC