- From: Martin Duerst <duerst@w3.org>
- Date: Tue, 02 Nov 2004 15:07:19 +0900
- To: Elliotte Harold <elharo@metalab.unc.edu>, Chris Lilley <chris@w3.org>
- Cc: Norman Walsh <Norman.Walsh@Sun.COM>, www-tag@w3.org
At 18:58 04/10/27, Elliotte Harold wrote:

>OK, I really hoped I wasn't going to have to say this, but I will climb down into the muck. In Clintonesque fashion, it depends on the meaning of the word "are". The relevant quote in the spec is,
>
>"The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor; in addition, the empty string MAY be specified."
>
>Note what this is not:
>
>1. It is not a BNF production
>2. It is not a well-formedness constraint.
>3. It is not a validity constraint.
>4. It is not a compatibility constraint
>5. It is not any other sort of error.

Just a bit of history here. The first edition of XML, at http://www.w3.org/TR/1998/REC-xml-19980210.html#sec-lang-tag, had explicit productions for xml:lang, based on RFC 1766. This was found to be overly restrictive, because it only allowed two-letter language codes, limiting the use of xml:lang (if not XML as such) to a small subset of existing languages. Therefore, in preparation for the update to RFC 3066 (and potentially later), the productions [33]-[38] were removed. The second edition of XML, at http://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag, still referred to RFC 1766, but also to a potential successor from the IETF. All this was done in careful collaboration between the I18N WG and the XML (whatever at that time) WG.

>The XML spec is very careful to explain exactly what is and is not required of XML documents. It carefully defines and uses terms like MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL and indicates that these, "when EMPHASIZED, are to be interpreted as described in [IETF RFC 2119]". The word "are" is not in this list. There are no grounds in the spec for interpreting this as any sort of constraint on the content of legal XML documents. It's simply mildly sloppy writing.
There are indeed no grounds for interpreting the 'are' as a constraint on parsers, because there is no such constraint. The reason for this is not that it wouldn't be desirable to have such a constraint, but that it turned out that languages are living things, and language identifiers were and are still not understood and developed fully enough to be able to nail things down on the level of an XML parser.

On the other hand, the word 'are' is a constraint on the content of XML documents. Whether you want to call that a 'legal' constraint (the XML REC is not a law, so this looks a bit out of place to me) or what, it is very clear by language, history, and usage, that this is intended as a constraint. The fact that it is not enforced by the parser doesn't change the fact that it is a constraint. If you use something like xml:lang="Français" [sorry, that won't render correctly on your side, due to some limitations in my email software], it's crap, and nothing but crap, whether or not the XML parser tells you so. The 'are' may be sloppy language, but it doesn't change the fact that this example is crap.

>Historically, there were BNF productions in the first edition of the XML 1.0 spec that seemed to suggest that this was a well-formedness issue. However, those BNF productions were not actually reachable from any other productions so they had no effect.

This is just a detail, but the fact that they were not reached is not necessarily relevant. I agree that the spec could have been clearer on indicating that this was intended to be a well-formedness issue. It could just have said something like "in order for the value of xml:lang to be well-formed, it must conform to the following grammar:". Specs using XML make restrictions similar to this all the time. That xml:lang is defined in the XML spec doesn't change the fact that it's very similar to any other attribute.
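To illustrate the gap between what a parser accepts and what the sentence in the spec asks for, here is a minimal Python sketch. It assumes the RFC 3066 syntax (a primary subtag of 1 to 8 ASCII letters, followed by optional subtags of 1 to 8 ASCII letters or digits, separated by hyphens). The names RFC3066_TAG, is_rfc3066_syntax and xml_lang_of are made up for this example, and the check is purely syntactic; it is not something any XML parser is required to do.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Assumed from RFC 3066: Primary-subtag = 1*8ALPHA, Subtag = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def is_rfc3066_syntax(value: str) -> bool:
    """Syntax-only check; the XML spec also allows the empty string."""
    return value == "" or RFC3066_TAG.fullmatch(value) is not None


def xml_lang_of(document: str) -> Optional[str]:
    """Return the xml:lang value of the root element, or None if absent."""
    return ET.fromstring(document).get(XML_LANG)


# The parser accepts the document without complaint...
value = xml_lang_of('<p xml:lang="Français">Bonjour</p>')
# ...but the value does not match the RFC 3066 syntax.
looks_like_a_tag = is_rfc3066_syntax(value)
```

Note that such a check only catches values that are syntactically impossible. A value like "Francais" passes it, although it is no more a sensible language identifier than the example above, which is part of why this is difficult to nail down at the parser level.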
>Furthermore, they were deliberately and intentionally removed from the 2nd edition of the XML 1.0 specification to make it really clear that this was not a well-formedness issue.

Yes, but definitely not because anybody wanted to get any crap like xml:lang="Français" in there. See my explanations above.

>Bottom line: any well-formed attribute value is a legal value for an xml:lang attribute. It's not wise to use such a value, but a finding such as this must cover all legal XML infosets, not merely the non-perverse ones.

Glad to see that you are calling xml:lang="Français" 'perverse', this is probably even a bit tougher than the word I used, 'crap'.

I think that first and foremost, the finding should be written in a way that does not give the impression that perverse stuff is actually normal. In my opinion, any discussion about case equivalence outside US-ASCII would easily give the impression that xml:lang values outside US-ASCII make sense, and we very much agree that they don't.

If this is really that serious an issue, I propose that we ask the XML Core WG to issue an erratum that changes the sentence in question:

    The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor; in addition, the empty string MAY be specified.

to something like:

    The values of the attribute MUST be language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor; in addition, the empty string MAY be specified.

Due to the way 'error' is defined (see http://www.w3.org/TR/REC-xml/#dt-error), parsers MAY detect and report this, although probably they won't, because it's difficult to check and impossible to make the check future-proof.

Regards, Martin.
Received on Tuesday, 2 November 2004 06:30:34 UTC