Re: XML Chunk Equality

Chris Lilley wrote:

> EH> The XML specification does *not* require that the value of an xml:lang
> EH> attribute be ASCII. Well-formed XML documents with meaningful infosets
> EH> can have xml:lang="Français".
> 
> You will need to demonstrate that, with reference to the productions of
> XML, before I can accept it. Currently I consider it an erroneous
> assertion. I understand that you would like XML to be that way, but you
> need to demonstrate that it is.

Easy, it's production 10:


[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
	       |  "'" ([^<&'] | Reference)* "'"


The xml:lang attribute is not treated specially by XML processors (aside 
from the pre-mapping of the xml prefix). It is just like any other 
attribute: some processes may choose to assign it particular meaning, but 
to the XML parser it's just another attribute.
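
To make that concrete, here is a minimal sketch using Python's standard 
xml.etree.ElementTree parser. The helper name xml_lang and the sample 
document are mine, purely for illustration. It only shows that one 
ordinary parser accepts the value and hands it back unchanged; it says 
nothing about what an application should do with it afterwards.

```python
import xml.etree.ElementTree as ET

# Namespace that the reserved "xml" prefix is pre-bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def xml_lang(document: str):
    """Return the root element's xml:lang value, or None if it is absent.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(document)
    # ElementTree exposes namespaced attributes as "{namespace-uri}localname".
    return root.get(f"{{{XML_NS}}}lang")


# The parser applies production 10 (AttValue) and nothing more:
# it does not check the value against RFC 3066.
print(xml_lang('<doc xml:lang="Français">Bonjour</doc>'))
```

The only values the parser rejects are those that break AttValue itself, 
such as an unescaped < inside the quotes.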

> EH>  There is no reason to restrict chunk
> EH> equality to documents that use RFC 3066 language tags.
> 
> You deleted my quotation from the XML spec that said exactly that -
> xml:lang takes an RFC 3066 language tag, or "". I don't find ignoring
> the quote to be convincing argument.

OK, I really hoped I wasn't going to have to say this, but I will climb 
down into the muck. In Clintonesque fashion, it depends on the meaning 
of the word "are". The relevant quote in the spec is,

"The values of the attribute are language identifiers as defined by 
[IETF RFC 3066], Tags for the Identification of Languages, or its 
successor; in addition, the empty string MAY be specified."

Note what this is not:

1. It is not a BNF production.
2. It is not a well-formedness constraint.
3. It is not a validity constraint.
4. It is not a compatibility constraint.
5. It is not any other sort of error.

The XML spec is very careful to explain exactly what is and is not 
required of XML documents. It carefully defines and uses terms like 
MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, 
RECOMMENDED, MAY, and OPTIONAL and indicates that these, "when 
EMPHASIZED, are to be interpreted as described in [IETF RFC 2119]". The 
word "are" is not in this list. There are no grounds in the spec for 
interpreting this as any sort of constraint on the content of legal XML 
documents. It's simply mildly sloppy writing.

Historically, there were BNF productions in the first edition of the XML 
1.0 spec that seemed to suggest that this was a well-formedness issue. 
However, those BNF productions were not actually reachable from any 
other productions, so they had no effect. Furthermore, they were 
deliberately removed from the 2nd edition of the XML 1.0 specification 
to make it really clear that this was not a well-formedness issue.

Bottom line: any well-formed attribute value is a legal value for an 
xml:lang attribute. It's not wise to use such a value, but a finding 
such as this must cover all legal XML infosets, not merely the 
non-perverse ones.
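
For anyone who does want to flag the perverse values, that check belongs 
in the application, layered on top of the parser. A rough sketch, 
assuming the RFC 3066 syntax of a 1 to 8 letter primary subtag followed 
by hyphen-separated subtags of 1 to 8 ASCII letters or digits; the names 
RFC3066_TAG and is_rfc3066_tag are made up for this example. It tests 
syntax only, not whether the tag is actually registered.

```python
import re

# RFC 3066 syntax as I read it:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_rfc3066_tag(value: str) -> bool:
    """Syntactic check only; hypothetical helper for illustration.

    The empty string is accepted because the XML spec says it MAY be
    specified as an xml:lang value.
    """
    return value == "" or RFC3066_TAG.fullmatch(value) is not None


for candidate in ("en-US", "fr", "", "Français"):
    print(repr(candidate), is_rfc3066_tag(candidate))
```

A value that fails this check is still a perfectly well-formed attribute 
value; the check just tells the application that it is not a language 
tag.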

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim

Received on Wednesday, 27 October 2004 09:58:27 UTC