Requiring UTF-8 for XML (was: RE: google sitemaps and some history of sitemaps [siteData-36])

I noticed one other interesting tidbit about the Google sitemaps format. 
At [1] it says:

        "Note: Your Sitemap files must use UTF-8 encoding."

As we know, the XML Recommendation says: "All XML processors MUST accept 
the UTF-8 and UTF-16 encodings of Unicode 3.1". [2]  My reading is that 
restricting the encoding of a particular set of documents cannot strictly 
violate a constraint on the implementation of processors, but there's a 
point of view that such restrictions are not in the spirit of the XML 
Recommendation.

That said, I think it may be that the XML Recommendation is the one that's 
inappropriately prescriptive here.  Building processors that accept more 
than one encoding involves overhead, at least in code size and often in 
run-time efficiency.  For example, there are optimizations you can do in 
handling strings when you know in advance what the encoding will be.  
Besides, larger code can in and of itself reduce efficiency.  I don't think 
the Google folks are the only ones who've concluded that they prefer UTF-8 
only.
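
To make the trade-off concrete, here is a minimal sketch of what a UTF-8 
only front end might look like.  It is purely illustrative: the helper name 
decode_utf8_only and its error messages are my own invention, not anything 
taken from Google's tools or from the Recommendation.  The point is that 
encoding detection collapses to one BOM check, one look at the encoding 
declaration, and a single strict decode.

```python
import codecs
import re

# Hypothetical helper for illustration only; not part of any sitemap tooling.
# Matches an encoding declaration inside a leading XML declaration.
_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def decode_utf8_only(data: bytes) -> str:
    """Return the document text, or raise ValueError if it is not UTF-8."""
    # A UTF-16 byte order mark means the input cannot be UTF-8.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        raise ValueError("UTF-16 input rejected: this consumer is UTF-8 only")
    # Skip an optional UTF-8 byte order mark.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # If an encoding declaration is present, it must name UTF-8.
    match = _DECL_ENCODING.match(data)
    if match and match.group(1).lower() != b"utf-8":
        raise ValueError("declared encoding is not UTF-8")
    try:
        # Strict decoding: any malformed byte sequence raises.
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("malformed UTF-8") from exc
```

A general purpose processor would instead have to carry a second decoding 
path for UTF-16 (and usually more), which is exactly the extra code that a 
constrained consumer gets to leave out.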

Exactly for these sorts of reasons, I've never been completely comfortable 
with the XML Recommendation mixing what I take to be a specification for 
the legal forms of XML documents with the conformance rules for 
processors.  I'm not proposing to separate those two aspects of the 
Recommendation at this late date, but I do wonder whether it would be 
reasonable in future versions of XML to change the rule to something along 
the lines of "All general purpose XML processors MUST accept UTF-8 and 
should accept UTF-16."  That embodies two changes:  by referring to 
"general purpose" it implies that there may be non-general purpose 
processors written to accept constrained forms of XML (as there surely are 
whether the Recommendation admits it or not), and it suggests UTF-8 as the 
lingua franca.

Anyway, I think it's interesting that users of XML are voting with their 
feet and imposing requirements for UTF-8 only XML.  Something to watch.

Noah

[1] https://www.google.com/webmasters/sitemaps/docs/en/protocol.html#sitemapXMLExample
[2] http://www.w3.org/TR/REC-xml/#charsets

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Received on Friday, 10 June 2005 17:18:43 UTC