- From: <noah_mendelsohn@us.ibm.com>
- Date: Fri, 10 Jun 2005 13:18:24 -0400
- To: "Dare Obasanjo" <dareo@microsoft.com>
- Cc: "Dan Connolly" <connolly@w3.org>, www-tag@w3.org, "Paul Grosso" <pgrosso@arbortext.com>, adamb@google.com
I noticed one other interesting tidbit about the Google sitemaps format. At [1] it says: "Note: Your Sitemap files must use UTF-8 encoding." As we know, the XML Recommendation says: "All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1" [2]. My reading is that restricting the encoding of a particular set of documents cannot strictly violate a constraint on the implementation of processors, but there's a point of view that such restrictions are not in the spirit of the XML Recommendation.

That said, I think it may be that the XML Recommendation is the one that's inappropriately prescriptive here. Building processors that accept more than one encoding involves overhead, at least in code size and often in runtime efficiency. For example, there are optimizations you can do in handling strings when you know what the encoding will be. Besides, larger code can in and of itself reduce efficiency. I don't think the Google folks are the only ones who've concluded that they prefer UTF-8 only.

Exactly for these sorts of reasons, I've never been completely comfortable with the XML Recommendation mixing what I take to be a specification for the legal forms of XML documents with the conformance rules for processors. I'm not proposing to separate those two aspects of the Recommendation at this late date, but I do wonder whether it would be reasonable in future versions of XML to change the rule to something along the lines of "All general purpose XML processors MUST accept UTF-8 and should accept UTF-16." That embodies two changes: by referring to "general purpose" it implies that there may be non-general-purpose processors written to accept constrained forms of XML (as there surely are, whether the Recommendation admits it or not), and it suggests UTF-8 as the lingua franca.

Anyway, I think it's interesting that users of XML are voting with their feet and imposing requirements for UTF-8-only XML. Something to watch.
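
As a rough illustration of the trade-off discussed above, here is a minimal Python sketch of the encoding check a UTF-8-only consumer might run before parsing. The function names `sniff_encoding` and `decode_utf8_only` are hypothetical, and the detection is a simplification of the non-normative autodetection heuristics in Appendix F of the XML Recommendation, not a conforming implementation. It says nothing about how Google's sitemap reader actually works.

```python
import codecs
import re

# Hypothetical helpers for illustration; not part of any XML or sitemap library.
# Matches an encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data: bytes) -> str:
    """Guess the document encoding from a BOM or the XML declaration."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # UTF-16 without a BOM: "<?" appears as two 2-byte code units.
    if data.startswith((b"<\x00?\x00", b"\x00<\x00?")):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # With no BOM and no declared encoding, XML defaults to UTF-8.
    return "utf-8"


def decode_utf8_only(data: bytes) -> str:
    """Decode a document, rejecting anything that is not UTF-8."""
    encoding = sniff_encoding(data)
    if encoding != "utf-8":
        raise ValueError(f"only UTF-8 is accepted, got {encoding}")
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Strict decoding: invalid byte sequences raise UnicodeDecodeError.
    return data.decode("utf-8")
```

A general-purpose processor would instead need a decoding path for each encoding that `sniff_encoding` can report, which is the extra code being weighed against the single strict UTF-8 path here.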
Noah

[1] https://www.google.com/webmasters/sitemaps/docs/en/protocol.html#sitemapXMLExample
[2] http://www.w3.org/TR/REC-xml/#charsets

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Friday, 10 June 2005 17:18:43 UTC