- From: <noah_mendelsohn@us.ibm.com>
- Date: Fri, 10 Jun 2005 13:18:24 -0400
- To: "Dare Obasanjo" <dareo@microsoft.com>
- Cc: "Dan Connolly" <connolly@w3.org>, www-tag@w3.org, "Paul Grosso" <pgrosso@arbortext.com>, adamb@google.com
I noticed one other interesting tidbit about the Google sitemaps format.
At [1] it says:
"Note: Your Sitemap files must use UTF-8 encoding."
As we know, the XML Recommendation says that "All XML processors MUST
accept the UTF-8 and UTF-16 encodings of Unicode 3.1". [2] My reading is
that restricting the encoding of a particular set of
documents cannot strictly violate a constraint on the implementation of
processors, but there's a point of view that such restrictions are not in
the spirit of the XML Recommendation.
That said, I think it may be that the XML Recommendation is the one that's
inappropriately prescriptive here. Building processors that accept more
than one encoding involves overhead, at least in code size and often in
run-time efficiency. For example, there are optimizations you can make in handling
strings when you know what the encoding will be. Besides, larger code can
in and of itself reduce efficiency. I don't think the Google folks are
the only ones who've concluded that they prefer UTF-8 only.
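To make the UTF-8-only idea concrete, here is a rough Python sketch of the kind of front-end check such a processor might apply before parsing. It is only an illustration under my own assumptions: the function name is_utf8_only_xml and the regular expression are hypothetical, not taken from the Google documentation or from the XML Recommendation, and a real processor would do more.

```python
import re

# Hypothetical helper (an assumption for illustration, not from any spec or
# library): matches an XML declaration at the start of the input and captures
# the value of its encoding pseudo-attribute.
_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def is_utf8_only_xml(data: bytes) -> bool:
    """Return True if `data` looks acceptable to a UTF-8-only processor."""
    # A UTF-16 byte order mark means the document is not UTF-8.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return False
    # NUL is not a legal XML character; its presence in the raw bytes usually
    # indicates UTF-16 or UTF-32 input that lacks a byte order mark.
    if b"\x00" in data:
        return False
    # Skip an optional UTF-8 byte order mark.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    # If the XML declaration names an encoding, it must be UTF-8.
    match = _DECL_ENCODING.match(data)
    if match and match.group(1).lower() != b"utf-8":
        return False
    # Finally, the bytes themselves must be well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Once a check like this has passed, everything downstream can assume a single encoding, which is where the string-handling optimizations mentioned above would come from.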
Exactly for these sorts of reasons, I've never been completely comfortable
with the XML Recommendation mixing what I take to be a specification for
the legal forms of XML documents with the conformance rules for
processors. I'm not proposing to separate those two aspects of the
Recommendation at this late date, but I do wonder whether it would be
reasonable in future versions of XML to change the rule to something along
the lines of "All general purpose XML processors MUST accept UTF-8 and
should accept UTF-16." That embodies two changes: by referring to
"general purpose" it implies that there may be non-general purpose
processors written to accept constrained forms of XML (as there surely are
whether the Recommendation admits it or not), and it suggests UTF-8 as the
lingua franca.
Anyway, I think it's interesting that users of XML are voting with their
feet and imposing requirements for UTF-8 only XML. Something to watch.
Noah
[1] https://www.google.com/webmasters/sitemaps/docs/en/protocol.html#sitemapXMLExample
[2] http://www.w3.org/TR/REC-xml/#charsets
--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Friday, 10 June 2005 17:18:43 UTC