- From: Ivan Herman <ivan@w3.org>
- Date: Sun, 12 Jun 2005 09:24:47 +0200
- To: noah_mendelsohn@us.ibm.com
- Cc: Dare Obasanjo <dareo@microsoft.com>, Dan Connolly <connolly@w3.org>, www-tag@w3.org, Paul Grosso <pgrosso@arbortext.com>, adamb@google.com
- Message-ID: <42ABE33F.40705@w3.org>
As far as I know (though not being a Unicode expert), while UTF-8 is probably the right encoding (in terms of size, processing, etc.) for documents that use predominantly Latin characters with occasional, say, Chinese characters, that is definitely not true for documents that are predominantly Chinese or Korean, for example. Those would prefer a UTF-16 encoding, because it is much more efficient for them.

In other words, choosing UTF-8 over UTF-16 in this respect might clearly reflect a linguistic bias... which is against the spirit of W3C recommendations.

Ivan

noah_mendelsohn@us.ibm.com wrote:
> I noticed one other interesting tidbit about the Google sitemaps format.
> At [1] it says:
>
> "Note: Your Sitemap files must use UTF-8 encoding."
>
> As we know, the XML Recommendation says: "All XML processors MUST accept
> the UTF-8 and UTF-16 encodings of Unicode 3.1". [2] My reading is that
> restricting the encoding of a particular set of
> documents cannot strictly violate a constraint on the implementation of
> processors, but there's a point of view that such restrictions are not in
> the spirit of the XML Recommendation.
>
> That said, I think it may be that the XML Recommendation is the one that's
> inappropriately prescriptive here. Building processors that accept more
> than one encoding involves overhead at least in code, and often in
> efficiency. For example, there are optimizations you can do in handling
> strings when you know what the encoding will be. Besides, larger code can
> in and of itself reduce efficiency. I don't think the Google folks are
> the only ones who've concluded that they prefer UTF-8 only.
>
> Exactly for these sorts of reasons, I've never been completely comfortable
> with the XML Recommendation mixing what I take to be a specification for
> the legal forms of XML documents with the conformance rules for
> processors. I'm not proposing to separate those two aspects of the
> Recommendation at this late date, but I do wonder whether it would be
> reasonable in future versions of XML to change the rule to something along
> the lines of "All general purpose XML processors MUST accept UTF-8 and
> should accept UTF-16." That embodies two changes: by referring to
> "general purpose" it implies that there may be non-general purpose
> processors written to accept constrained forms of XML (as there surely are
> whether the Recommendation admits it or not), and it suggests UTF-8 as the
> lingua franca.
>
> Anyway, I think it's interesting that users of XML are voting with their
> feet and imposing requirements for UTF-8 only XML. Something to watch.
>
> Noah
>
> [1]
> https://www.google.com/webmasters/sitemaps/docs/en/protocol.html#sitemapXMLExample
> [2] http://www.w3.org/TR/REC-xml/#charsets
>
> --------------------------------------
> Noah Mendelsohn
> IBM Corporation
> One Rogers Street
> Cambridge, MA 02142
> 1-617-693-4036
> --------------------------------------

--
Ivan Herman
W3C Communications Team, Head of Offices
C/o W3C Benelux Office at CWI, Kruislaan 413
1098SJ Amsterdam, The Netherlands
tel: +31-20-5924163; mobile: +31-641044153;
URL: http://www.w3.org/People/Ivan/
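As a rough illustration of the size argument at the top of this message, the following sketch compares the byte counts of a few short strings in UTF-8 and UTF-16. The sample strings and the helper name `encoded_sizes` are made up for this example and come from neither message; it is a minimal sketch, not a measurement of real sitemap files.

```python
# Hypothetical sample strings, chosen only for illustration.
SAMPLES = {
    "English": "Hello, world",
    "Chinese": "你好，世界",
    "Korean": "안녕하세요",
}


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    # "utf-16-le" is used so that Python does not prepend a 2-byte byte order mark.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        utf8, utf16 = encoded_sizes(text)
        print(f"{label}: {len(text)} chars, UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

For these samples the Chinese and Korean text takes 3 bytes per character in UTF-8 and 2 in UTF-16, while the ASCII text takes 1 and 2 respectively. In an actual XML file the markup itself is ASCII, so the overall difference would depend on the ratio of markup to text content.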
Received on Sunday, 12 June 2005 07:25:05 UTC