Re: Requiring UTF-8 for XML (was: RE: google sitemaps and some history of sitemaps [siteData-36]) from Ivan Herman on 2005-06-12 (www-tag@w3.org from June 2005)

From: Ivan Herman <ivan@w3.org>
Date: Sun, 12 Jun 2005 09:24:47 +0200
To: noah_mendelsohn@us.ibm.com
Cc: Dare Obasanjo <dareo@microsoft.com>, Dan Connolly <connolly@w3.org>, www-tag@w3.org, Paul Grosso <pgrosso@arbortext.com>, adamb@google.com
Message-ID: <42ABE33F.40705@w3.org>

As far as I know (though I am not a Unicode expert), UTF-8 is probably the right 
encoding (in terms of size, processing, etc.) for documents that use predominantly Latin 
characters with the occasional, say, Chinese character, but that is definitely not true 
for documents that are predominantly Chinese or Korean, for example. Those would be better 
served by UTF-16, which is more compact for such text (typically two bytes per character 
rather than three). In other words, choosing UTF-8 over UTF-16 in this respect might well 
reflect a linguistic bias... which is against the spirit of W3C recommendations.
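
To make the size argument concrete, here is a minimal sketch in Python that compares the 
encoded length of a mostly-Latin string and a Chinese string. The helper name 
encoded_sizes and both sample strings are made up for illustration; they are not taken 
from the sitemaps documentation. Note that XML markup itself (tag names, quotes, angle 
brackets) is usually ASCII, so the balance for a real document depends on how much of it 
is markup and how much is text.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of text in UTF-8 and in UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so that the 2-byte byte-order mark is not counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


# Hypothetical sample strings, chosen only to illustrate the size difference.
latin = "Sitemap files must use UTF-8"
chinese = "网站地图文件必须使用编码"

for label, sample in (("Latin", latin), ("Chinese", chinese)):
    sizes = encoded_sizes(sample)
    print(f"{label}: {len(sample)} characters, "
          f"{sizes['utf-8']} bytes as UTF-8, {sizes['utf-16']} bytes as UTF-16")
```

For ASCII text, UTF-8 needs one byte per character and UTF-16 two; for the Chinese 
characters above, UTF-8 needs three bytes per character and UTF-16 two.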

Ivan

noah_mendelsohn@us.ibm.com wrote:
> I noticed one other interesting tidbit about the Google sitemaps format. 
> At [1] it says:
> 
>         "Note: Your Sitemap files must use UTF-8 encoding."
> 
> As we know, the XML Recommendation says that "All XML processors MUST accept 
> the UTF-8 and UTF-16 encodings of Unicode 3.1". [2]  My reading is that 
> restricting the encoding of a particular set of 
> documents cannot strictly violate a constraint on the implementation of 
> processors, but there's a point of view that such restrictions are not in 
> the spirit of the XML Recommendation.
> 
> That said, I think it may be that the XML Recommendation is the one that's 
> inappropriately prescriptive here.   Building processors that accept more 
> than one encoding involves overhead at least in code, and often in 
> efficiency.  For example, there are optimizations you can do in handling 
> strings when you know what the encoding will be.  Besides, larger code can 
> in and of itself reduce efficiency.  I don't think the Google folks are 
> the only ones who've concluded that they prefer UTF-8 only.
> 
> Exactly for these sorts of reasons, I've never been completely comfortable 
> with the XML Recommendation mixing what I take to be a specification for 
> the legal forms of XML documents with the conformance rules for 
> processors.  I'm not proposing to separate those two aspects of the 
> Recommendation at this late date, but I do wonder whether it would be 
> reasonable in future versions of XML to change the rule to something along 
> the lines of "All general purpose XML processors MUST accept UTF-8 and 
> should accept UTF-16."  That embodies two changes:  by referring to 
> "general purpose" it implies that there may be non-general purpose 
> processors written to accept constrained forms of XML (as there surely are 
> whether the Recommendation admits it or not), and it suggests UTF-8 as the 
> lingua franca.
> 
> Anyway, I think it's interesting that users of XML are voting with their 
> feet and imposing requirements for UTF-8 only XML.  Something to watch.
> 
> Noah
> 
> [1] 
> https://www.google.com/webmasters/sitemaps/docs/en/protocol.html#sitemapXMLExample
> [2] http://www.w3.org/TR/REC-xml/#charsets
> 
> --------------------------------------
> Noah Mendelsohn 
> IBM Corporation
> One Rogers Street
> Cambridge, MA 02142
> 1-617-693-4036
> --------------------------------------

-- 

Ivan Herman
W3C Communications Team, Head of Offices
C/o W3C Benelux Office at CWI, Kruislaan 413
1098SJ Amsterdam, The Netherlands
tel: +31-20-5924163; mobile: +31-641044153;
URL: http://www.w3.org/People/Ivan/

Received on Sunday, 12 June 2005 07:25:05 UTC