- From: Sam Ruby <rubys@intertwingly.net>
- Date: Thu, 15 Jul 2010 16:43:37 -0400
- To: Richard Ishida <ishida@w3.org>
- CC: 'Leif Halvard Silli' <xn--mlform-iua@xn--mlform-iua.no>, 'Anne van Kesteren' <annevk@opera.com>, public-html@w3.org
On 07/15/2010 04:01 PM, Richard Ishida wrote:
>> From: Sam Ruby [mailto:rubys@intertwingly.net]
>> Sent: 15 July 2010 19:43
> ...
>>> UTF-16 encoded XML documents, on the other hand, must start with a
>>> BOM, see http://www.w3.org/TR/REC-xml/#charencoding When the doc is
>>> treated as XML, however, the meta element is ignored.
>>>
>>> Do you agree?
>>
>> I disagree that UTF-16 encoded XML document must start with a BOM. See:
>>
>> http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info
>>
>> My personal experience (which now may be dated) is that there are a
>> number of XML parsers that choke in the presence of a BOM. But even if
>> we ignore such, there still are quite a few ways to go given this set of
>> information.
>
> Although that section includes methods of detecting utf-16 as well as other
> encodings, I'm basing my conclusion on the following text:
>
> "Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
> with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
> section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF).
> This is an encoding signature, not part of either the markup or the
> character data of the XML document. XML processors MUST be able to use this
> character to differentiate between UTF-8 and UTF-16 encoded documents."
> (http://www.w3.org/TR/REC-xml/#charencoding )
>
> Which seems pretty clear. I will check with some XML gurus.

There is no question that BOMs are allowed by the specification. In my
prior work with SOAP-based Web Services[1] and with feeds, however, I
found that there were a number of cases where this part of the
specification was not respected. In fact, not all "XML" parsers support
utf-16[2].

>> It turns out that <meta charset="utf-16"/> will always be ignored but
>> the content processed correctly if the content is correctly encoded as
>> utf-16. I gather that Richard would prefer that such elements not be
>> treated as conformance errors, whereas Ian would prefer that such
>> elements be treated as conformance errors.
>
> I have two reasons for my preference.
>
> [1] some people will probably add meta elements when using utf-16 encoded
> documents, and there's not any harm in it that I can see, so no real reason
> to penalise them for it.
>
> [2] i18n folks have long advised that you should always include a visible
> indication of the encoding in a document, HTML or XML, even if you don't
> strictly need to, because it can be very useful for developers, testers, or
> translation production managers who want to visually check the encoding of a
> document.

OK, then I'd suggest opening a bug report first against the HTML5 spec
suggesting that it be allowed in both the HTML and XHTML serializations.

http://tinyurl.com/2vvv8vz

Depending on how that bug is resolved, a subsequent bug report against
the polyglot spec would be in order.

Again, I wouldn't object to allowing the combination of utf-16 encoded
content with corresponding meta tags, it just isn't something that I
would personally recommend. It just seems to me that it is clearly in
the "there be dragons" territory, and given the precipices on both sides
of the narrow common path between XML and HTML5, it seems to me to be an
unnecessary distraction.

My (mild) preference is that encodings other than utf-8 be relegated to
a footnote which mentions the possibility and perhaps outlines a few of
the dangers. What we have now is a top level section which permits more
than is currently considered to be conforming (and in the case of some
non-ASCII based encodings, more than actually will work interoperably),
as well as disallowing combinations that will work but may not be
recommended (utf-8 without either a meta tag or an XML declaration).
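
For illustration only, here is a rough sketch of the kind of byte-level
detection discussed above: a BOM if one is present, otherwise the byte
pattern of a leading "<?" as outlined in the non-normative appendix
linked earlier (#sec-guessing-no-ext-info). The function name
sniff_xml_encoding is hypothetical, and the logic is deliberately
simplified to the UTF-8 and UTF-16 cases; it is not how any particular
parser does it.

```python
import codecs
from typing import Tuple


def sniff_xml_encoding(data: bytes) -> Tuple[str, int]:
    """Guess (codec name, BOM length) from the first bytes of an XML entity.

    Hypothetical helper, simplified: only UTF-8 and UTF-16 are considered.
    UTF-32, EBCDIC and the encoding declaration itself are NOT handled.
    """
    # Case 1: a byte order mark is present.
    if data.startswith(codecs.BOM_UTF8):
        return ("utf-8", len(codecs.BOM_UTF8))
    if data.startswith(codecs.BOM_UTF16_BE):
        return ("utf-16-be", len(codecs.BOM_UTF16_BE))
    if data.startswith(codecs.BOM_UTF16_LE):
        return ("utf-16-le", len(codecs.BOM_UTF16_LE))

    # Case 2: no BOM, but the entity starts with "<?" encoded as UTF-16.
    if data.startswith(b"\x00<\x00?"):
        return ("utf-16-be", 0)
    if data.startswith(b"<\x00?\x00"):
        return ("utf-16-le", 0)

    # Case 3: nothing recognisable; fall back to the XML default of UTF-8.
    return ("utf-8", 0)
```

A caller could then decode with something like
data[bom_len:].decode(name), which drops the BOM so that it does not
show up as a U+FEFF character in the content. Note that a UTF-16
document with neither a BOM nor an XML declaration falls through to the
UTF-8 default here, which is one of the ways such content can go wrong.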
- Sam Ruby

[1] http://stackoverflow.com/questions/56812/bom-not-expected-in-cf-but-sent-by-iis-sp
[2] http://www.intertwingly.net/blog/2004/06/03/Aggregator-utf-16-tests.html
Received on Thursday, 15 July 2010 20:44:11 UTC