UTF-16, UTF-16BE and UTF-16LE in HTML5 from Richard Ishida on 2010-07-26 (www-international@w3.org from July to September 2010)

From: Richard Ishida <ishida@w3.org>
Date: Mon, 26 Jul 2010 19:52:09 +0100
To: <public-html@w3.org>, <www-international@w3.org>
Message-ID: <030701cb2cf3$a6eebd10$f4cc3730$@org>
 [bringing in www-international]

This is a follow-on from the thread at
http://lists.w3.org/Archives/Public/public-html/2010Jul/0030.html with
subject renamed.  You should read that thread if you haven't already.

I have summarised in simplified and graphic form my understanding of the
algorithm in HTML5 for detecting character encodings. See
http://www.w3.org/International/2010/07/html5-encoding-detection.png 

This discussion is about what happens when there is no encoding information
in the transport layer.

Please see the explanation from François Yergeau below about use of BOM and
UTF-16, UTF-16BE and UTF-16LE (forwarded with permission). As I understand
it, you should use a BOM if you have identified or labelled the content as
'UTF-16', i.e. with no indication of the endianness. The Unicode Standard
also says that if you have labelled or identified your text as 'UTF-16BE' or
'UTF-16LE', you should not use a BOM (since U+FEFF at the start of such text
should be interpreted as a zero width no-break space, not as a BOM).
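
The practical difference between the labels can be seen in a minimal Python
sketch. It relies only on Python's standard codecs, used here purely as an
illustration (neither the Unicode Standard nor HTML5 prescribes this API):
the generic "utf-16" codec writes and consumes a BOM, while the
endian-specific codecs treat U+FEFF as ordinary text.

```python
import codecs

text = "hi"

# The generic "utf-16" codec prepends a BOM (in the platform's byte order).
generic = text.encode("utf-16")
has_bom = generic[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# The endian-specific codecs write no BOM at all.
big = text.encode("utf-16-be")      # b"\x00h\x00i"
little = text.encode("utf-16-le")   # b"h\x00i\x00"

# Take big-endian bytes that (contrary to the advice above) carry a BOM.
with_bom = codecs.BOM_UTF16_BE + big

# Read as generic UTF-16, the BOM is consumed as an encoding signature.
decoded_generic = with_bom.decode("utf-16")

# Read as UTF-16BE, the same two bytes survive as a U+FEFF character.
decoded_be = with_bom.decode("utf-16-be")
```

Here decoded_generic is the plain text, whereas decoded_be starts with a
stray U+FEFF, which is why a BOM is unwanted under the BE/LE labels.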

HTML5 says:
"If an HTML document does not start with a BOM, and if its encoding is not
explicitly given by Content-Type metadata, and the document is not an iframe
srcdoc document, then the character encoding used must be an
ASCII-compatible character encoding..."
http://dev.w3.org/html5/spec/semantics.html#charset 

This rules out the use of UTF-16BE and UTF-16LE character encodings, since
they should not start with a BOM.

A little later, the spec says:
"If an HTML document contains a meta element with a charset attribute or a
meta element with an http-equiv attribute in the Encoding declaration
state, then the character encoding used must be an ASCII-compatible
character encoding."

This rules out the use of a character encoding declaration with the value
UTF-16, even in content that is encoded in that encoding. 
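
Taken together, the two quoted requirements can be modelled roughly as
follows. This is only a sketch of my reading of those two passages, not of
the full HTML5 algorithm, and the function and parameter names are
hypothetical.

```python
def utf16_permitted(starts_with_bom: bool,
                    charset_in_content_type: bool,
                    is_iframe_srcdoc: bool,
                    has_meta_declaration: bool) -> bool:
    """Return True if neither quoted rule forces an ASCII-compatible
    encoding, i.e. if a UTF-16 family encoding is still permitted."""
    # First quoted rule: no BOM, no Content-Type charset and not a srcdoc
    # document means the encoding must be ASCII-compatible.
    if (not starts_with_bom
            and not charset_in_content_type
            and not is_iframe_srcdoc):
        return False
    # Second quoted rule: any meta encoding declaration also forces an
    # ASCII-compatible encoding, whatever the declaration says.
    if has_meta_declaration:
        return False
    return True


# UTF-16 with a BOM and no meta declaration passes both rules.
bom_only = utf16_permitted(True, False, False, False)
# BOM-less UTF-16BE/LE with nothing in the transport layer fails the first.
no_bom = utf16_permitted(False, False, False, False)
# UTF-16 with a BOM plus <meta charset="utf-16"> fails the second.
bom_and_meta = utf16_permitted(True, False, False, True)
```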

I earlier stated my preference to be able to say that a document is encoded
in UTF-16 in the encoding declaration (in UTF-16 encoded documents, of
course), because:

[1] some people will probably add meta elements when using UTF-16 encoded
documents, and there's not any harm in it that I can see, so no real reason
to penalise them for it.

[2] i18n folks have long advised that you should always include a visible
indication of the encoding in a document, HTML or XML, even if you don't
strictly need to, because it can be very useful for developers, testers, or
translation production managers who want to visually check the encoding of a
document.

I suppose, by logical extension, people will expect that it is also possible
to say that a document is encoded in UTF-16BE or UTF-16LE in the
declaration. That could also lead to an expectation that the encoding
declaration would actually be used to determine the encoding in such cases,
since the file should not then start with a BOM.  In fact, in that case, the
encoding detection would currently be relegated to the browser's
autodetection algorithms, and the spec doesn't currently specify that they
should recognise UTF-16BE and UTF-16LE, as far as I am aware.  The
alternative may be to make it clearer that, although UTF-16 is ok, HTML5 and
XHTML5 do not accept UTF-16BE and UTF-16LE encoding declarations - only
UTF-16 with a BOM (which of course covers the same serialisations).

One way or the other, this appears to constitute another difference between
former XHTML/XML documents and the new polyglot docs which should probably
be documented.

What do people think?

RI






From: François Yergeau [mailto:francois@yergeau.com] 
Sent: 15 July 2010 22:43
To: Richard Ishida
Cc: 'Henry S. Thompson'; msm@w3.org
Subject: Re: FW: i18n comments on Polyglot Markup

Le 2010-07-15 13:06, Richard Ishida a écrit :
> Can you give me any definitive answers on the questions of whether XML
> requires a BOM for UTF-16 encoded documents, and whether XML processors
> choke on the BOM?

It depends on what you mean by "UTF-16 encoded documents".  In the XML 
spec, a "document in the UTF-16 encoding" means (somewhat strangely, I 
would agree) that the document is actually in UTF-16 (OK so far) and 
that the encoding has been identified as "UTF-16".  Not "UTF-16BE" or 
"UTF-16LE", these are different beasts, even though the actual encoding 
is of course the same.  See the third sentence of the first para in 
4.3.3 (http://www.w3.org/TR/REC-xml/#charencoding):

"The terms "UTF-8" and "UTF-16" in this specification do not apply to 
related character encodings, including but not limited to UTF-16BE, 
UTF-16LE, or CESU-8."

So XML parsers are not strictly required to grok UTF-16 documents 
labelled as UTF-16BE/LE.  And the BOM requirement (next sentence in 
4.3.3) does not apply for such documents.

The "UTF-16BE" and "UTF-16LE" labels are defined in RFC 2781, which says 
(Sec. 3.3): "Systems labelling UTF-16BE text MUST NOT prepend a BOM to 
the text" and ditto for UTF-16LE.  This of course applies to XML.

So it all depends on how you label your UTF-16 encoded documents.  If 
you label them UTF-16BE/LE, no BOM is allowed (RFC 2781).  If you label 
them UTF-16, or do not label them (ill-advised), then a BOM is required 
(SHOULD from RFC 2781, MUST from XML spec).
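
These labelling rules can be sketched as a small Python check. The function
name and messages are hypothetical, and the sketch assumes the bytes are
already known to be UTF-16 in one byte order or the other; it only looks at
whether the first two bytes form a BOM.

```python
import codecs
from typing import Optional

UTF16_BOMS = (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)


def check_utf16_labelling(label: Optional[str], data: bytes) -> Optional[str]:
    """Return a description of the problem, or None if label and BOM agree.

    `label` is the declared encoding name, or None for unlabelled content.
    """
    has_bom = data[:2] in UTF16_BOMS
    normalized = label.upper() if label is not None else None

    if normalized in ("UTF-16BE", "UTF-16LE"):
        # RFC 2781, Sec. 3.3: no BOM with an endian-specific label.
        if has_bom:
            return "BOM present but label is " + normalized
        return None

    if normalized in ("UTF-16", None):
        # Labelled plain UTF-16, or unlabelled: a BOM is expected
        # (SHOULD in RFC 2781, MUST in the XML spec).
        if not has_bom:
            return "BOM missing for UTF-16 or unlabelled content"
        return None

    raise ValueError("not a UTF-16 family label: " + str(label))
```

For example, check_utf16_labelling("UTF-16LE", b"h\x00") returns None,
while the same bytes labelled "UTF-16" are reported as missing a BOM.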

As for parsers choking on the BOM, I have no actual experience, but I 
would consider it much more likely with UTF-8 (MAY in XML spec) than 
with UTF-16.  The BOM requirement in UTF-16 goes back to the first 
edition of XML, whereas the explicit allowance for UTF-8 came with the 
3rd edition (2003).  I would suspect that stories about this choking 
date back to when Microsoft started making things like Notepad write out 
a BOM when saving in UTF-8, which is what triggered the clarification in 
XML 3rd edition. UTF-8 BOM was never explicitly disallowed, but people 
generally thought of the BOM only as a byte order mark, not as the 
encoding signature that it really is.  Hence some parsers were not 
prepared when it started appearing in UTF-8.
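
A small Python sketch of how such choking can arise. It uses Python's own
"utf-8" and "utf-8-sig" codecs purely as stand-ins for a BOM-unaware and a
BOM-aware reader; it is not a model of any particular XML parser.

```python
import codecs

# A UTF-8 document saved with a leading BOM (EF BB BF).
payload = codecs.BOM_UTF8 + "<?xml version='1.0'?><r/>".encode("utf-8")

# A reader that treats the bytes as plain UTF-8 keeps U+FEFF as text, so
# the document no longer begins with the "<?xml" it expects to see.
naive = payload.decode("utf-8")
naive_sees_declaration = naive.startswith("<?xml")

# A reader that treats the BOM as an encoding signature strips it first.
aware = payload.decode("utf-8-sig")
aware_sees_declaration = aware.startswith("<?xml")
```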

My 2¢.

-- 
François
Received on Monday, 26 July 2010 18:52:39 UTC