[whatwg] Character encoding of document.open()ed documents from And Clover on 2010-04-01 (public-whatwg-archive@w3.org from March 2010)

From: And Clover <and-py@doxdesk.com>
Date: Thu, 01 Apr 2010 05:26:32 +0200
Message-ID: <4BB41268.1060509@doxdesk.com>

Henri Sivonen wrote:

> Spec change request: Please change the spec to say that document.open()
> sets the document's character encoding to UTF-8

+1. UTF-16 is a troublesome encoding for [X]HTML[5] documents and should
be consistently discouraged; because it is not a superset of ASCII, it
interacts very poorly with the byte-oriented interfaces in HTTP and form
submissions.
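
As a rough illustration of the ASCII-superset point (a sketch only; the
helper name `is_ascii_superset` and the sample strings are mine, not from
any spec), here is how a byte-oriented consumer that splits on `=` fares
with UTF-8 versus UTF-16 data:

```python
def is_ascii_superset(encoding: str, sample: str = "name=value&x=1") -> bool:
    """Return True if ASCII text keeps its exact ASCII bytes in `encoding`."""
    return sample.encode(encoding) == sample.encode("ascii")


# A byte-oriented parser looks for delimiters such as b"=" and b"&".
utf8_body = "name=value".encode("utf-8")
utf16_body = "name=value".encode("utf-16-le")

print(is_ascii_superset("utf-8"))      # True
print(is_ascii_superset("utf-16-le"))  # False
print(utf8_body.split(b"="))           # [b'name', b'value']
print(utf16_body.split(b"="))          # both pieces carry stray NUL bytes
```

The UTF-16 split still "succeeds", which is the problem: the pieces are
misaligned byte strings rather than the intended name and value.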

No browser will actually try to submit a form as UTF-16 for this reason,
but UTF-16 documents still cause problems. For example, Firefox
misleadingly sets the `_charset_` hack field to 'UTF-16' even though the
submission is UTF-8-encoded.
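
To make the mismatch concrete, here is a hedged sketch of a submission
body labelled 'UTF-16' but percent-encoded as UTF-8, and of what happens
to a receiver that trusts the label. The field name `q`, the value, and
all function names are hypothetical; this is not a claim about how any
particular server handles `_charset_`.

```python
from urllib.parse import unquote_to_bytes, urlencode


def build_body(value: str) -> str:
    # The bytes are percent-encoded as UTF-8, but the label claims UTF-16.
    return urlencode({"_charset_": "UTF-16", "q": value}, encoding="utf-8")


def raw_fields(body: str) -> dict:
    """Split a urlencoded body into field name -> undecoded bytes."""
    fields = {}
    for pair in body.split("&"):
        key, _, value = pair.partition("=")
        fields[key] = unquote_to_bytes(value.replace("+", " "))
    return fields


def decode_trusting_label(body: str, name: str) -> str:
    # Hypothetical receiver that believes whatever _charset_ says.
    fields = raw_fields(body)
    charset = fields["_charset_"].decode("ascii")
    return fields[name].decode(charset, errors="replace")


def decode_as_utf8(body: str, name: str) -> str:
    # Receiver that ignores the label and assumes UTF-8.
    return raw_fields(body)[name].decode("utf-8")


body = build_body("café")
print(body)                              # _charset_=UTF-16&q=caf%C3%A9
print(decode_as_utf8(body, "q"))         # café
print(decode_trusting_label(body, "q"))  # mojibake, not 'café'
```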

> even though the parser operates on UTF-16 DOMStrings.

The term 'UTF-16' can mean two very different things: either a sequence
of 16-bit code units (as in DOMString), or a sequence of bytes which,
taken as UTF-16LE or UTF-16BE, represent 16-bit code units. Unicode's
tradition of conflating the meanings of the code unit sequence and the
byte sequence has caused much confusion.
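
A small sketch of the distinction, assuming nothing beyond the standard
codecs (the helper name `utf16_code_units` is mine): the same string has
one code unit sequence but two different byte serialisations.

```python
def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of `text` as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


text = "a\U0001D11E"  # 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF

units = utf16_code_units(text)       # code units: no byte order involved
le_bytes = text.encode("utf-16-le")  # bytes, little-endian
be_bytes = text.encode("utf-16-be")  # bytes, big-endian

print([hex(u) for u in units])  # ['0x61', '0xd834', '0xdd1e']
print(le_bytes.hex(" "))        # 61 00 34 d8 1e dd
print(be_bytes.hex(" "))        # 00 61 d8 34 dd 1e
```

A DOMString corresponds to the first form only; a byte order, and hence
an 'encoding' in the HTTP sense, appears only when the units are written
out as bytes.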

DOM Level 3 LS made the mistake of saying that, because DOMStrings are
sequences of UTF-16 code units, XML documents parsed from
`LSInput.characterStream`/`StringData` should receive the `encoding`
'UTF-16', as if the parser had converted UTF-16 bytes to characters,
though no such conversion has actually taken place. Consequently, when
you serialise a document parsed from a string in DOM Level 3 LS, you get
an unexpected and unwanted UTF-16 document.
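
The effect can be mimicked with a toy model. This is not the DOM Level 3
LS API: `serialise` and `recorded_encoding` are hypothetical names, and
the model only assumes that a serialiser writes a document out in
whatever encoding was recorded when it was parsed, falling back to UTF-8
when none was recorded.

```python
from typing import Optional


def serialise(markup: str, recorded_encoding: Optional[str]) -> bytes:
    """Toy serialiser: use the recorded encoding, or UTF-8 if there is none."""
    encoding = recorded_encoding or "UTF-8"
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'
    return (declaration + markup).encode(encoding)


# Parsed from a string and labelled 'UTF-16', as described above.
labelled = serialise("<doc/>", "UTF-16")

# Parsed from a string with no byte encoding recorded at all.
unlabelled = serialise("<doc/>", None)

print(unlabelled)     # b'<?xml version="1.0" encoding="UTF-8"?><doc/>'
print(len(labelled))  # more than twice as long: BOM plus two bytes per character
```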

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

Received on Wednesday, 31 March 2010 20:26:32 UTC