- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Tue, 3 Jan 2012 10:50:26 +0200
On Tue, Jan 3, 2012 at 10:33 AM, Henri Sivonen <hsivonen@iki.fi> wrote:

> A solution that would border on reasonable would be decoding as
> US-ASCII up to the first non-ASCII byte and then deciding between
> UTF-8 and the locale-specific legacy encoding by examining the first
> non-ASCII byte and up to 3 bytes after it to see if they form a valid
> UTF-8 byte sequence. But trying to gain more statistical confidence
> about UTF-8ness than that would be bad for performance (either due to
> stalling stream processing or due to reloading).

And it's worth noting that the above paragraph states a "solution" only to the problem: "How to make it possible to use UTF-8 without declaring it?" Adding autodetection wouldn't actually force authors to use UTF-8, so the problem Faruk stated at the start of the thread (authors not using UTF-8 throughout systems that process user input) still wouldn't be solved.

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
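A minimal sketch of the heuristic quoted above, in Python. The function name `sniff_encoding` and the windows-1252 fallback (standing in for "the locale-specific legacy encoding") are illustrative assumptions, not anything specified in the thread, and the continuation-byte check is deliberately simplified:

```python
# Sketch of the detection heuristic from the quoted paragraph, assuming
# a hypothetical windows-1252 fallback for "the locale-specific legacy
# encoding".

def sniff_encoding(data: bytes, legacy_fallback: str = "windows-1252") -> str:
    """Guess between UTF-8 and a legacy encoding by looking only at the
    first non-ASCII byte and at most 3 bytes after it."""
    for i, b in enumerate(data):
        if b < 0x80:
            continue  # still plain US-ASCII; keep scanning
        # First non-ASCII byte: does it start a valid UTF-8 sequence?
        if 0xC2 <= b <= 0xDF:
            need = 1  # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2  # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3  # lead byte of a 4-byte sequence
        else:
            return legacy_fallback  # not a valid UTF-8 lead byte
        tail = data[i + 1 : i + 1 + need]
        # Simplified continuation check: a strict validator would also
        # reject overlong and surrogate forms.
        if len(tail) == need and all(0x80 <= t <= 0xBF for t in tail):
            return "utf-8"
        return legacy_fallback
    return "us-ascii"  # no non-ASCII bytes at all

assert sniff_encoding(b"plain ascii") == "us-ascii"
assert sniff_encoding("naïve".encode("utf-8")) == "utf-8"
assert sniff_encoding("naïve".encode("windows-1252")) == "windows-1252"
```

Note that this decision buffers at most 3 bytes past the first non-ASCII byte, which is why the quoted paragraph contrasts it with approaches that gain more statistical confidence at the cost of stalling stream processing or reloading.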
Received on Tuesday, 3 January 2012 00:50:26 UTC