[whatwg] Default encoding to UTF-8? from Henri Sivonen on 2012-04-03 (public-whatwg-archive@w3.org from April 2012)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 3 Apr 2012 14:59:25 +0300
Message-ID: <CAJQvAufCArjo9m_a+qW1ebKgCW9jaZ6JP7QgT+=peDBViExXEQ@mail.gmail.com>
On Wed, Jan 4, 2012 at 12:34 AM, Leif Halvard Silli
<xn--mlform-iua at xn--mlform-iua.no> wrote:
>> I mean the performance impact of reloading the page or,
>> alternatively, the loss of incremental rendering.)
>>
>> A solution that would border on reasonable would be decoding as
>> US-ASCII up to the first non-ASCII byte
>
> Thus possibly prescan of more than 1024 bytes?

I didn't mean a prescan.  I meant proceeding with the real parse and
switching decoders in midstream. This would have the complication of
also having to change the encoding the document object reports to
JavaScript in some cases.

>> and then deciding between
>> UTF-8 and the locale-specific legacy encoding by examining the first
>> non-ASCII byte and up to 3 bytes after it to see if they form a valid
>> UTF-8 byte sequence.
>
> Except for the specifics, that sounds like more or less the idea I
> tried to state. May be it could be made into a bug in Mozilla?

It's not clear that this is actually worth implementing or spending
time on its this stage.

> However, there is one thing that should be added: The parser should
> default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII.

That would break form submissions.

>> But trying to gain more statistical confidence
>> about UTF-8ness than that would be bad for performance (either due to
>> stalling stream processing or due to reloading).
>
> So here you say tthat it is better to start to present early, and
> eventually reload [I think] if during the presentation the encoding
> choice shows itself to be wrong, than it would be to investigate too
> much and be absolutely certain before starting to present the page.

I didn't intend to suggest reloading.

>> Adding autodetection wouldn't actually force authors to use UTF-8, so
>> the problem Faruk stated at the start of the thread (authors not using
>> UTF-8 throughout systems that process user input) wouldn't be solved.
>
> If we take that logic to its end, then it would not make sense for the
> validator to display an error when a page contains a form without being
> UTF-8 encoded, either. Because, after all, the backend/whatever could
> be non-UTF-8 based. The only way to solve that problem on those
> systems, would be to send form content as character entities. (However,
> then too the form based page should still be UTF-8 in the first place,
> in order to be able to take any content.)

Presumably, when an author reacts to an error message, (s)he not only
fixes the page but also the back end.  When a browser makes encoding
guesses, it obviously cannot fix the back end.

> [ Original letter continued: ]
>>> Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding
>>> detection. So it might still be an competitive advantage.
>>
>> It would be interesting to know what exactly Chrome does. Maybe
>> someone who knows the code could enlighten us?
>
> +1 (But their approach looks similar to the 'border on sane' approach
> you presented. Except that they seek to detect also non-UTF-8.)

I'm slightly disappointed but not surprised that this thread hasn't
gained a message explaining what Chrome does exactly.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 3 April 2012 04:59:25 UTC