Re: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization

On Fri, Dec 20, 2013 at 5:48 PM, Phillips, Addison <addison@lab126.com> wrote:
> While I tend to agree that declaring the encoding (any encoding) should be encouraged, I find it somehow strange that the one encoding that can be pretty reliably detected from its bits and which we want to encourage all pages to use is the one encoding we DON'T detect or guess? Shouldn't getting the page right be more important than punishing the authors?

UTF-8 detection is reliable only if the detector has the entire byte
stream available at the time of detection. It would be feasible to
make the entire stream available up front for file: URLs, but not for
http[s] URLs.
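For illustration, here is a rough sketch in Python of what
whole-stream detection amounts to (my example, not a description of
any shipping implementation): UTF-8's multi-byte sequences are
structured tightly enough that text in a legacy encoding almost never
validates by accident.

    def looks_like_utf8(data: bytes) -> bool:
        """Validate a *complete* byte stream as UTF-8."""
        try:
            data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return False
        # Validating AND containing non-ASCII bytes is a strong
        # UTF-8 signal; a pure-ASCII stream is merely compatible
        # with UTF-8 (and with many other encodings).
        return any(b >= 0x80 for b in data)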

In other words, you can't detect UTF-8 reliably when you've only seen
one KB of plain ASCII bytes: you need to commit to an encoding that
the <link rel=stylesheet>s and <script src>s within that first KB of
HTML will inherit, and you don't yet know what kind of bytes come
later in the stream.
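To make the ambiguity concrete (hypothetical first kilobyte, Python
for illustration):

    # A 1 KB all-ASCII prefix decodes identically as UTF-8,
    # windows-1252, ISO-8859-*, etc., so a detector forced to
    # answer at this point has no evidence to distinguish them.
    prefix = b"<!DOCTYPE html><html><head>" + b" " * 997  # 1024 bytes
    assert prefix.decode("utf-8") == prefix.decode("windows-1252")
    # Yet the parser must commit here, so that external styles and
    # scripts referenced in this prefix inherit an encoding, before
    # any disambiguating byte (e.g. 0xE4: U+00E4 in windows-1252,
    # but a three-byte-sequence lead in UTF-8) has arrived.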

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/

Received on Friday, 20 December 2013 16:16:06 UTC