RE: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization

> 
> On Fri, Dec 20, 2013 at 5:48 PM, Phillips, Addison <addison@lab126.com>
> wrote:
> > While I tend to agree that declaring the encoding (any encoding) should be
> > encouraged, I find it somehow strange that the one encoding that can be
> > pretty reliably detected from its bits and which we want to encourage all
> > pages to use is the one encoding we DON'T detect or guess? Shouldn't
> > getting the page right be more important than punishing the authors?
> 
> UTF-8 detection is reliable if the detector has the entire byte stream available
> at the time of detection. It would be feasible to make it so for file: URLs. But
> not for http[s] URLs.
> 
> In other words, you can't detect UTF-8 reliably when you've only seen one KB
> of plain ASCII bytes: you need to commit to an encoding that <link
> rel=stylesheet>s and <script src>s within that first KB of HTML will inherit,
> and you don't yet know what kind of bytes are later in the stream.
> 

UTF-8 detection based on byte sniffing is quite accurate even over very small runs of non-ASCII bytes. If there are no non-ASCII bytes in the first KB of plain text, you're no worse off than you were before. But anything with non-ASCII bytes that matches the UTF-8 encoding pattern is very, very unlikely to be anything else--especially given the growing prevalence of UTF-8. The entire byte stream is not necessary to detect that. It's *non*-UTF-8 encodings that require as much data as possible to make a heuristic guess.
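A rough sketch of the kind of byte-pattern check being described, purely for illustration (Python; the name sniff_utf8 and its return values are invented for this example, not any browser's actual detector, and it omits the overlong/surrogate rejection a full validator would do):

    # Classify the bytes seen so far as "ascii-only" (no signal either way),
    # "utf-8" (at least one non-ASCII sequence fitting the UTF-8 pattern),
    # or "not-utf-8" (a byte sequence UTF-8 cannot produce).
    def sniff_utf8(buf: bytes) -> str:
        saw_multibyte = False
        i = 0
        n = len(buf)
        while i < n:
            b = buf[i]
            if b < 0x80:               # plain ASCII: consistent with everything
                i += 1
                continue
            # Expected sequence length from the lead byte.
            if 0xC2 <= b <= 0xDF:
                length = 2
            elif 0xE0 <= b <= 0xEF:
                length = 3
            elif 0xF0 <= b <= 0xF4:
                length = 4
            else:                      # 0x80-0xC1 and 0xF5-0xFF can't start UTF-8
                return "not-utf-8"
            if i + length > n:         # sequence truncated at buffer edge: undecided
                break
            # Every continuation byte must look like 10xxxxxx.
            if any(buf[i + k] & 0xC0 != 0x80 for k in range(1, length)):
                return "not-utf-8"
            saw_multibyte = True
            i += length
        return "utf-8" if saw_multibyte else "ascii-only"

Fed only the all-ASCII first kilobyte Henri describes, this returns "ascii-only"--no signal either way--but as soon as a non-ASCII run appears, a match against the pattern is a strong indicator of UTF-8, while a Latin-1-style stray byte fails it immediately.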

In any case, I'm not arguing that we should replace the other steps in encoding detection. I'm just noting the irony of going out of our way to *not* detect the encoding we would *prefer* to receive.

Addison

Received on Friday, 20 December 2013 16:26:40 UTC