RE: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization

> 
> On Fri, Dec 20, 2013 at 5:48 PM, Phillips, Addison <addison@lab126.com>
> wrote:
> > While I tend to agree that declaring the encoding (any encoding) should be
> > encouraged, I find it somehow strange that the one encoding that can be
> > pretty reliably detected from its bits and which we want to encourage all
> > pages to use is the one encoding we DON'T detect or guess? Shouldn't
> > getting the page right be more important than punishing the authors?
> 
> UTF-8 detection is reliable if the detector has the entire byte stream available
> at the time of detection. It would be feasible to make it so for file: URLs. But
> not for http[s] URLs.
> 
> In other words, you can't detect UTF-8 reliably when you've only seen one KB
> of plain ASCII bytes: you need to commit to an encoding that <link
> rel=stylesheet>s and <script src>s within that first KB of HTML will inherit,
> and you don't yet know what kind of bytes are later in the stream.
> 

UTF-8 detection based on byte sniffing is quite accurate even over very small runs of non-ASCII bytes. If there are no non-ASCII bytes in the first KB of plain text, you're no worse off than you were before. But anything with non-ASCII bytes that matches the UTF-8 encoding pattern is very, very unlikely to be anything else--especially given the growing prevalence of UTF-8. The entire byte stream is not necessary to detect that. It's *non*-UTF-8 encodings that require as much data as possible to make a heuristic guess.
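A rough sketch of the kind of byte-pattern check being described, purely for illustration (Python; the name sniff_utf8 and its return values are invented for this example, not any browser's actual detector, and it omits the overlong/surrogate rejection a full validator would do):

    # Classify the bytes seen so far as "ascii-only" (no signal either way),
    # "utf-8" (at least one non-ASCII sequence fitting the UTF-8 pattern),
    # or "not-utf-8" (a byte sequence UTF-8 cannot produce).
    def sniff_utf8(buf: bytes) -> str:
        saw_multibyte = False
        i = 0
        n = len(buf)
        while i < n:
            b = buf[i]
            if b < 0x80:               # plain ASCII: consistent with everything
                i += 1
                continue
            # Expected sequence length from the lead byte.
            if 0xC2 <= b <= 0xDF:
                length = 2
            elif 0xE0 <= b <= 0xEF:
                length = 3
            elif 0xF0 <= b <= 0xF4:
                length = 4
            else:                      # 0x80-0xC1 and 0xF5-0xFF can't start UTF-8
                return "not-utf-8"
            if i + length > n:         # sequence truncated at buffer edge: undecided
                break
            # Every continuation byte must look like 10xxxxxx.
            if any(buf[i + k] & 0xC0 != 0x80 for k in range(1, length)):
                return "not-utf-8"
            saw_multibyte = True
            i += length
        return "utf-8" if saw_multibyte else "ascii-only"

Fed only the all-ASCII first kilobyte Henri describes, this returns "ascii-only"--no signal either way--but as soon as a non-ASCII run appears, a match against the pattern is a strong indicator of UTF-8, while a Latin-1-style stray byte fails it immediately.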

In any case, I'm not arguing that we should replace the other steps in encoding detection. I'm just noting the irony of going out of our way to *not* detect the encoding we would *prefer* to receive.

Addison

Received on Friday, 20 December 2013 16:26:40 UTC