- From: Phillips, Addison <addison@lab126.com>
- Date: Fri, 20 Dec 2013 16:25:55 +0000
- To: Henri Sivonen <hsivonen@hsivonen.fi>
- CC: "www-international@w3.org" <www-international@w3.org>
>
> On Fri, Dec 20, 2013 at 5:48 PM, Phillips, Addison <addison@lab126.com>
> wrote:
> > While I tend to agree that declaring the encoding (any encoding) should be
> > encouraged, I find it somehow strange that the one encoding that can be pretty
> > reliably detected from its bits and which we want to encourage all pages to use
> > is the one encoding we DON'T detect or guess? Shouldn't getting the page right
> > be more important than punishing the authors?
>
> UTF-8 detection is reliable if the detector has the entire byte stream available
> at the time of detection. It would be feasible to make it so for file: URLs. But
> not for http[s] URLs.
>
> In other words, you can't detect UTF-8 reliably when you've only seen one KB
> of plain ASCII bytes, you need to commit to an encoding that <link
> rel=stylesheet>s and <script src>s within that first KB of HTML will inherit and
> you don't yet know what kind of bytes are later in the stream.

UTF-8 detection based on byte sniffing is pretty accurate over very small runs of non-ASCII bytes. If there are no non-ASCII bytes in the first KB of plain text, you're no worse off than you were before. But anything with non-ASCII bytes that matches the UTF-8 encoding pattern is very, very unlikely to be anything else, especially given the growing prevalence of UTF-8. The entire byte stream is not necessary to detect that. It's *non*-UTF-8 encodings that require as much data as possible to make a heuristic guess.

In any case, I'm not arguing that we should replace the other steps in encoding detection. I'm just noting the irony of going out of our way to *not* detect the encoding we would *prefer* to receive.

Addison
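(For illustration, a minimal sketch of the kind of byte-pattern check discussed above. The `looks_like_utf8` helper is hypothetical and not any browser's actual detector; it is deliberately simplified — it does not reject overlong or surrogate encodings, and it tolerates a multi-byte sequence that is cut off at the end of the buffer, since in a streaming setting more bytes may still arrive.)

```python
# Hypothetical sketch: is this byte buffer consistent with UTF-8?
# Simplified for illustration; a real validator would also reject
# overlong sequences and UTF-16 surrogate ranges.

def looks_like_utf8(buf: bytes) -> bool:
    """Return True if buf could be (a prefix of) UTF-8-encoded text."""
    i, n = 0, len(buf)
    while i < n:
        b = buf[i]
        if b < 0x80:                 # plain ASCII byte
            i += 1
            continue
        # Lead byte determines how many continuation bytes must follow.
        if 0xC2 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF4:
            need = 3
        else:                        # 0x80-0xC1 and 0xF5-0xFF never start UTF-8
            return False
        for j in range(1, need + 1):
            if i + j >= n:
                return True          # truncated at buffer end: still plausible
            if not 0x80 <= buf[i + j] <= 0xBF:
                return False         # continuation byte out of range
        i += need + 1
    return True

# Example: the same text in UTF-8 passes, in windows-1252 it fails,
# because 0xEF is not followed by continuation bytes.
assert looks_like_utf8("naïve café".encode("utf-8"))
assert not looks_like_utf8(b"na\xefve caf\xe9")
```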
Received on Friday, 20 December 2013 16:26:40 UTC