
Re: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Mon, 23 Dec 2013 00:56:33 +0100
To: Henri Sivonen <hsivonen@hsivonen.fi>
Cc: Jungshik SHIN (신정식) <jshin1987@gmail.com>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <20131223005633032670.d6fb2016@xn--mlform-iua.no>

Henri Sivonen, Sat, 21 Dec 2013 14:16:41 +0200:
> On Fri, Dec 20, 2013 at 6:25 PM, Phillips, Addison wrote:

>> UTF-8 detection based on byte sniffing is pretty accurate over very 
>> small runs of non-ASCII bytes. If there are no non-ASCII bytes in 
>> the first KB of plain text, you're no worse off than you were before.
> No, you'd be worse off than before.

> Consider an accidentally unlabeled UTF-8 site whose HTML template
> fills the first kilobyte of each page with pure-ASCII markup.
> It is a bad idea to introduce such a non-obvious reason for
> varying behavior, since it would waste people's time with
> wild-goose-chase debugging sessions.

Yes, there is a risk that default UTF-8 detection could cause many 
authors to start relying on it and that this, in turn, could cause 
authoring gotchas. To understand this, one simply needs to enable 
encoding detection in Chrome and observe how it works: a non-ASCII 
comment (<!-- ÆØÅ -->) near the <html> start tag could be the feather 
that makes UTF-8 detection kick in, and removing the comment could be 
what makes it fail.
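To make the "feather" concrete, here is a deliberately naive sketch in 
Python (illustrative only: it is not Chrome's actual detector, and the 
function name sniff_utf8 is my own) of a detector that can only report 
UTF-8 when the sniffed prefix contains non-ASCII bytes that decode 
cleanly:

```python
def sniff_utf8(prefix: bytes) -> str:
    """Classify a sniffed prefix (e.g. the first 1 KB of a page).

    Returns 'utf-8' only when the prefix contains non-ASCII bytes
    that decode cleanly as UTF-8; pure ASCII gives no signal at all.
    """
    if all(b < 0x80 for b in prefix):
        return "no signal"        # pure ASCII: detection cannot kick in
    try:
        prefix.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy fallback"  # non-ASCII bytes, but not valid UTF-8

# A pure-ASCII template prefix gives the detector nothing to go on:
print(sniff_utf8(b"<!DOCTYPE html><html><head><title>Hi</title>"))
# -> no signal

# One non-ASCII comment is the "feather" that tips the result:
print(sniff_utf8("<!-- ÆØÅ --><html>".encode("utf-8")))
# -> utf-8
```

Removing that one comment flips the page from "utf-8" back to "no 
signal", which is exactly the kind of non-obvious behavior change 
being discussed.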

But there is also a chance, especially if the gotcha becomes a 
frequent issue, that authors would also discover how to *trigger* 
UTF-8 detection. The snowman 
<http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen> could make 
a comeback. Or maybe the pile of poo character. Why not simply use a 
BOM ...
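The BOM route, at least, is mechanical rather than heuristic. A small 
Python sketch (illustrative, not tied to any browser's behavior):

```python
import codecs

# The UTF-8 BOM is the fixed byte sequence EF BB BF; a document that
# starts with it is unambiguously labelled, no detection required.
print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'

# Python's "utf-8-sig" codec prepends the BOM on encode:
page = "<html><!-- ☃ --></html>".encode("utf-8-sig")
assert page.startswith(codecs.BOM_UTF8)
```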

Thus, whether UTF-8 detection would lead to *frequent* wild-goose-chase 
debugging depends, IMO, on how well and how broadly this system of two 
defaults would be understood, and on the degree to which authors would 
start to trust UTF-8 detection to handle the encoding. (No doubt, by 
the way, the TLD-based default would also cause debugging sessions.)

UTF-8 detection as the last step before fallback would have to be 
promoted as what it is: two defaults, a preferred default (UTF-8) and 
a legacy default. And it should remain non-conforming not to declare 
the encoding, as that would help authors stay aware of the 
double-default issue and would thus work against the wild-goose-chase 
debugging sessions that you predict.

What the idea of UTF-8 detection needs, IMO, is a) a proposal for a 
specific algorithm and b) good use cases: Would UTF-8 detection make 
more authors switch to UTF-8? Would it make switching to UTF-8 easier? 
Why does Europe’s largest social network, www.vk.com, use Windows-1251, 
even for Asian scripts? Could UTF-8 detection make some things work 
better than they do today? How well would it eventually work as a 
“political signal”?
leif halvard silli
Received on Sunday, 22 December 2013 23:57:03 UTC
