- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Mon, 23 Dec 2013 00:56:33 +0100
- To: Henri Sivonen <hsivonen@hsivonen.fi>
- Cc: Jungshik SHIN (신정식) <jshin1987@gmail.com>, "www-international@w3.org" <www-international@w3.org>
Henri Sivonen, Sat, 21 Dec 2013 14:16:41 +0200:
> On Fri, Dec 20, 2013 at 6:25 PM, Phillips, Addison wrote:
>> UTF-8 detection based on byte sniffing is pretty accurate over very
>> small runs of non-ASCII bytes. If there are no non-ASCII bytes in
>> the first KB of plain text, you're no worse off than you were before.
>
> No, you'd be worse off than before.
> Consider an accidentally unlabeled UTF-8 site whose HTML template
> fills the first kilobyte of each page with just pure-ASCII ...
> It is a bad idea to introduce such a non-obvious reason for
> varying behavior, since it would waste people's time with
> wild-goose-chase debugging sessions.

Yes, there is a risk that default UTF-8 detection could cause many
authors to start to rely on it and that this, in turn, could cause
authoring gotchas. To understand this, one simply needs to enable
encoding detection in Chrome and see how it works: a non-ASCII comment
(<!-- ÆØÅ -->) near the <html> start tag could be the feather that
makes UTF-8 detection kick in, and removing the comment could be the
thing that makes it fail.

But there is also a chance - especially if the gotcha becomes a
frequent issue - that authors would discover how to *trigger* UTF-8
detection. The snowman
<http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen> could make
a re-entrance. Or maybe the pile-of-poo character. Why not simply use
a BOM ...

Thus, whether UTF-8 detection would lead to *frequent* wild-goose-chase
debugging depends, IMO, on how well and how broadly this system of two
defaults would be understood, and to what degree authors would start to
trust UTF-8 detection to handle the encoding. (No doubt, btw, the
TLD-based default would also cause debugging sessions.)

UTF-8 detection as the last step before the fallback would have to be
promoted as what it is: two defaults - a preferred default (UTF-8) and
a legacy default. And it should remain non-conforming to not declare
the encoding, as that would help authors stay aware of the
double-default issue and would thus work against the wild-goose-chase
debugging sessions that you predict.

What the idea of UTF-8 detection needs, IMO, is a) a proposal for a
specific algorithm and b) good use cases: Would UTF-8 detection make
more authors switch to UTF-8? Would it make it easier to switch to
UTF-8? Why does Europe’s largest social network, www.vk.com, use
Windows-1251 - even for Asian scripts? Could UTF-8 detection make some
things work better than today? How well would it eventually work as a
“political signal”?
-- 
leif halvard silli
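
PS: For illustration only, a rough sketch in Python - assuming a simple
prefix-based heuristic, not any browser's actual algorithm - of the kind
of gotcha described above, where a non-ASCII comment in the first
kilobyte is what flips the result:

    # Hypothetical prefix sniffer: look at the first 1024 bytes only.
    # If they are pure ASCII there is nothing to detect and the legacy
    # default wins; if they decode as UTF-8, assume UTF-8.
    def sniff_encoding(prefix: bytes, legacy_fallback: str = "windows-1252") -> str:
        head = prefix[:1024]
        if not any(b >= 0x80 for b in head):
            # All-ASCII prefix: the "accidentally unlabeled UTF-8 site"
            # case Henri describes - detection never kicks in.
            return legacy_fallback
        try:
            head.decode("utf-8")  # strict decode; a multi-byte sequence
            return "utf-8"        # cut off at the 1024-byte boundary is
        except UnicodeDecodeError:  # not handled in this sketch
            return legacy_fallback

    print(sniff_encoding(b"<!doctype html><html><head>"))
    # -> windows-1252 (no non-ASCII bytes to sniff in the prefix)
    print(sniff_encoding("<!-- ÆØÅ --><!doctype html>".encode("utf-8")))
    # -> utf-8 (the comment is the feather that tips detection)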
Received on Sunday, 22 December 2013 23:57:03 UTC