- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Mon, 23 Apr 2012 11:19:38 +0300
On Sat, Apr 21, 2012 at 1:21 PM, Anne van Kesteren <annevk at opera.com> wrote:
> This morning I looked into what it would take to define Encoding Sniffing.
> http://wiki.whatwg.org/wiki/Encoding#Sniffing has links as to what I looked
> at (minus Opera internal). As far as I can tell Gecko has the most
> comprehensive approach and should not be too hard to define (though writing
> it all out correctly and clearly will be some work).

The Gecko notes aren't quite right:

 * The detector chosen from the UI is used for HTML and plain text when
   loading those in a browsing context from HTTP GET or from a non-HTTP URL.
   (Not used for POST responses. Not used for XHR.)

 * The default for the UI setting depends on the locale. Most locales
   default to no detector at all. Only zh-TW defaults to the Universal
   detector. (I'm not sure why, but I think this is a bug of *some* kind.
   Perhaps the localizer wanted to detect both Traditional and Simplified
   Chinese encodings and we don't have a detector configuration for
   Traditional & Simplified.) Other locales that default to having a
   detector enabled default to a locale-specific detector (e.g. Japanese or
   Ukrainian).

 * The Universal detector is used regardless of UI setting or locale when
   using the FileReader to read a local file as text. (I'm personally very
   unhappy about this sort of use of heuristics in a new feature.)

 * The Universal detector isn't really universal. In particular, it
   misdetects Central European encodings like ISO-8859-2. (I'm personally
   unhappy that we expose the Universal detector in the UI and thereby bait
   people to enable it.)

 * Regardless of detector setting, when loading HTML or plain text in a
   browsing context, Basic Latin encoded as UTF-16BE or UTF-16LE is
   detected. This detection is not performed by FileReader. (There's a
   rough sketch of this kind of check further down.)

> I have some questions though:
>
> 1) Is this something we want to define and eventually implement the same
> way?

I think yes in principle. In practice, it might be hard to get this done.
E.g. in the case of Gecko, we'd need someone who has no higher-priority work
than rewriting chardet in compliance with the hypothetical spec.

I don't want to enable heuristic detection for all HTML page loads. Yet, it
seems that we can't get rid of it for e.g. the Japanese context. (It's so
sad that the situation is the worst in places that have multiple encodings
and, therefore, logically should be more aware of the need to declare which
one is in use. Sigh.)

I think it is bad that the Web-exposed behavior of the browser depends on
the UI locale of the browser. I think it would be a worthwhile research
project to find out whether it would be feasible to trigger
language-specific heuristic detection on a per-TLD basis instead of on a
per-UI-locale basis (e.g. enabling the Japanese detector for all pages
loaded from .jp and the Russian detector for all pages loaded from .ru
regardless of UI locale, and requiring .com Japanese or Russian sites to get
their charset act together, or maybe having a short list of popular special
cases that don't use a country TLD but don't declare the encoding, either).
There's a strawman sketch of this idea further down.

> 2) Does this need to apply outside HTML? For JavaScript it is forbidden per
> the HTML standard at the moment. CSS and XML do not allow it either. Is it
> used for decoding text/plain at the moment?

Detection is used for text/plain in Gecko when it would be used for
text/html. I think detection shouldn't be used for anything except plain
text and HTML being loaded into a browsing context, considering that we've
managed this far without it (well, except for FileReader).
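To make the Basic Latin UTF-16 check in the list above more concrete, here
is a rough Python sketch of what such a check could look like. This is not
Gecko's actual code; the acceptance criteria (which bytes count as plausible
Basic Latin, and that every pair must match) are my assumptions. Only the
1024-byte window and the zero-byte/ASCII-byte pairing idea come from what is
described in this message.

# Illustration only: not Gecko's algorithm, just the kind of check
# described above. Within the first 1024 bytes, Basic Latin text encoded
# as UTF-16BE looks like (0x00, ASCII byte) pairs and UTF-16LE like
# (ASCII byte, 0x00) pairs.

def _plausible_basic_latin(byte):
    # Printable ASCII plus tab, LF and CR (assumption on my part).
    return 0x20 <= byte <= 0x7E or byte in (0x09, 0x0A, 0x0D)

def sniff_utf16_basic_latin(data):
    prefix = data[:1024]
    if len(prefix) < 4:
        return None
    pairs = [(prefix[i], prefix[i + 1]) for i in range(0, len(prefix) - 1, 2)]
    if all(first == 0x00 and _plausible_basic_latin(second)
           for first, second in pairs):
        return "UTF-16BE"
    if all(_plausible_basic_latin(first) and second == 0x00
           for first, second in pairs):
        return "UTF-16LE"
    return None

# sniff_utf16_basic_latin(b"\x00<\x00h\x00t\x00m\x00l\x00>") -> "UTF-16BE"
# sniff_utf16_basic_latin(b"<\x00h\x00t\x00m\x00l\x00>\x00") -> "UTF-16LE"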
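And here is a purely hypothetical strawman for the per-TLD idea above. The
table contents and names are made up for illustration; which TLDs would map
to which detectors, and what a special-case list would contain, are exactly
the open questions the research project would have to answer.

# Hypothetical sketch of triggering language-specific detection per TLD
# instead of per UI locale. Not a proposal for actual values.

TLD_TO_DETECTOR = {
    "jp": "ja",   # Japanese detector for .jp
    "ru": "ru",   # Russian detector for .ru
    # ... plus possibly a short list of popular non-country-TLD sites
    # that don't declare their encoding.
}

def detector_for_host(host):
    tld = host.rsplit(".", 1)[-1].lower()
    # None means no heuristic detection: the site has to declare its
    # encoding regardless of the user's UI locale.
    return TLD_TO_DETECTOR.get(tld)

# detector_for_host("www.example.jp") -> "ja"
# detector_for_host("example.com")    -> None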
(Note that when they don't declare an encoding of their own, JavaScript and
CSS inherit the encoding of the HTML document that references them.)

> 3) Is there a limit to how many bytes we should look at?

In Gecko, the Basic Latin encoded as UTF-16BE or UTF-16LE check is run on
the first 1024 bytes. For the other heuristic detections, there is no limit,
and changing the encoding potentially causes renavigation to the page.
During an earlier Firefox development cycle, there was a limit of 1024 bytes
(no renavigation!), but it was removed in order to support the Japanese
Planet Debian (site fixed since then) and other unspecified but rumored
Japanese sites.

On Sun, Apr 22, 2012 at 2:11 AM, Silvia Pfeiffer <silviapfeiffer1 at gmail.com> wrote:
> We've had some discussion on the usefulness of this in WebVTT - mostly
> just in relation with HTML, though I am sure that stand-alone video
> players that decode WebVTT would find it useful, too.

WebVTT is a new format with no legacy. Instead of letting it become infected
with heuristic detection, we should go the other direction and hardwire it
as UTF-8 like we did with app cache manifests and JSON-in-XHR. No one should
be creating new content in encodings other than UTF-8. Those who can't be
bothered to use The Encoding deserve REPLACEMENT CHARACTERs. Heuristic
detection is for unlabeled legacy content.

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Monday, 23 April 2012 01:19:38 UTC