
Re: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 19 Dec 2013 18:37:36 +0100
To: Henri Sivonen <hsivonen@hsivonen.fi>
Cc: www-international@w3.org
Message-ID: <20131219183736560836.962ba54a@xn--mlform-iua.no>
Henri Sivonen, Thu, 19 Dec 2013 16:29:37 +0200:

> Chrome seems to have content-based detection for a broader
> range of encodings. (Why?)

Do you mean Google’s “automatic” option found in its encoding 
(sub)menu? By default, that option isn’t enabled. (I wonder whether it 
used to be enabled earlier.) When enabled, it e.g. guesses UTF-8 
regardless of the user’s locale, and it also guesses other encodings 
according to some scheme. (Btw, in Blink-based Opera, there is no 
(sub)menu for selecting the encoding at all.)

> Considering that the encoding of the content browsed is not really a
> function of the UI localization of the browser, though the two are
> often correlated, I have developed a patch for Firefox to make the
> guess based on the top-level domain name of the URL of the document
> when possible.
> Does this seem like a good idea? Good idea if the mapping details are
> tweaked? Bad idea? (Why?)
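As I read it, the proposal amounts to a simple lookup table keyed on the 
top-level domain, falling back to the locale-based guess when the TLD is 
unknown. A rough sketch (the mapping values here are illustrative 
guesses, not Firefox's actual table):

```python
# Hypothetical TLD-to-fallback-encoding table; the entries are
# illustrative examples, not the mapping the Firefox patch uses.
TLD_FALLBACK = {
    "ru": "windows-1251",  # Cyrillic
    "gr": "iso-8859-7",    # Greek
    "jp": "shift_jis",     # Japanese
    "hu": "iso-8859-2",    # Central European
}

def fallback_encoding(hostname: str, locale_fallback: str = "windows-1252") -> str:
    """Guess a fallback encoding from the hostname's top-level domain,
    using the locale-based guess when the TLD is not in the table."""
    tld = hostname.rsplit(".", 1)[-1].lower()
    return TLD_FALLBACK.get(tld, locale_fallback)

print(fallback_encoding("example.ru"))       # windows-1251
print(fallback_encoding("www.example.com"))  # windows-1252 (locale fallback)
```
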

This function would presumably step into the encoding detection 
algorithm at the same place where the locale-based fallback currently 
steps in. As such, it isn’t a given that this method would *replace* 
detection. However, it sounds like you consider it an alternative to 
content-based detection.


* When I filed a bug to get WebKit to do some UTF-8 sniffing, I was 
told, as a negative, that users would then rely on it (instead of 
labeling their content as UTF-8) and that this could decrease 
interoperability. It seems like this feature could potentially have the 
same effect.

* A site might have two domain names, such as foo.com and foo.ru, for 
which one would get different results.
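To make that concrete (a Python sketch, using foo.com and foo.ru purely 
as hypothetical examples of a windows-1252 vs. windows-1251 TLD guess): 
the very same bytes would render differently depending on which domain 
served them.

```python
# The same Cyrillic bytes under two hypothetical TLD-based guesses.
data = "привет".encode("windows-1251")  # bytes as served by the site

print(data.decode("windows-1251"))  # привет  (guess for foo.ru)
print(data.decode("windows-1252"))  # ïðèâåò  (mojibake guess for foo.com)
```
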

* It might not be very useful. It could perhaps have been more useful 
if it had been introduced earlier. At the moment, after 30 minutes on 
Google trying to locate some Cyrillic "Latinifications" (such as 
windows-1251 "я"/"Я" represented as "ÿ"/"ß"), I did not locate a single 
page where this occurred but where the page was *not* labeled as UTF-8 
or suffering from some other mishap of that kind. (But this might of 
course be because Google acts as a localized browser when it finds 
"ÿ"/"ß" in an unlabeled page under .ru.)
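The garbling pattern I was searching for can be reproduced in a couple 
of lines (Python, for illustration):

```python
# windows-1251 encodes "я" as byte 0xFF and "Я" as 0xDF; a browser that
# falls back to windows-1252 shows those same bytes as "ÿ" and "ß".
garbled = "яЯ".encode("windows-1251").decode("windows-1252")
print(garbled)  # ÿß
```
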

> # Goals
>  * Reduce the effect of browser configuration (localization) on how
> the Web renders.

Are there many examples, in real life (nowadays), where different 
localization causes different rendering?

>  * Make it easier for people to read legacy content on the Web
> across-locales without having to use the Character Encoding menu.
>  * Address the potential use cases for Firefox's old
> never-on-by-default combined Chinese, CJK and Universal (not actually
> universal!) detectors without the downsides of heuristic detection.
>  * Avoid introducing new fallback encoding guesses that don't already
> result from guessing based on the browser localization.

The goals sound noble enough.

> # Why is this better than analyzing the content itself?

Browsers would still be free to do that - e.g. for detecting UTF-8? 
After all, (I assume) that is a later step in the encoding detection 
algorithm.
> # How could this be harmful?

>  * This probably lowers the incentive to declare the legacy encoding 
> a little.

leif halvard silli
Received on Thursday, 19 December 2013 17:38:07 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 22:41:03 UTC