Re: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization

On Thu, Dec 19, 2013 at 7:37 PM, Leif Halvard Silli
<xn--mlform-iua@xn--mlform-iua.no> wrote:
> Henri Sivonen, Thu, 19 Dec 2013 16:29:37 +0200:
>
>> Chrome seems to have content-based detection for a broader
>> range of encodings. (Why?)
>
> Do you mean Google’s “automatic” option found in its encoding (sub)
> menu? By default, that option isn’t enabled.

I mean the code seen at
https://mxr.mozilla.org/chromium/source/src/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp#40

It is unclear to me under which conditions this code runs.

> (Btw, in Blink-based Opera, there is no (sub)menu for selecting
> the encoding at all.)

Indeed. Unfortunately, that might change:
https://twitter.com/odinho/status/408682238247972864

> This function then supposedly steps into the encoding detection
> algorithm at the same places where the locale based fallback currently
> steps in.

Right. It adds a step right before that step. The guessing falls
through to the localization-based guess if there's no TLD to go on (no
host, or the host is an IP address) or if the TLD is listed as not
participating in this scheme (e.g. .com).
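
As a rough illustration (a sketch in Python, not the actual code; the
function name, the mapping entries and the non-participating list are
all made up for illustration), the added step would amount to
something like this:

  # Hypothetical sketch of the proposed step; the mapping entries and
  # the non-participating-TLD list are illustrative, not the real data.
  TLD_FALLBACKS = {
      "ru": "windows-1251",
      "tw": "Big5",
      "hk": "Big5",
      "cn": "GBK",
  }
  NON_PARTICIPATING_TLDS = {"com", "net", "org"}

  def guess_fallback(host, locale_fallback):
      # No host (e.g. a data: URL) or an IP-address host: nothing to go on.
      if not host or host.replace(".", "").isdigit():
          return locale_fallback
      tld = host.rsplit(".", 1)[-1].lower()
      # Non-participating TLDs carry no language signal, and unknown TLDs
      # fall through to the existing localization-based guess as well.
      if tld in NON_PARTICIPATING_TLDS or tld not in TLD_FALLBACKS:
          return locale_fallback
      return TLD_FALLBACKS[tld]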

> As such, it isn’t a given that this method would *replace*
> detection. However, it sounds like you consider it as an alternative to
> detection.

I don't expect it to serve as an alternative to the detection that
ships enabled by default in Firefox. (Firefox ships locale-specific
content-based detectors that are enabled by default for the Japanese,
Russian and Ukrainian localizations. These detectors are designed to
choose between various Japanese, Russian or Ukrainian encodings,
respectively. In other locales, content-based detection is off by
default. I suspect we might even be able to remove the Russian and
Ukrainian detectors.) However, the feature is inspired by Firefox's
disabled-by-default detectors.

Specifically, it's inspired by this case:
Suppose you are in Hong Kong or Taiwan and run the Traditional Chinese
localization of Firefox. This means that the fallback encoding
(currently, without the feature I'm proposing) is Big5, which makes
sense for unlabeled legacy Traditional Chinese content. However, this
fallback fails if you want to read Simplified Chinese legacy content
published in the PRC. To the extent that such content is published
under the .cn TLD, you'd have a better experience if the .cn TLD made
your browser apply the guess that the Simplified Chinese localization
currently applies.

This proposal is a generalization of that idea: it makes the guessing
apply to more TLDs than just .cn, .hk and .tw.

> Negatives:
>
> * When I filed a bug to get Webkit to do some UTF-8 sniffing, I was
> told, as a negative thing, that users would then rely on it (instead
> of labeling their code as UTF-8) and that this could decrease
> interoperability.

This proposal never ends up guessing UTF-8, so this never gives anyone
a reason not to declare UTF-8.

> It seems like this feature could potentially have the
> same effect.

The guessing outcomes that can arise from this proposal can already
arise from existing localizations. So if a site is broken by this
change, it would already have appeared broken in the exact same way in
some existing browser localization. In that sense, whatever
interoperability problem this might cause is a pre-existing problem.

But you are right that if browser A implements the proposal and
browser B does not, localization of A for language L and localization
of B for the *same* language L could render a given site differently.

> * A site might have two domain names, such as foo.com and foo.ru, for
> which one would get different results.

True, but it strictly makes the site work more often. (I assume you
mean it's a Russian-language site that uses unlabeled windows-1251.)

Before this change, the site works under both domains in browser
localizations for languages that use the Cyrillic script (and,
therefore, use windows-1251 as the fallback encoding) and fails under
both domains in other browser localizations.

With this proposal, the situation in the .com case is unchanged, but
the .ru case starts working in all browser localizations.
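
In terms of the hypothetical guess_fallback() sketch earlier in this
message (the results are just illustrative):

  # foo.ru: the TLD decides, so the outcome no longer depends on the
  # browser localization.
  guess_fallback("foo.ru", "windows-1252")   # -> "windows-1251"
  guess_fallback("foo.ru", "windows-1251")   # -> "windows-1251"
  # foo.com: non-participating TLD, so the localization-based guess
  # still applies, exactly as before.
  guess_fallback("foo.com", "windows-1252")  # -> "windows-1252" (still broken)
  guess_fallback("foo.com", "windows-1251")  # -> "windows-1251" (still works)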

> * It might not be very useful. It could perhaps have been more
> useful if it had been introduced earlier.

We are definitely in diminishing-returns territory already. If you
look at the Firefox sessions in which the Character Encoding menu is
used relative to all Firefox sessions, the usage might as well round
to zero:
http://telemetry.mozilla.org/#release/26/CHARSET_OVERRIDE_USED

However, if you break the data down by locale
(https://bug906032.bugzilla.mozilla.org/attachment.cgi?id=796536) we
learn the following:
 * The fallbacks still matter, because localizations that chose an
inappropriate fallback (UTF-8, which isn't a legacy encoding; already
fixed) top the list.
 * If it weren't for those localizations, Traditional Chinese (zh-TW;
Firefox uses that code instead of zh-Hant for historical reasons)
would top the list, which suggests the use case inspiring this
proposal (see above) is still relevant.
 * Curiously, Korean is rather high on the list even though Korean has
a single non-windows-1252 legacy encoding, so, all things being equal,
it shouldn't rank higher than e.g. Hebrew; apparently, all things
aren't equal.

Because the CJK locales stand out, the time is not yet ripe for
getting rid of the Character Encoding menu. I'm hoping that especially
guessing based on the CJK-related TLDs would reduce the need for
users of the CJK locales (zh-TW especially) to resort to the Character
Encoding menu, so that down the road, we could get rid of the menu and
the cost of dealing with all the edge cases associated with having the
menu. OTOH, I want to avoid the Character Encoding menu getting
introduced to Firefox OS.

> At the moment, after 30 minutes on
> Google trying to locate some Cyrillic "Latinifications" (such as a
> Windows "я"/"Я" represented as "ÿ"/"ß"), I did not locate a single page
> where this was the case but where the page was *not* labeled as UTF-8
> or subject to some other mishap of that kind. (But this might of course
> be because Google acts as a localized browser when it finds "ÿ"/"ß" in
> an unlabeled page under .ru.)

As John Cowan noted, Google Search is likely papering over problems of
this nature, so you can't locate the problems using Google Search.

> Are there many examples, in real life (nowadays), where different
> localization causes different rendering?

The languages that are associated with a fallback other than
windows-1252 when used as a Firefox UI language are listed in
https://mxr.mozilla.org/mozilla-central/source/dom/encoding/localesfallbacks.properties

Since the list is non-empty, there evidently is opportunity for
different localizations to render unlabeled sites differently.
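
If it helps to picture the mechanism, the lookup amounts to something
like the sketch below (the key=value format and the entries shown are
assumptions for illustration, not copied from the real file):

  # Sketch of the locale-to-fallback lookup, assuming a simple
  # "language=encoding" properties format; the entries are illustrative.
  SAMPLE_PROPERTIES = """\
  ru=windows-1251
  uk=windows-1251
  ja=Shift_JIS
  zh-TW=Big5
  """

  def load_fallbacks(text):
      table = {}
      for line in text.splitlines():
          line = line.strip()
          if not line or line.startswith("#"):
              continue
          key, value = line.split("=", 1)
          table[key.strip()] = value.strip()
      return table

  def fallback_for_locale(locale, table):
      # Anything not listed gets windows-1252.
      return table.get(locale, "windows-1252")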

> The goals sound noble enough.
>
>> # Why is this better that analyzing the content itself?
>
> Browsers would still be free to do that - e.g. for detecting UTF-8?

Except maybe for file: URLs (where we could make the UA look at the
whole file before displaying anything), I think detecting UTF-8 is a
bad idea for reasons briefly mentioned in the message that started
this thread. However, I am not advocating a ban on all content-based
sniffing, because I currently believe that we won't be able to get rid
of sniffing for the various Japanese encodings. Other than sniffing
for Japanese encodings, I'd be in favor of banning content-based
sniffing (unless research shows that we also really need it for the
Russian and Ukrainian cases). But that's outside the scope of this
proposal.
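
For the file: case, what I have in mind is roughly the following
sketch (an assumption about how it could work, not actual browser
code): treat the file as UTF-8 only if the entire byte stream
validates as UTF-8.

  # Whole-file UTF-8 detection for file: URLs: the file is treated as
  # UTF-8 only if every byte of it decodes as valid UTF-8.
  def looks_like_utf8(data: bytes) -> bool:
      try:
          data.decode("utf-8")
          return True
      except UnicodeDecodeError:
          return False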

On Thu, Dec 19, 2013 at 9:14 PM, John Cowan <cowan@mercury.ccil.org> wrote:
> Henri Sivonen scripsit:
>
>> Chrome seems to have content-based detection for a broader range of
>> encodings. (Why?)
>
> Presumably because they believe it improves the user experience;
> I don't know for sure.

Presumably, of course.

I was hoping that someone involved in introducing this difference
between Chrome and Safari would elaborate on the motivation in more
detail.

> What I do know is that Google search attempts to
> convert every page it spiders to UTF-8, and that they rely on encoding
> detection rather than (or in addition to) declared encodings.  In
> particular, certain declared encodings such as US-ASCII, 8859-1, and
> Windows-1252, are considered to provide *no* encoding information.

Indeed, considering Firefox telemetry data globally, it is more common
to use the Character Encoding menu to override a declared encoding
than to override a bad fallback guess. (Sadly, it's also more common
to override a previous override than to override a bad fallback guess,
which means that users who do use the menu do it by trial and error.)
See http://telemetry.mozilla.org/#release/26/CHARSET_OVERRIDE_SITUATION
for the numbers and
https://mxr.mozilla.org/mozilla-central/source/docshell/base/nsDocShell.cpp#1992
for what the buckets mean.

Still, I think browsers should not start disrespecting site-supplied
encoding labels. Browsers disrespecting site-supplied MIME types
turned into enough of a mess. (Though the blame falls on server
developers, specifically Apache developers refusing for years to fix
their bug. In both cases, I strongly disapprove of default server
configurations supplying values, because when the values come from
server defaults, they may have no relation to actual content served.)

> Before modifying existing encoding-detection schemes, I would ask
> someone at Google (or another company that spiders the Web extensively)
> to find out just how much superior the revised scheme would be when
> applied to the existing Web, rather than trusting to _a priori_
> arguments.

This proposal doesn't change anything except the step that is
currently based on guessing from the browser UI language. The browser
UI language is not visible in Google's index, so the situation before
this proposal is not something that can be determined from the index.
I was thinking of measuring success by comparing Firefox's Character
Encoding menu usage telemetry between the last release without this
feature and the first release with it.

Also, a software-only benchmark of TLD-based guessing only works if
there already is a (near) perfect content-based detector, so there's
the risk of faulty results if the detector used for comparison is
faulty.

However, it would be awesome if someone with access to global Web
crawl data produced a sample of unlabeled pages under each non-obvious
TLD (there's no point in doing this for obviously
windows-1252-affiliated TLDs like .fi) to allow human inspection of a
small sample of pages to validate the mapping.
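
Roughly what I have in mind, sketched in Python; the record format
(host, Content-Type header, page bytes), the charset heuristics and
the TLD list are all assumptions made up for illustration:

  import random
  import re
  from collections import defaultdict

  CHARSET_RE = re.compile(rb"charset\s*=", re.IGNORECASE)
  OBVIOUS_1252_TLDS = {"fi", "se", "de", "nl"}  # illustrative list

  def sample_unlabeled(records, per_tld=50, seed=42):
      """records: iterable of (host, content_type_header, html_bytes)."""
      rng = random.Random(seed)
      buckets = defaultdict(list)
      for host, content_type, html in records:
          tld = host.rsplit(".", 1)[-1].lower()
          if tld in OBVIOUS_1252_TLDS:
              continue
          # "Unlabeled" means no charset in the HTTP header and no
          # charset declaration in the first 1024 bytes of the markup.
          labeled = ("charset=" in (content_type or "").lower()
                     or CHARSET_RE.search(html[:1024]) is not None)
          if not labeled:
              buckets[tld].append(host)
      # Keep a small random sample per TLD for human inspection.
      return {tld: rng.sample(hosts, min(per_tld, len(hosts)))
              for tld, hosts in buckets.items()}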

>>  * The domain name is a country TLD whose legacy encoding affiliation
>> I couldn't figure out: .ba, .cy, .my. (Should .il be here in case
>> there's windows-1256 legacy in addition to windows-1255 legacy?)
>
> 1256 is Arabic, 1255 is Hebrew,

Right.

> so I assume you meant the other way around.

I've re-read the sentence a few times and I think it makes sense:
"Should [Israel] be [on the list of non-participating TLDs] in case
there's [Arabic encoding] legacy in addition to [Hebrew encoding]
legacy?" I think the question is reasonable given that 18% of the
population (according to Wikipedia) report Arabic as their mother
tongue and all major browsers have Arabic localizations.

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/

Received on Friday, 20 December 2013 14:16:09 UTC