Re: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization from John Cowan on 2013-12-20 (www-international@w3.org from October to December 2013)

From: John Cowan <cowan@mercury.ccil.org>
Date: Fri, 20 Dec 2013 11:35:53 -0500
To: Henri Sivonen <hsivonen@hsivonen.fi>
Cc: www-international@w3.org
Message-ID: <20131220163553.GF29284@mercury.ccil.org>

Henri Sivonen scripsit:

> The browser UI language is not visible from Google's index, so the
> situation before this proposal is not something that can be determined
> from Google's index.

Not from the index, no.  But it is visible from the clickstream data of
people clicking through Google search results.

> I was thinking of measuring success by comparing Firefox's Character
> Encoding menu usage telemetry data in the last release without this
> feature and the first release with this feature.

That's a reasonable approximation when the guess is way off.  When I see
a page labeled as 8859-1 that is really UTF-8, though, I may or may not
force it to be UTF-8; sometimes I just read through the UTF-8.  I don't
know if that's typical or not.
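
To illustrate why such a page can still be read without touching the
menu, here is a minimal Python sketch. The sample string is made up,
and Python's "latin-1" codec stands in for the browser's 8859-1
handling (an assumption; browsers actually treat that label as
windows-1252). ASCII letters survive the mislabeling and only the
non-ASCII characters turn into two-character sequences.

```python
# Hypothetical sample text; any mostly-ASCII string with a few accents works.
text = "Päivää, señor"

# What the server actually sends: UTF-8 bytes.
raw = text.encode("utf-8")

# What the reader sees when the page is labeled 8859-1.
misread = raw.decode("latin-1")
print(misread)

# The damage is reversible, which is why forcing UTF-8 in the menu fixes it.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == text)
```

For text that is mostly non-ASCII (Cyrillic or Hebrew, say) nearly every
character is mangled, so the menu gets used much more readily there.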

> Also, a software-only benchmark of TLD-based guessing only works if
> there already is a (near) perfect content-based detector, so there's
> the risk of faulty results if the detector used for comparison is
> faulty.

Granted.  Even so, search engines have a pretty strong incentive to
get encodings right, as it has a big impact on the accuracy of search
results.

>
> However, it would be awesome if someone with access to global Web
> crawl data produced a sample of unlabeled pages under each non-obvious
> TLD (no point in doing this for obviously windows-1252-affiliated TLDs
> like .fi) to allow human inspection of a small sample of pages to
> validate the mapping.

Indeed!
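
For concreteness, the lookup being validated might look roughly like
the Python sketch below. It is not Firefox's actual code. The table and
function names are hypothetical, only the .fi entry comes from this
thread, and the .il entry is an assumption added for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical mapping; a real one would need the human validation
# discussed above. Only "fi" is taken from the thread.
TLD_FALLBACK = {
    "fi": "windows-1252",
    "il": "windows-1255",  # assumption: Hebrew legacy encoding
}

def fallback_encoding(url, locale_fallback="windows-1252"):
    """Guess a fallback encoding for an unlabeled page.

    Tries the top-level domain first, then falls back to the guess
    derived from the browser localization (passed in by the caller).
    """
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return TLD_FALLBACK.get(tld, locale_fallback)

print(fallback_encoding("http://example.fi/"))
print(fallback_encoding("http://example.com/", locale_fallback="Shift_JIS"))
```

A TLD left out of the table behaves like a non-participating one: the
locale-based guess is used unchanged.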

> I've re-read the sentence a few times and I think my sentence makes
> sense: "Should [Israel] be [on the list of non-participating TLDs] in
> case there's [Arabic encoding] legacy in addition to [Hebrew encoding]
> legacy?"

Ah, I see.  Yes, you are right; the sentence was a bit too elliptical
for me.  The question, then, is whether there's a lot of Arabic content
in .il addresses (as opposed to whether there are a lot of arabophones
in Israel).

-- 
Normally I can handle panic attacks on my own;   John Cowan <cowan@ccil.org>
but panic is, at the moment, a way of life.      http://www.ccil.org/~cowan
                --Joseph Zitt

Received on Friday, 20 December 2013 16:36:15 UTC