- From: John Cowan <cowan@mercury.ccil.org>
- Date: Fri, 20 Dec 2013 11:35:53 -0500
- To: Henri Sivonen <hsivonen@hsivonen.fi>
- Cc: www-international@w3.org
Henri Sivonen scripsit:
> The browser UI language is not visible from Google's index, so the
> situation before this proposal is not something that can be determined
> from Google's index.
Not from the index, no. But it is visible from the clickstream data of
people clicking through Google search results.
> I was thinking of measuring success by comparing Firefox's Character
> Encoding menu usage telemetry data in the last release without this
> feature and the first release with this feature.
That's a reasonable approximation when the guess is way off. When I see
a page labeled as 8859-1 that is really UTF-8, though, I may or may not
force it to be UTF-8; sometimes I just read through the UTF-8. I don't
know if that's typical or not.
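To illustrate why a UTF-8 page mislabeled as 8859-1 can still be read without touching the menu, here is a minimal Python sketch. The sample string and the function names are illustrative assumptions, not taken from any browser code. It decodes UTF-8 bytes under the wrong label and then reverses the mistake:

```python
def misdecode_utf8_as_latin1(text: str) -> str:
    """Return what a reader sees when UTF-8 bytes are decoded as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair_latin1_mojibake(garbled: str) -> str:
    """Undo the mislabeling: recover the bytes, then decode them as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


# Illustrative sample: mostly ASCII, with a few non-ASCII letters.
original = "Hyvää päivää"
garbled = misdecode_utf8_as_latin1(original)
print(garbled)                          # each "ä" becomes a two-character sequence
print(repair_latin1_mojibake(garbled))  # round-trips back to the original
```

For mostly-ASCII text only the non-ASCII letters are damaged, which is presumably why such a page stays legible enough that the override never gets used and never shows up in menu telemetry.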
> Also, a software-only benchmark of TLD-based guessing only works if
> there already is a (near) perfect content-based detector, so there's
> the risk of faulty results if the detector used for comparison is
> faulty.
Granted. Even so, search engines have a pretty strong incentive to
get encodings right, as it has a big impact on the accuracy of search
results.
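A software-only benchmark of the kind discussed here might look like the following Python sketch. Everything in it is an assumption made for illustration: the TLD table is a hypothetical stand-in rather than the proposal's actual mapping, and `reference_detector` is a placeholder for whatever content-based detector is trusted as ground truth. As noted above, the result is only as good as that detector.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical TLD-to-fallback-encoding table, for illustration only.
TLD_FALLBACK = {
    "fi": "windows-1252",
    "ru": "windows-1251",
    "il": "windows-1255",
}


def guess_from_tld(host: str) -> Optional[str]:
    """Guess an encoding from the last label of the host name, if it is listed."""
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return TLD_FALLBACK.get(tld)


def agreement_rate(
    pages: Iterable[Tuple[str, bytes]],
    reference_detector: Callable[[bytes], str],
) -> float:
    """Fraction of pages with a TLD guess where that guess matches the detector."""
    compared = 0
    agreed = 0
    for host, body in pages:
        guess = guess_from_tld(host)
        if guess is None:
            continue  # TLD not in the table, so there is nothing to compare
        compared += 1
        if reference_detector(body) == guess:
            agreed += 1
    return agreed / compared if compared else 0.0
```

Running this per TLD over a sample of unlabeled pages would give one agreement figure per TLD, and a low figure would flag a mapping worth checking by hand.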
>
> However, it would be awesome if someone with access to global Web
> crawl data produced a sample of unlabeled pages under each non-obvious
> TLD (no point in doing this for obviously windows-1252-affiliated TLDs
> like .fi) to allow human inspection of a small sample of pages to
> validate the mapping.
Indeed!
> I've re-read the sentence a few times and I think my sentence makes
> sense: "Should [Israel] be [on the list of non-participating TLDs] in
> case there's [Arabic encoding] legacy in addition to [Hebrew encoding]
> legacy?"
Ah, I see. Yes, you are right; the sentence was a bit too elliptical
for me. The question, then, is whether there's a lot of Arabic content
in .il addresses (as opposed to whether there are a lot of arabophones
in Israel).
--
John Cowan <cowan@ccil.org> http://www.ccil.org/~cowan
Normally I can handle panic attacks on my own;
but panic is, at the moment, a way of life.
--Joseph Zitt
Received on Friday, 20 December 2013 16:36:15 UTC