- From: John Cowan <cowan@mercury.ccil.org>
- Date: Fri, 20 Dec 2013 11:35:53 -0500
- To: Henri Sivonen <hsivonen@hsivonen.fi>
- Cc: www-international@w3.org
Henri Sivonen scripsit:

> The browser UI language is not visible from Google's index, so the
> situation before this proposal is not something that can be determined
> from Google's index.

Not from the index, no. But it is visible from the clickstream data of
people clicking through Google search results.

> I was thinking of measuring success by comparing Firefox's Character
> Encoding menu usage telemetry data in the last release without this
> feature and the first release with this feature.

That's a reasonable approximation when the guess is way off. When I see
a page labeled as 8859-1 that is really UTF-8, though, I may or may not
force it to be UTF-8; sometimes I just read through the UTF-8. I don't
know if that's typical or not.

> Also, a software-only benchmark of TLD-based guessing only works if
> there already is a (near) perfect content-based detector, so there's
> the risk of faulty results if the detector used for comparison is
> faulty.

Granted. Even so, search engines have a pretty strong incentive to get
encodings right, as it has a big impact on the accuracy of search
results.

> However, it would be awesome if someone with access to global Web
> crawl data produced a sample of unlabeled pages under each non-obvious
> TLD (no point in doing this for obviously windows-1252-affiliated TLDs
> like .fi) to allow human inspection of a small sample of pages to
> validate the mapping.

Indeed!

> I've re-read the sentence a few times and I think my sentence makes
> sense: "Should [Israel] be [on the list of non-participating TLDs] in
> case there's [Arabic encoding] legacy in addition to [Hebrew encoding]
> legacy?"

Ah, I see. Yes, you are right; the sentence was a bit too elliptical
for me. The question, then, is whether there's a lot of Arabic content
in .il addresses (as opposed to whether there are a lot of arabophones
in Israel).

-- 
Normally I can handle panic attacks on my own;   John Cowan <cowan@ccil.org>
but panic is, at the moment, a way of life.      http://www.ccil.org/~cowan
        --Joseph Zitt
Received on Friday, 20 December 2013 16:36:15 UTC