Re: Re: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization from Henri Sivonen on 2013-12-21 (www-international@w3.org from October to December 2013)

From: Henri Sivonen <hsivonen@hsivonen.fi>
Date: Sat, 21 Dec 2013 14:16:41 +0200
To: Jungshik SHIN (신정식) <jshin1987@gmail.com>
Cc: "www-international@w3.org" <www-international@w3.org>
Message-ID: <CANXqsR+H1y=8DOrr2RweAV1s3oiBJzgFK54SwUiEv698kZPqQw@mail.gmail.com>

On Fri, Dec 20, 2013 at 10:58 PM, Jungshik SHIN (신정식)
<jshin1987@gmail.com> wrote:
> On Dec 19, 2013 11:16 AM, "John Cowan" <cowan@mercury.ccil.org> wrote:
>> Henri Sivonen scripsit:
>>
>> > Chrome seems to have content-based detection for a broader range of
>> > encodings. (Why?)
>>
>> Presumably because they believe it improves the user experience;
>
> It is off by default. Even when it is on, it is only used in the absence
> of an explicit declaration either via http c-t header or meta tag. It never
> overrides the declared encoding.

Cool. Why did you feel it was worthwhile to add an off-by-default
feature like that--especially when WebKit already had code for
Japanese sniffing and ap@apple seems to think that's enough? It looks
pretty alarming to see a browser *add* to its content-based sniffing
capability!

For clarity: is "hamburger menu > Tools > Encoding > Auto Detect" the UI
for this?

Does this mean that any sort of content-based autodetection is off by
default even for the Japanese localization of Chrome?

On Fri, Dec 20, 2013 at 6:35 PM, John Cowan <cowan@mercury.ccil.org> wrote:
> Henri Sivonen scripsit:
>
>> The browser UI language is not visible from Google's index, so the
>> situation before this proposal is not something that can be determined
>> from Google's index.
>
> Not from the index, no.  But it is visible from the clickstream data of
> people clicking through Google search results.

Oh I see.

>> I was thinking of measuring success by comparing Firefox's Character
>> Encoding menu usage telemetry data in the last release without this
>> feature and the first release with this feature.
>
> That's a reasonable approximation when the guess is way off.  When I see
> a page labeled as 8859-1 that is really UTF-8, though, I may or may not
> force it to be UTF-8; sometimes I just read through the UTF-8.  I don't
> know if that's typical or not.

As an aside: It would be interesting to know how Greek users react to
broken Ά, since Firefox and Chrome guess ISO-8859-7 and IE guesses
windows-1253 and the byte position of Ά is the main difference.
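
To make the byte-position difference concrete, here is a minimal
sketch using Python's bundled codecs. It assumes Python's "iso8859_7"
and "cp1253" tables are acceptable stand-ins for what the browsers
implement; the variable names are mine and purely illustrative.

```python
# Illustrative only: Python's codec tables stand in for the browsers' tables.
ALPHA_TONOS = "\u0386"  # GREEK CAPITAL LETTER ALPHA WITH TONOS

# The same character lands on a different byte in each encoding.
iso_byte = ALPHA_TONOS.encode("iso8859_7")  # ISO-8859-7: byte 0xB6
win_byte = ALPHA_TONOS.encode("cp1253")     # windows-1253: byte 0xA2

# Decoding with the "other" guess yields some other character instead.
iso_read_as_win = iso_byte.decode("cp1253")
win_read_as_iso = win_byte.decode("iso8859_7")

print(iso_byte.hex(), win_byte.hex())
print(repr(iso_read_as_win), repr(win_read_as_iso))
```

So an unlabeled page authored in one of the two encodings shows a
wrong character in place of this letter whenever the browser guesses
the other one.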

>> Also, a software-only benchmark of TLD-based guessing only works if
>> there already is a (near) perfect content-based detector, so there's
>> the risk of faulty results if the detector used for comparison is
>> faulty.
>
> Granted.  Even so, search engines have a pretty strong incentive to
> get encodings right, as it has a big impact on the accuracy of search
> results.

Fair point, but I've grown very skeptical after I learned that Gecko's
"Universal" detector was not at all universal.

>> I've re-read the sentence a few times and I think my sentence makes
>> sense: "Should [Israel] be [on the list of non-participating TLDs] in
>> case there's [Arabic encoding] legacy in addition to [Hebrew encoding]
>> legacy?"
>
> Ah, I see.  Yes, you are right; the sentence was a bit too elliptical
> for me.  The question, then, is whether there's a lot of Arabic content
> in .il addresses (as opposed to whether there are a lot of arabophones
> in Israel).

*Encoding-unlabeled* windows-1256 content, but yes.

On Fri, Dec 20, 2013 at 6:25 PM, Phillips, Addison <addison@lab126.com> wrote:
> UTF-8 detection based on byte sniffing is pretty accurate over very small runs of non-ASCII bytes. If there are no non-ASCII bytes in the first KB of plain text, you're no worse off than you were before.

No, you'd be worse off than before.

Consider an accidentally unlabeled UTF-8 site whose HTML template
fills the first kilobyte of each page with pure-ASCII scripts and
styles, *except* for <title>, and whose language is one of the
European languages where accented characters occur fairly often but
not in every word, so some page titles are pure ASCII and others
aren't. Introducing UTF-8 detection that only considers the first 1 KB
would make some pages on such a site work and others fail, seemingly
at random. It is a bad idea to introduce such a non-obvious reason for
varying behavior, since it would waste people's time on
wild-goose-chase debugging sessions.
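
A rough sketch of that failure mode, under stated assumptions:
sniff_prefix() is a hypothetical detector written for this example
(not any browser's actual code), the 1024-byte limit is the one from
the scenario above, and "windows-1252" is just a placeholder for
whatever the fallback would otherwise be. make_page() is likewise a
made-up stand-in for the site's template.

```python
import codecs

FALLBACK = "windows-1252"  # placeholder fallback, an assumption for this sketch


def sniff_prefix(data: bytes, limit: int = 1024) -> str:
    """Hypothetical detector that inspects only the first `limit` bytes."""
    prefix = data[:limit]
    if all(b < 0x80 for b in prefix):
        return FALLBACK  # no non-ASCII evidence within the prefix
    try:
        # final=False tolerates a multi-byte sequence cut off at the limit.
        codecs.getincrementaldecoder("utf-8")().decode(prefix, final=False)
        return "utf-8"
    except UnicodeDecodeError:
        return FALLBACK


def make_page(title: str) -> bytes:
    """Made-up template: title first, then over 1 KB of pure-ASCII style."""
    head = "<html><head><title>" + title + "</title>"
    style = "<style>" + "a{}" * 400 + "</style></head>"
    body = "<body><p>Hyv\u00e4\u00e4 p\u00e4iv\u00e4\u00e4</p></body></html>"
    return (head + style + body).encode("utf-8")


ascii_title_page = make_page("Etusivu")
accented_title_page = make_page("Hyv\u00e4\u00e4 joulua")

# Same site, same template, same real encoding, different verdicts.
print(sniff_prefix(ascii_title_page))
print(sniff_prefix(accented_title_page))
```

Both pages are UTF-8 and both have non-ASCII text in the body, but
only the one whose title happens to contain a non-ASCII character gets
detected. The other falls through to the fallback and shows mojibake
in the body.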

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/

Received on Saturday, 21 December 2013 12:17:09 UTC