Re: Re: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization from Henri Sivonen on 2014-01-02 (www-international@w3.org from January to March 2014)

From: Henri Sivonen <hsivonen@hsivonen.fi>
Date: Thu, 2 Jan 2014 11:48:37 +0200
To: "www-international@w3.org" <www-international@w3.org>
Message-ID: <CANXqsRLGARkObfE33vWp5dQM4CcpWW=4XwiSpGRFQroohsA8kQ@mail.gmail.com>
On Mon, Dec 23, 2013 at 1:56 AM, Leif Halvard Silli
<xn--mlform-iua@xn--mlform-iua.no> wrote:
> But there is also a chance - especially if the gotcha becomes a
> frequent issue - that authors would as well discover how to *trigger*
> UTF-8 detection.
...
> Why not simply use a BOM ...

Right. If you want to use an early UTF-8 byte sequence to trigger
UTF-8 treatment, use the BOM as your early UTF-8 byte sequence. It
already works cross-browser and cross-locale. (Though, granted, of
possible UTF-8 byte sequences, it's the most brittle one in terms of
text editors maybe silently removing it.)
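
To illustrate the BOM point, here is a minimal Python sketch, not taken from any browser code. The function names and the windows-1252 default are assumptions made for the example; the only fact relied on is that the UTF-8 BOM is the byte sequence EF BB BF at the very start of the stream.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte stream begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_unlabeled(data: bytes, fallback: str = "windows-1252") -> str:
    """Decode unlabeled bytes: a BOM wins, otherwise use the guessed fallback.

    `fallback` stands in for whatever legacy encoding the guessing step chose;
    the default here is only an illustrative assumption.
    """
    if starts_with_utf8_bom(data):
        # "utf-8-sig" decodes as UTF-8 and strips the leading BOM.
        return data.decode("utf-8-sig")
    return data.decode(fallback, errors="replace")
```

Without the BOM, the same UTF-8 bytes are decoded with the legacy fallback and come out as mojibake, which is the brittleness mentioned above if an editor silently drops the BOM.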

> Why does Europe’s largest social network, www.vk.com,
> use Windows-1251 - even for Asian scripts?

From personal experience from over a decade ago, if you face a legacy
code base written without much thought about encodings and a database
containing bytes in the local legacy encoding plus numeric character
references submitted by browsers and not sanitized in any way, it may
be less disruptive to continuous operations to add code that
formalizes the use of numeric character references in the database
than to migrate everything to UTF-8. Of course, in the long term,
you'd have been better off doing the UTF-8 migration up front.
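
As a rough illustration of what "formalizing" numeric character references can look like, here is a Python sketch. It is not a description of what vk.com or any particular site does; the function names are hypothetical, and it assumes the stored text contains no literal ampersands that must survive unescaping.

```python
import html


def to_legacy_bytes(text: str, encoding: str = "windows-1251") -> bytes:
    """Encode text for a legacy-encoded store.

    Characters the legacy encoding cannot represent are written as decimal
    numeric character references (for example &#26085;).
    """
    return text.encode(encoding, errors="xmlcharrefreplace")


def from_legacy_bytes(data: bytes, encoding: str = "windows-1251") -> str:
    """Decode stored bytes and resolve character references.

    Note: html.unescape also resolves named references such as &amp;, so a
    real system would need a rule for literal ampersands in stored text.
    """
    return html.unescape(data.decode(encoding))
```

With this, Cyrillic text is stored as single windows-1251 bytes while anything outside that repertoire, such as CJK, is stored as ASCII references, which is how a windows-1251 page can still show Asian scripts.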

On Mon, Dec 23, 2013 at 2:00 AM, Leif Halvard Silli
<xn--mlform-iua@xn--mlform-iua.no> wrote:
> Henri Sivonen, Thu, 19 Dec 2013 16:29:37 +0200:
>>
>> The list of TLDs that participate in the guessing and are not
>> windows-1252-affiliated is currently:
>>
> https://bugzilla.mozilla.org/attachment.cgi?id=8341644&action=diff#a/dom/encoding/domainsfallbacks.properties_sec2
>>
>> UTF-8 is never guessed, since it is not a legacy encoding.
>
> But not all domains are “legacy domains” either. Consider, from the
> above list, line 139 and 140:
>
>         139 ru=windows-1251
>         140 xn--p1ai=windows-1251
>
> where xn--p1ai refers to the RF-domain - .рф. Is there really no
> correlation between UTF-8 based domain names and use of the UTF-8
> encoding ... ?

xn--p1ai isn't a UTF-8 domain name. It's a Punycode domain name. :-)
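
For reference, the relationship between the two forms can be checked with Python's built-in "idna" codec. This is only a sketch, and the helper names are made up for the example.

```python
def to_ace(label: str) -> bytes:
    """Convert a Unicode domain label to its ASCII-compatible (xn--) form."""
    return label.encode("idna")


def from_ace(ace: bytes) -> str:
    """Convert an ASCII-compatible (xn--) label back to Unicode."""
    return ace.decode("idna")


# The label on the wire is plain ASCII, whatever encoding the page uses.
rf_ace = to_ace("рф")
```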

Anyway, the feature avoids guessing outcomes that aren't already
possible under the current localization-based guessing regime. That
means never guessing UTF-8.

Would you rather guess windows-1252 for xn--p1ai?
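
A simplified sketch of the guessing order discussed here, in Python. The two table entries are the ones quoted above; everything else (the function name, the table name, passing the localization-based fallback as a parameter, and assuming the host name is already in its ASCII form) is an assumption for illustration and not the actual Gecko code.

```python
# Only the two entries quoted in this thread; the real list is longer.
TLD_FALLBACKS = {
    "ru": "windows-1251",
    "xn--p1ai": "windows-1251",
}


def guess_fallback(hostname: str, locale_fallback: str) -> str:
    """Pick a legacy fallback encoding for an unlabeled page.

    Try the top-level domain first; if it is not in the table, use the
    fallback derived from the browser localization. UTF-8 is never in the
    table, so the TLD step can never produce it.
    """
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return TLD_FALLBACKS.get(tld, locale_fallback)
```

For example, an en-US build whose localization-based fallback is windows-1252 would get windows-1251 for both example.ru and example.xn--p1ai under this sketch.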

On Mon, Dec 23, 2013 at 11:17 AM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> Overall, I agree with the question by others of what the expected "ROI" on
> this is. With UTF-8 being more and more popular for Web sites, the return
> for changing fallback encodings is definitely diminishing.

The return is definitely diminishing, but the issue of undeclared
legacy encodings hasn't diminished far enough to make no one ever ask
for a character encoding menu. I think it is terribly sad that one was
already added for Firefox for Android. I think TLD-based guessing will
have been worthwhile if it successfully prevents the addition of a
character encoding override menu to the browser app on Firefox OS.
(Which in practice means making people who read Chinese and are
involved in Firefox OS feature triage not experience the need to
override the encoding even when using an en-US build to read unlabeled
legacy-encoded Simplified or Traditional Chinese sites or when using
zh-TW builds to read unlabeled legacy-encoded Simplified Chinese
sites.)

I think the mission would be completely accomplished if this feature
allowed for the removal of the character encoding menu from all
Gecko-based products down the road. The character encoding override
menu is not only bad UI but it introduces a lot of complexity to the
Web engine if you want to make the Web engine secure even when
operated by users who don't know about the security properties of
character encodings. In 2012 and 2013, I spent a significant amount of
time making Gecko secure against the sort of XSS that involves
tricking the user into using the character encoding menu and making the
character encoding menu UI less terrible in terms of usability. One
might take the stance that the bulk of that work is already sunk cost
and there's no need to get rid of the menu after it's *almost* been
fixed in the browser *I* work on, but I've instead taken the stance
that neither I nor anyone else should have to deal with the
complication arising from the character encoding menu in the future in
Firefox or in another code base.

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/
Received on Thursday, 2 January 2014 09:49:05 UTC