- From: Henri Sivonen <hsivonen@hsivonen.fi>
- Date: Fri, 20 Dec 2013 16:15:40 +0200
- To: www-international@w3.org
On Thu, Dec 19, 2013 at 7:37 PM, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> wrote:
> Henri Sivonen, Thu, 19 Dec 2013 16:29:37 +0200:
>
>> Chrome seems to have content-based detection for a broader
>> range of encodings. (Why?)
>
> Do you mean Google’s “automatic” option found in its encoding (sub)
> menu? By default, that option isn’t enabled.

I mean the code seen at
https://mxr.mozilla.org/chromium/source/src/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp#40
It is unclear to me under which conditions this code runs.

> (Btw, in Blink-based Opera, there is no (sub)menu for selecting
> the encoding at all.)

Indeed. Unfortunately, that might change:
https://twitter.com/odinho/status/408682238247972864

> This function then supposedly steps into the encoding detection
> algorithm at the same places where the locale based fallback currently
> steps in.

Right. It adds a step right before that step. The guessing continues to the localization-based guess if there's no TLD (no host or an IP-address host) or the TLD is one that's listed as not participating in this scheme (e.g. .com). (There's a rough sketch of this lookup further below.)

> As such, it isn’t a given that this method would *replace*
> detection. However, it sounds like you consider it as an alternative to
> detection.

I don't expect it to serve as an alternative to the detection that ships enabled by default in Firefox. (Firefox ships with locale-specific content-based detectors enabled by default for the Japanese, Russian and Ukrainian localizations. These detectors are designed to choose between various Japanese, Russian or Ukrainian encodings, respectively. In other locales, content-based detection is off by default. I suspect we might even be able to remove the Russian and Ukrainian detectors.)

However, the feature is inspired by the disabled-by-default detectors in Firefox. Specifically, it's inspired by this case: Suppose you are in Hong Kong or Taiwan and run the Traditional Chinese localization of Firefox. This means that the fallback encoding (currently, i.e. without the feature I'm proposing) is Big5, which makes sense for unlabeled legacy Traditional Chinese content. However, this fallback fails if you want to read Simplified Chinese legacy content published in the PRC. To the extent that such content is published under the .cn TLD, you'd have a better experience if the .cn TLD made your browser apply the guess that the Simplified Chinese localization currently applies. This proposal is a generalization of that idea: it makes the guessing apply to more TLDs than just .cn, .hk and .tw.

> Negatives:
>
> * When I filed a bug to get Webkit to do some UTF-8 sniffing, I was
> told, as negative thing, that users would then rely on it (instead of
> label their code as UTF-8) and that this could decrease
> interoperability.

This proposal never ends up guessing UTF-8, so it never gives anyone a reason not to declare UTF-8.

> It seems like this feature could potentially have the
> same effect.

The guessing outcomes that can arise from this proposal can already arise from existing localizations. So if a site is broken by this change, it would already have appeared broken in the exact same way in some existing browser localization. In that sense, whatever interoperability problem this might cause is a pre-existing problem. But you are right that if browser A implements the proposal and browser B does not, the localization of A for language L and the localization of B for the *same* language L could render a given site differently.
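To make the mechanics concrete, here is a minimal sketch (Python, not the actual patch) of the step I have in mind. The table contents, the set of non-participating TLDs and the function name are illustrative only; the real mapping would come from the proposal's TLD list, and the locale fallback from the existing localization-based step.

```python
# Illustrative sketch only; not the proposed implementation.

# TLDs listed as not participating in TLD-based guessing (hypothetical excerpt).
NON_PARTICIPATING_TLDS = {"com", "org", "net"}

# Hypothetical excerpt of a TLD-to-fallback-encoding mapping.
TLD_FALLBACKS = {
    "cn": "gbk",           # Simplified Chinese legacy content
    "tw": "big5",          # Traditional Chinese legacy content
    "hk": "big5",
    "ru": "windows-1251",  # Cyrillic legacy content
    "jp": "shift_jis",
}

def guess_fallback_encoding(host, locale_fallback):
    """Guess the fallback encoding for an unlabeled page.

    host:            the host part of the URL, or None.
    locale_fallback: what the browser localization would currently guess
                     (the existing step).
    """
    if not host or host.replace(".", "").isdigit():
        # No TLD to go by (no host, or a rough check for an IPv4 host):
        # fall through to the localization-based guess.
        return locale_fallback
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in NON_PARTICIPATING_TLDS:
        return locale_fallback
    # Use the TLD-based guess if we have one; otherwise fall through.
    return TLD_FALLBACKS.get(tld, locale_fallback)

# The use case above: a Traditional Chinese localization (locale fallback
# Big5) loading unlabeled legacy content under .cn.
print(guess_fallback_encoding("www.example.cn", "big5"))   # gbk
print(guess_fallback_encoding("www.example.com", "big5"))  # big5 (unchanged)
```

Note that for a non-participating TLD such as .com the outcome is whatever the localization would have guessed anyway, which is also relevant to the foo.com/foo.ru point below.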
> * A site might have two domain names, such as foo.com and foo.ru, for
> which one would get different results.

True, but it strictly makes the site work more often. (I assume you mean a Russian-language site that uses unlabeled windows-1251.) Before this change, the site under both domains works in browser localizations for languages that use the Cyrillic script and, therefore, use windows-1251 as the fallback encoding, and fails in other browser localizations under both domains. With this proposal, the situation in the .com case is unchanged, but the .ru case starts working in all browser localizations.

> * It might be not be very useful. It could perhaps have been more
> useful if it was introduced earlier.

We are definitely into the diminishing-returns department already. If you look at the Firefox sessions in which the Character Encoding menu is used relative to all Firefox sessions, you might as well round it to complete non-use:
http://telemetry.mozilla.org/#release/26/CHARSET_OVERRIDE_USED

However, if you break the data down by locale (https://bug906032.bugzilla.mozilla.org/attachment.cgi?id=796536), we learn the following:

* The fallbacks still matter, because localizations that chose an inappropriate fallback (UTF-8, which isn't a legacy encoding; fixed already) top the list.

* If it weren't for those localizations, Traditional Chinese (zh-TW; Firefox uses that code instead of zh-Hant for historical reasons) would top the list, which suggests that the use case inspiring this proposal (see above) is still relevant.

* Curiously, Korean is rather high on the list even though Korean has a single non-windows-1252 legacy encoding, so, all things being equal, it shouldn't rank higher than e.g. Hebrew; apparently all things aren't equal.

Because the CJK locales stand out, the time is not yet ripe for getting rid of the Character Encoding menu. I'm hoping that guessing based on the CJK-related TLDs in particular would reduce the need for users of the CJK locales (zh-TW especially) to resort to the Character Encoding menu, so that down the road we could get rid of the menu and the cost of dealing with all the edge cases associated with having it. OTOH, I want to avoid the Character Encoding menu getting introduced to Firefox OS.

> At the moment, after 30 minutes on
> Google trying to locate some Cyrillic "Latinifications (such as а
> Windows "я"/"Я" represented as "ÿ"/"ß"), I did not locate a single page
> where this was the case but where the page was *not* labeled as UTF-8 or
> some other mishap of that kind. (But this might of course be because
> Google act as a localized browser when it finds "ÿ"/"ß" in unlabeled
> page under .ru.

As John Cowan noted, Google Search is likely papering over problems of this nature, so you can't locate the problems using Google Search.

> Are there many examples, in real life (nowadays), where different
> localization causes different rendering?

The languages that are associated with a fallback other than windows-1252 when used as a Firefox UI language are listed in
https://mxr.mozilla.org/mozilla-central/source/dom/encoding/localesfallbacks.properties
Since the list is non-empty, there evidently is opportunity for different localizations to render unlabeled sites differently.

> The goals sound noble enough.
>
>> # Why is this better that analyzing the content itself?
>
> Browsers would still be free to do that - e.g. for detecting UTF-8?
Except maybe for file: URLs (where we could make the UA look at the whole file before displaying anything; there's a sketch of that further below), I think detecting UTF-8 is a bad idea for the reasons briefly mentioned in the message that started this thread. However, I am not advocating a ban on all content-based sniffing, because I currently believe that we won't be able to get rid of sniffing for the various Japanese encodings. Other than sniffing for Japanese encodings, I'd be in favor of banning content-based sniffing (unless research shows that we also really need it for the Russian and Ukrainian cases). But that's outside the scope of this proposal.

On Thu, Dec 19, 2013 at 9:14 PM, John Cowan <cowan@mercury.ccil.org> wrote:
> Henri Sivonen scripsit:
>
>> Chrome seems to have content-based detection for a broader range of
>> encodings. (Why?)
>
> Presumably because they believe it improves the user experience;
> I don't know for sure.

Presumably, of course. I was hoping that someone involved in introducing this difference between Chrome and Safari would elaborate on the motivation in more detail.

> What I do know is that Google search attempts to
> convert every page it spiders to UTF-8, and that they rely on encoding
> detection rather than (or in addition to) declared encodings. In
> particular, certain declared encodings such as US-ASCII, 8859-1, and
> Windows-1252, are considered to provide *no* encoding information.

Indeed, considering Firefox telemetry data globally, it is more common to use the Character Encoding menu to override a declared encoding than to override a bad fallback guess. (Sadly, it's also more common to override a previous override than to override a bad fallback guess, which means that the users who do use the menu do it by trial and error.) See http://telemetry.mozilla.org/#release/26/CHARSET_OVERRIDE_SITUATION for the numbers and https://mxr.mozilla.org/mozilla-central/source/docshell/base/nsDocShell.cpp#1992 for what the buckets mean.

Still, I think browsers should not start disrespecting site-supplied encoding labels. Browsers disrespecting site-supplied MIME types turned into enough of a mess. (Though the blame there falls on server developers, specifically the Apache developers who refused for years to fix their bug. In both cases, I strongly disapprove of default server configurations supplying values, because when the values come from server defaults, they may have no relation to the actual content served.)

> Before modifying existing encoding-detection schemes, I would ask
> someone at Google (or another company that spiders the Web extensively)
> to find out just how much superior the revised scheme would be when
> applied to the existing Web, rather than trusting to _a priori_
> arguments.

This proposal doesn't change anything except the step that is currently based on guessing from the browser UI language. The browser UI language is not visible from Google's index, so the situation before this proposal is not something that can be determined from Google's index. I was thinking of measuring success by comparing Firefox's Character Encoding menu usage telemetry data in the last release without this feature and the first release with this feature. Also, a software-only benchmark of TLD-based guessing only works if there already is a (near) perfect content-based detector, so there's a risk of faulty results if the detector used for comparison is faulty.
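Circling back to the file: URL aside above, here is a minimal sketch, assuming the UA is willing to read the whole file before displaying anything, of what a UTF-8-or-fallback decision could look like. It is not a proposal for any particular engine, and the default fallback label is just a placeholder.

```python
# Sketch only: decide between UTF-8 and a fallback encoding for a
# file: URL by validating the *entire* file, which is only feasible
# because there is no incremental network loading to worry about.

def pick_encoding_for_file(path, fallback="windows-1252"):
    with open(path, "rb") as f:
        data = f.read()  # the whole file, before displaying anything
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8 as a whole; use the fallback guess instead.
        return fallback
```

The same trick doesn't carry over to http: content, because there the decision has to be made before the whole byte stream has arrived.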
However, it would be awesome if someone with access to global Web crawl data produced a sample of unlabeled pages under each non-obvious TLD (no point in doing this for obviously windows-1252-affiliated TLDs like .fi) to allow human inspection of a small sample of pages to validate the mapping.

>> * The domain name is a country TLD whose legacy encoding affiliation
>> I couldn't figure out: .ba, .cy, .my. (Should .il be here in case
>> there's windows-1256 legacy in addition to windows-1255 legacy?)
>
> 1256 is Arabic, 1255 is Hebrew,

Right.

> so I assume you meant the other way around.

I've re-read the sentence a few times and I think my sentence makes sense: "Should [Israel] be [on the list of non-participating TLDs] in case there's [Arabic encoding] legacy in addition to [Hebrew encoding] legacy?" I think the question is reasonable given that 18% of the population (according to Wikipedia) report Arabic as their mother tongue and all major browsers have Arabic localizations.

--
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/