Re: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization

Henri Sivonen scripsit:

> Chrome seems to have content-based detection for a broader range of
> encodings. (Why?)

Presumably because they believe it improves the user experience;
I don't know for sure.  What I do know is that Google search attempts to
convert every page it spiders to UTF-8, and that they rely on encoding
detection rather than (or in addition to) declared encodings.  In
particular, certain declared encodings, such as US-ASCII, ISO 8859-1, and
Windows-1252, are considered to provide *no* encoding information.
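
Presumably something along these lines (a toy sketch in Python; the
chardet library here is just a stand-in for whatever detector Google
actually runs, and the set of distrusted labels is my own reading of
the above, not anything official):

    import chardet

    # Declared labels treated as carrying no information about the real
    # encoding, since pages labelled this way so often contain something else.
    UNINFORMATIVE = {"us-ascii", "ascii", "iso-8859-1", "latin-1", "windows-1252"}

    def to_utf8(raw, declared=None):
        """Decode fetched bytes, trusting the declared encoding only when
        it is informative; otherwise fall back to content-based detection."""
        if declared and declared.lower() not in UNINFORMATIVE:
            try:
                return raw.decode(declared)
            except (LookupError, UnicodeDecodeError):
                pass  # bogus or wrong label: fall through to detection
        guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1255', ...}
        return raw.decode(guess["encoding"] or "windows-1252", errors="replace")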

Before modifying existing encoding-detection schemes, I would ask
someone at Google (or another company that spiders the Web extensively)
to find out just how much superior the revised scheme would be when
applied to the existing Web, rather than trusting to _a priori_
arguments.

>  * The domain name is a country TLD whose legacy encoding affiliation
> I couldn't figure out: .ba, .cy, .my. (Should .il be here in case
> there's windows-1256 legacy in addition to windows-1255 legacy?)

Windows-1256 is Arabic and Windows-1255 is Hebrew, so I assume you meant
the other way around.
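
For concreteness, the TLD-guessing idea in the Subject line might look
roughly like this; the table below is my own illustrative guesswork,
not Henri's actual proposal:

    # Toy mapping from country TLD to a plausible legacy fallback encoding.
    TLD_FALLBACK = {
        "ru": "windows-1251",  # Cyrillic
        "gr": "windows-1253",  # Greek
        "tr": "windows-1254",  # Turkish
        "il": "windows-1255",  # Hebrew; whether windows-1256 (Arabic) legacy
                               # also matters is exactly the open question
        # .ba, .cy, .my: no single obvious legacy encoding, so no entry
    }

    def fallback_for(host, default="windows-1252"):
        """Pick a fallback encoding for an unlabeled page from its host's TLD."""
        tld = host.rsplit(".", 1)[-1].lower()
        return TLD_FALLBACK.get(tld, default)

    # fallback_for("www.example.co.il") -> "windows-1255"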

-- 
John Cowan  cowan@ccil.org  http://ccil.org/~cowan
If he has seen farther than others,
        it is because he is standing on a stack of dwarves.
                --Mike Champion, describing Tim Berners-Lee (adapted)

Received on Thursday, 19 December 2013 19:15:04 UTC