
Re: [XHR2] Avoiding charset dependencies on user settings

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 30 Sep 2011 09:43:12 +0300
Message-ID: <CAJQvAucwSmx5v_neTZYo4o9g3xpx_FqKxvx_Ez30BRno6Annqw@mail.gmail.com>
To: public-webapps@w3.org
On Thu, Sep 29, 2011 at 11:27 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> Finally, XHR allows the programmer using XHR to override the MIME
>> type, including the charset parameter, so if the person adding new XHR
>> code can't change the encoding declarations on legacy data, (s)he can
>> override the UTF-8 last resort from JS (and a given repository of
>> legacy data pretty often has a self-consistent encoding that the XHR
>> programmer can discover ahead of time). I think requiring the person
>> adding XHR code to write that line is much better than adding more
>> locale and/or user setting-dependent behavior to the Web platform.
>
> This is certainly a good point, and is likely generally the easiest
> solution for someone rolling out an AJAX version of a new website
> rather than requiring webserver configuration changes. However, it
> still doesn't solve the case where a website uses different encodings
> for different documents as described above.
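
As a concrete illustration of the override mentioned above: in the
browser, xhr.overrideMimeType("text/plain; charset=windows-1252")
forces the response bytes through the decoder you name instead of
whatever the server declared (or failed to declare). A minimal sketch
of the effect on the bytes, using TextDecoder on fixed input rather
than a real XHR response:

```javascript
// Sketch of what overrideMimeType("…; charset=windows-1252") buys you:
// the same bytes come out differently depending on which decoder runs.
const legacyBytes = new Uint8Array([0xE9]); // "é" encoded as windows-1252

// Decoded with the charset the author knows the legacy repository uses:
console.log(new TextDecoder("windows-1252").decode(legacyBytes)); // "é"

// Decoded with the UTF-8 last resort: 0xE9 alone is invalid UTF-8,
// so it becomes the replacement character.
console.log(new TextDecoder("utf-8").decode(legacyBytes)); // "�"
```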

If we want to *really* address that problem, I think the right way to
do it in XHR would be to add a way to override the HTML last-resort
encoding, so that authors dealing with a content repository partially
migrated to UTF-8 can set the last resort to the legacy encoding they
know they have, instead of ending up overriding the whole HTTP
Content-Type for the UTF-8 content as well. (I'm assuming here that if
someone is migrating a site from a legacy encoding to UTF-8, the UTF-8
parts declare that they are UTF-8. Authors who migrate to UTF-8 but,
even after realizing that legacy encodings suck and UTF-8 rocks, are
*still* too clueless to *declare* that they use UTF-8 don't deserve
any further help from browsers, IMO.)
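
The decoding behavior being proposed can be sketched as a plain
function (the name decodeResponse and its shape are hypothetical, not
a real or proposed XHR API): the declared charset always wins, and the
configurable last resort applies only when no charset was declared.

```javascript
// Hypothetical sketch of the "override the last resort" idea:
// a declared charset is always honored; the last resort fills in
// only when the response declares nothing.
function decodeResponse(bytes, declaredCharset, lastResort = "utf-8") {
  const label = declaredCharset || lastResort;
  return new TextDecoder(label).decode(bytes);
}

const utf8Bytes = new Uint8Array([0xC3, 0xA5]); // "å" in UTF-8
const legacyBytes = new Uint8Array([0xE5]);     // "å" in windows-1252

// Migrated content that declares UTF-8 decodes correctly regardless
// of the last resort:
console.log(decodeResponse(utf8Bytes, "utf-8")); // "å"

// Undeclared legacy content falls back to the encoding the author
// knows the repository uses:
console.log(decodeResponse(legacyBytes, null, "windows-1252")); // "å"
```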

> I'm particularly keen to hear how this will affect locales which do
> not use ASCII by default. Most of the content I personally consume
> is written in English or Swedish, most of which is generally legible
> even if decoded using the wrong encoding. I'm under the impression
> that that is not the case for, for example, Chinese or Hindi
> documents. I think it would be sad if we went with any particular
> solution here without consulting people from those locales.

The old way of putting Hindi content on the Web relied on
intentionally misencoded downloadable fonts. From the browser's point
of view, such deep legacy text is Windows-1252. Hindi content that
works without misencoded fonts is UTF-8. So I think Hindi isn't
relevant to this thread.

Users in CJK and Cyrillic locales are the ones most hurt by authors
not declaring their encodings (well, actually, readers of CJK and
Cyrillic languages whose browsers are configured for other locales are
hurt *even* more), so I think it would be completely backwards for
browsers to complicate new features in order to enable authors in the
CJK and Cyrillic locales to deploy *new* features and *still* not
declare encodings. Instead, I think we should design new features to
make authors everywhere get their act together and declare their
encodings.
(Note that this position is much less extreme than the more
enlightened position e.g. HTML5 App Cache manifests take: Requiring
everyone to use UTF-8 for a new feature so that declarations aren't
needed.)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 30 September 2011 06:43:49 GMT
