
Re: charset name matching rules

From: Erik van der Poel <erikv@google.com>
Date: Mon, 17 Aug 2009 07:56:54 -0700
Message-ID: <c07a32650908170756o482b3746jd57d69963d971fbb@mail.gmail.com>
To: Anne van Kesteren <annevk@opera.com>
Cc: Ian Hickson <ian@hixie.ch>, public-html-comments@w3.org
On Sun, Aug 16, 2009 at 2:41 AM, Anne van Kesteren <annevk@opera.com> wrote:
> On Sun, 16 Aug 2009 01:31:09 +0200, Erik van der Poel <erikv@google.com> wrote:
>> I had another look at section 2.7, and it does have a pointer to the
>> IANA charset registry, which also says "However, no distinction is
>> made between use of upper and lower case letters." This is the only
>> matching rule that we need. UTS22 is too lenient, and we all know what
>> happens to the Web when browsers are too lenient. If the discussion on
>> ietf-charsets@iana.org actually yields any more results, we may wish
>> to consider adding them to HTML 5, but for now, I think having HTML 5
>> refer to the IANA charset registry is sufficient.
>
> So I made a few tests to figure out the matching rules, and
> case-insensitivity does not seem to be the only rule we need, though it
> depends a bit on which browser we want to follow. I ran the tests
> through Opera (O), Firefox (F), and Chromium (C) (all on Ubuntu):

It would also be interesting to find out what MSIE and Firefox on
Windows do, and what Safari on Mac does.

>  http://dump.testsuite.org/2009/encoding-matching/
>
> Ignoring the fact that C treats ISO-8859-9 as Windows-1254 (which the
> other browsers should probably copy), the results are as follows:

I agree that ISO-8859-9 should be treated as its "superset" Windows-1254.
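To make the "superset" point concrete, here is a minimal sketch in Python. The OVERRIDES table and both function names are hypothetical and only illustrate the idea; this is not how any of the browsers above actually implement it. Byte 0x80 is a C1 control character in ISO-8859-9 but the euro sign in Windows-1254, so decoding with the superset only changes bytes that would otherwise be control characters.

```python
# Hypothetical override table: labels that get decoded with a superset
# encoding instead of the encoding the label literally names.
OVERRIDES = {"iso-8859-9": "windows-1254"}


def effective_encoding(label: str) -> str:
    """Return the encoding name to decode with for a given charset label."""
    key = label.lower()  # labels are matched without regard to case
    return OVERRIDES.get(key, key)


def decode(data: bytes, label: str) -> str:
    """Decode bytes using the (possibly overridden) encoding for the label."""
    return data.decode(effective_encoding(label))


# 0x80 is a C1 control in ISO-8859-9 but the euro sign in Windows-1254.
print(repr(b"\x80".decode("iso-8859-9")))
print(repr(decode(b"\x80", "ISO-8859-9")))
```

Bytes outside the 0x80-0x9F range decode the same way under both encodings, which is why the override looks safe for pages that really are ISO-8859-9.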

> Ignores leading whitespace: O, F, C

Interesting. If MSIE and Firefox on Windows do this too, it would
probably be a good idea to add this rule.

> Ignores whitespace within label: O
> Ignores leading ): O, C
> Ignores trailing @: O, C
> Allows underscores rather than hyphens for this encoding: O, C
> Ignores @ within label: O, C

If MSIE and Firefox on Windows do not do these, I think we should
consider omitting these rules.
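As a rough sketch of what the stricter matching would look like, assuming we keep only case-insensitivity plus the leading-whitespace rule: the label set, the choice of whitespace characters, and the function name below are all illustrative assumptions, not anything taken from the spec or the registry.

```python
from typing import Optional

# Illustrative subset of registered labels, stored in lower case.
KNOWN_LABELS = {"utf-8", "iso-8859-1", "iso-8859-9", "euc-jp"}

# Assumed set of whitespace characters to skip at the start of a label.
LEADING_WHITESPACE = " \t\n\f\r"


def match_label(raw: str) -> Optional[str]:
    """Match a charset label, ignoring only case and leading whitespace."""
    candidate = raw.lstrip(LEADING_WHITESPACE).lower()
    return candidate if candidate in KNOWN_LABELS else None


print(match_label("  UTF-8"))  # leading whitespace and case are ignored
print(match_label("EUC_JP"))   # underscore variant is not accepted
print(match_label("utf-8@"))   # trailing @ is not accepted
```

Under these assumptions, none of the Opera/Chromium-only leniencies (whitespace within the label, a leading ")", "@" anywhere, underscores for hyphens) would produce a match.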

> Now I'm positively certain that EUC-JP should not be recognized as
> EUC_JP, and quite certain that C does not recognize it as such, so I'm
> guessing ISO_8859_9 is an alias C supports, but documentation on that
> would be good.

Neither MSIE nor Firefox supports EUC_JP, so I don't know what
Chromium is hoping to accomplish by recognizing it. EUC-JP is used
much more often than EUC_JP on the Web. (About 500 times more often.)
In fact, UFT-8 and ISO-8559-1 occur more often than EUC_JP. (Look
carefully -- those are both misspellings.)

Erik
Received on Monday, 17 August 2009 14:57:35 GMT
