Re: charset name matching rules from Anne van Kesteren on 2009-08-16 (public-html-comments@w3.org from August 2009)

From: Anne van Kesteren <annevk@opera.com>
Date: Sun, 16 Aug 2009 11:41:10 +0200
To: "Erik van der Poel" <erikv@google.com>, "Ian Hickson" <ian@hixie.ch>
Cc: public-html-comments@w3.org
Message-ID: <op.uyq06wnu64w2qv@annevk-t60>

On Sun, 16 Aug 2009 01:31:09 +0200, Erik van der Poel <erikv@google.com> wrote:
> I had another look at section 2.7, and it does have a pointer to the
> IANA charset registry, which also says "However, no distinction is
> made between use of upper and lower case letters." This is the only
> matching rule that we need. UTS22 is too lenient, and we all know what
> happens to the Web when browsers are too lenient. If the discussion on
> ietf-charsets@iana.org actually yields any more results, we may wish
> to consider adding them to HTML 5, but for now, I think having HTML 5
> refer to the IANA charset registry is sufficient.

So I made a few tests to figure out the matching rules, and case-insensitivity does not seem to be the only rule we need, though that depends a bit on which browser we want to follow. I ran the tests through Opera (O), Firefox (F), and Chromium (C), all on Ubuntu:

  http://dump.testsuite.org/2009/encoding-matching/

Ignoring the fact that C treats ISO-8859-9 as Windows-1254 (which the other browsers should probably copy), the results are as follows:

Ignores leading whitespace: O, F, C
Ignores whitespace within label: O
Ignores leading ): O, C
Ignores trailing @: O, C
Allows underscores rather than hyphens for this encoding: O, C
Ignores @ within label: O, C
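
To make the differences concrete, here is a rough sketch of the list above as a set of switches feeding one label-normalizing function. Everything named here (Leniency, normalize, the three profiles) is hypothetical and is not taken from any browser's code. The order in which the rules are applied is an assumption, and the underscore rule is modeled as a blanket replacement purely for illustration, even though the tests only cover one encoding.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Leniency:
    """Hypothetical switches, one per row of the results list."""
    leading_whitespace: bool = False
    inner_whitespace: bool = False
    leading_paren: bool = False         # a leading ")"
    trailing_at: bool = False           # a trailing "@"
    inner_at: bool = False              # "@" inside the label
    underscore_as_hyphen: bool = False  # only tested for one encoding


# Profiles transcribed from the results list (illustrative names).
OPERA = Leniency(True, True, True, True, True, True)
FIREFOX = Leniency(leading_whitespace=True)
CHROMIUM = Leniency(leading_whitespace=True, leading_paren=True,
                    trailing_at=True, inner_at=True,
                    underscore_as_hyphen=True)


def _drop_inside(s: str, pattern: str) -> str:
    """Remove matches of pattern from everything but the first and last char."""
    if len(s) < 3:
        return s
    return s[0] + re.sub(pattern, "", s[1:-1]) + s[-1]


def normalize(label: str, rules: Leniency) -> str:
    """Reduce a charset label to a lowercase comparison key.

    The order of the steps is an assumption, not observed behavior.
    """
    s = label
    if rules.leading_whitespace:
        s = s.lstrip(" \t\r\n\f")
    if rules.leading_paren:
        s = s.lstrip(")")
    if rules.trailing_at:
        s = s.rstrip("@")
    if rules.inner_whitespace:
        s = _drop_inside(s, r"\s")
    if rules.inner_at:
        s = _drop_inside(s, "@")
    if rules.underscore_as_hyphen:
        s = s.replace("_", "-")
    # Case-insensitivity is the one rule the IANA registry itself states.
    return s.lower()
```

Under these assumptions, normalize(")ISO-8859-9@", CHROMIUM) and normalize("ISO-88 59-9", OPERA) both reduce to the same key as plain ISO-8859-9, while the FIREFOX profile only strips leading whitespace before lowercasing.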

Now, I'm positively certain that EUC-JP should not be recognized as EUC_JP, and quite certain that C does not recognize it as such, so I'm guessing ISO_8859_9 is an alias C supports rather than the result of a general underscore rule. Documentation on that would be good.
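
The distinction between a per-encoding alias and a general underscore rule can be sketched as a plain lookup. The table below is hypothetical and encodes only the guess made above. It is not Chromium's actual alias list, and resolve is an illustrative name.

```python
from typing import Optional

# Hypothetical alias table: it reflects only the guess in this message.
ALIASES = {
    "iso-8859-9": "ISO-8859-9",
    "iso_8859_9": "ISO-8859-9",  # the guessed per-encoding alias
    "euc-jp": "EUC-JP",          # deliberately no "euc_jp" entry
}


def resolve(label: str) -> Optional[str]:
    """Case-insensitive exact lookup; underscores are never rewritten."""
    return ALIASES.get(label.lower())
```

With a table like this, resolve("ISO_8859_9") finds an encoding while resolve("EUC_JP") returns None, which is the behavior a general underscore-to-hyphen rule could not produce.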

-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Sunday, 16 August 2009 09:42:00 UTC