charset name matching rules from Erik van der Poel on 2009-08-15 (public-html-comments@w3.org from August 2009)

From: Erik van der Poel <erikv@google.com>
Date: Sat, 15 Aug 2009 08:42:17 -0700
To: public-html-comments@w3.org
Message-ID: <c07a32650908150842j499634abh5cf8e7054925f808@mail.gmail.com>

In section 2.7 of HTML 5, it says:

> When comparing a string specifying a character encoding with the name
> or alias of a character encoding to determine if they are equal, user
> agents must use the Charset Alias Matching rules defined in Unicode
> Technical Standard #22. [UTS22]
>
> For instance, "GB_2312-80" and "g.b.2312(80)" are considered equivalent names.

I think this should be removed, since none of the major browsers do
this, and it is too lenient.
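
To make the lenience concrete, here is a minimal Python sketch of the UTS #22 alias matching steps as they are usually summarized: drop everything except ASCII letters and digits, fold to lowercase, then delete each "0" that is not preceded by a digit. The function name uts22_normalize is hypothetical, and this is one reading of the rules, not a reference implementation.

```python
import re


def uts22_normalize(name: str) -> str:
    """Reduce a charset name to a loose-matching key (assumed UTS #22 steps)."""
    # Step 1: delete every character except ASCII letters and digits.
    kept = re.sub(r"[^A-Za-z0-9]", "", name)
    # Step 2: map uppercase letters to lowercase.
    kept = kept.lower()
    # Step 3: scanning left to right, delete each "0" that is not
    # preceded by a digit in the output built so far.
    out = []
    prev_is_digit = False
    for ch in kept:
        if ch == "0" and not prev_is_digit:
            continue  # dropped zero; the preceding character is unchanged
        out.append(ch)
        prev_is_digit = ch.isdigit()
    return "".join(out)


# Both spellings from the quoted HTML 5 text reduce to the same key.
print(uts22_normalize("GB_2312-80"))
print(uts22_normalize("g.b.2312(80)"))
```

Under this sketch, any punctuation at all between the letters and digits is ignored, which is why a string like "g.b.2312(80)" is accepted as a name for the same encoding.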

The general approach should be: As lenient as the major browsers, but
not more lenient. Lenience leads to a proliferation of garbage.

Of course, the question is what to replace the above text with. There
is a discussion on the ietf-charsets@iana.org list about gathering the
current lists of charsets and aliases from the browsers. Hopefully,
that discussion will result in something that can be published in HTML
5.

How about putting a placeholder in the current HTML 5 draft? I
consider UTS22 to be harmful, so it should be removed from HTML 5
ASAP.

Erik

Received on Saturday, 15 August 2009 15:42:56 UTC