Re: charset name matching rules

Here, I'll do some "research" :-) The following is from UTS22:

1.4 Charset Alias Matching

Names and aliases of charsets are often spelled with small variations.
To recognize accidental but unambiguous misspellings and avoid adding
each possible variation to a list of recognized names, it is customary
to match names case-insensitively and to ignore some punctuation. For
best results, names should be compared after applying the following
transformations:

Delete all characters except a-z, A-Z, and 0-9.
Map uppercase A-Z to the corresponding lowercase a-z.
From left to right, delete each 0 that is not preceded by a digit.
For example, the following names should match: "UTF-8", "utf8",
"u.t.f-008", but not "utf-80" or "ut8".

Note: These rules are in place because in practice implementations are
faced with many gratuitous variations in the use and omission of
punctuation. There are a small number of IANA names for different
charsets that match under these rules, but they appear to be rarely
used, obscure charsets: "iso-ir-9-1" and "iso-ir-9-2" match
"iso-ir-91" and "iso-ir-92", respectively. (There are also names in
the IANA charset registry that violate the registry's own name syntax
rules.)

-- End of excerpt from UTS22

Clearly, they recommend that you ignore not only the underscore, but
many other characters too. This is so different from current browser
behavior that I am surprised that it is even being considered.
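
To make the difference concrete, here is a rough Python sketch of the
three UTS22 steps quoted above, next to the case-insensitive-only
comparison described in the IANA registry. The function names are made
up for illustration and come from neither document, and I am assuming
that "not preceded by a digit" refers to the name as it stands after
earlier deletions (otherwise "u.t.f-008" would not reduce to "utf8").

```python
import string

_ASCII_ALNUM = set(string.ascii_letters + string.digits)


def uts22_normalize(name: str) -> str:
    """Apply the three UTS22 section 1.4 steps in a single pass (sketch)."""
    out = []
    for ch in name:
        # Step 1: delete everything except a-z, A-Z and 0-9.
        if ch not in _ASCII_ALNUM:
            continue
        # Step 2: map uppercase to lowercase.
        ch = ch.lower()
        # Step 3: drop a 0 unless the character kept just before it is a digit.
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


def uts22_match(a: str, b: str) -> bool:
    """Loose matching as recommended by UTS22."""
    return uts22_normalize(a) == uts22_normalize(b)


def case_insensitive_match(a: str, b: str) -> bool:
    """The only rule stated in the IANA registry: ignore letter case."""
    return a.lower() == b.lower()
```

With these definitions, uts22_match("UTF-8", "u.t.f-008") is true while
case_insensitive_match("UTF-8", "utf8") is false, and
uts22_match("iso-ir-9-1", "iso-ir-91") is true even though those are
different charsets.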

I am not saying that the IANA charset registry is perfect, or that the
charset registration process flows smoothly. There are many confusing
entries in that registry.

But I don't think the right response is to give up and accept charset
names with whatever punctuation people happen to use.

The ietf-charsets group is currently talking about gathering the
browsers' lists of charsets, aliases and supersets (e.g. windows-1252
is the superset used instead of iso-8859-1). I believe we will bump
into several differences between the browsers, but I also believe that
the differences become less and less interesting as you go down the
list of popular charsets. So my suggestion is that we initially focus
on commonly used encodings. Then we can add more info to the HTML 5
spec (or a spin-off spec, if appropriate) over time.
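
As a sketch of what such a list might look like once gathered, here is
a tiny Python table that maps a label to the encoding actually used,
with lookup that ignores case and nothing else. Only the iso-8859-1 to
windows-1252 entry comes from the discussion above; the other entries
and the names LABELS and resolve_label are placeholders, not any
browser's real table.

```python
from typing import Optional

# Hypothetical, deliberately small table: lowercase label -> encoding used.
# "latin1" is an IANA alias of ISO-8859-1, shown here as an example alias.
LABELS = {
    "iso-8859-1": "windows-1252",   # superset used instead of iso-8859-1
    "latin1": "windows-1252",       # alias, follows the same override
    "windows-1252": "windows-1252",
    "utf-8": "utf-8",
}


def resolve_label(label: str) -> Optional[str]:
    """Look up a charset label, ignoring letter case only.

    Punctuation is significant, so a spelling that is not listed
    explicitly (for example "utf8" in this table) is not recognized.
    """
    return LABELS.get(label.lower())
```

A spelling that browsers do accept would simply be added to the table
as another alias, rather than being admitted by a looser matching rule.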

Erik

On Sat, Aug 15, 2009 at 7:45 PM, Ian Hickson <ian@hixie.ch> wrote:
> On Sat, 15 Aug 2009, Erik van der Poel wrote:
>>
>> I had another look at section 2.7, and it does have a pointer to the
>> IANA charset registry, which also says "However, no distinction is
>> made between use of upper and lower case letters." This is the only
>> matching rule that we need.
>
> We definitely need more than that, I'm just not sure what exactly. The
> only difference between what we need and UTS22 that I know of is that
> UTS22 seems to also allow underscores to be ignored, which appears
> incompatible with browsers. More research here is probably necessary.
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>

Received on Sunday, 16 August 2009 05:17:53 UTC