- From: Erik van der Poel <erikv@google.com>
- Date: Sat, 15 Aug 2009 22:17:15 -0700
- To: Ian Hickson <ian@hixie.ch>
- Cc: public-html-comments@w3.org
Here, I'll do some "research" :-) The following is from UTS22:

1.4 Charset Alias Matching

Names and aliases of charsets are often spelled with small variations. To recognize accidental but unambiguous misspellings and avoid adding each possible variation to a list of recognized names, it is customary to match names case-insensitively and to ignore some punctuation. For best results, names should be compared after applying the following transformations:

1. Delete all characters except a-z, A-Z, and 0-9.
2. Map uppercase A-Z to the corresponding lowercase a-z.
3. From left to right, delete each 0 that is not preceded by a digit.

For example, the following names should match: "UTF-8", "utf8", "u.t.f-008", but not "utf-80" or "ut8".

Note: These rules are in place because in practice implementations are faced with many gratuitous variations in the use and omission of punctuation. There are a small number of IANA names for different charsets that match under these rules, but they appear to be rarely used, obscure charsets: "iso-ir-9-1" and "iso-ir-9-2" match "iso-ir-91" and "iso-ir-92", respectively. (There are also names in the IANA charset registry that violate the registry's own name syntax rules.)

-- End of excerpt from UTS22

Clearly, they recommend that you ignore not only the underscore, but many other characters too. This is so different from current browser behavior that I am surprised that it is even being considered.

I am not saying that the IANA charset registry is perfect, or that the charset registration process flows smoothly. There are many confusing entries in that registry. But I don't think it is a good idea to then give up, and allow all sorts of charset names with whatever punctuation you like.

The ietf-charsets group is currently talking about gathering the browsers' lists of charsets, aliases and supersets (e.g. windows-1252 is the superset used instead of iso-8859-1).
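To make the comparison concrete, here is a rough Python sketch of the three UTS22 transformations quoted above, next to the case-insensitive-only rule from the IANA registry. The function names uts22_key, uts22_match and iana_match are made up for illustration; they do not come from UTS22, the registry, or any browser, and the sketch assumes charset names are plain ASCII.

```python
import re


def uts22_key(name: str) -> str:
    """Reduce a charset name to its UTS22 section 1.4 comparison key."""
    # Step 1: delete all characters except a-z, A-Z, and 0-9.
    kept = re.sub(r"[^A-Za-z0-9]", "", name)
    # Step 2: map uppercase A-Z to lowercase a-z.
    kept = kept.lower()
    # Step 3: from left to right, delete each 0 that is not preceded
    # by a digit. Checking the last character already emitted means a
    # run of leading zeros ("008") is removed entirely.
    out = []
    for ch in kept:
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


def uts22_match(a: str, b: str) -> bool:
    """Loose matching: compare the UTS22 keys of both names."""
    return uts22_key(a) == uts22_key(b)


def iana_match(a: str, b: str) -> bool:
    """Strict matching: ignore only the case of the letters (assumes ASCII)."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    for pair in [("UTF-8", "u.t.f-008"), ("UTF-8", "utf-80"),
                 ("ISO_8859-1", "iso-8859-1"), ("iso-ir-9-1", "iso-ir-91")]:
        print(pair, "uts22:", uts22_match(*pair), "iana:", iana_match(*pair))
```

Under this sketch, "ISO_8859-1" and "iso-8859-1" match with the UTS22 rules but not with the case-insensitive rule, which is the underscore difference under discussion. The same goes for "iso-ir-9-1" and "iso-ir-91", the collision that UTS22 itself points out.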
I believe we will bump into several differences between the browsers, but I also believe that the differences become less and less interesting as you go down the list of popular charsets. So my suggestion is that we initially focus on commonly used encodings. Then we can add more info to the HTML 5 spec (or a spin-off spec, if appropriate) over time.

Erik

On Sat, Aug 15, 2009 at 7:45 PM, Ian Hickson<ian@hixie.ch> wrote:
> On Sat, 15 Aug 2009, Erik van der Poel wrote:
>>
>> I had another look at section 2.7, and it does have a pointer to the
>> IANA charset registry, which also says "However, no distinction is
>> made between use of upper and lower case letters." This is the only
>> matching rule that we need.
>
> We definitely need more than that, I'm just not sure what exactly. The
> only difference between what we need and UTS22 that I know of is that
> UTS22 seems to also allow underscores to be ignored, which appears
> incompatible with browsers. More research here is probably necessary.
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>
Received on Sunday, 16 August 2009 05:17:53 UTC