Re: Accept-Charset support



On Fri, 6 Dec 1996, Chris Lilley wrote:

> > HTTP/1.0 gave a list:
> >
> >      charset = "US-ASCII"
> >              | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
> >              | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
> >              | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
> >              | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
> >              | "UNICODE-1-1" | "UNICODE-1-1-UTF-7" | "UNICODE-1-1-UTF-8"
> >              | token
> 
> well, token covers all the rest ;-)
> 
> Incidentally
> 
> bash$ grep UNICODE-1-1-UTF-8 character-sets.txt
> bash$
> 
> UNICODE-1-1-UTF-8 does not appear to be registered; although RFC 1641
> postulates it as a theoretical entity,  RFC 2044 (not yet diffused to all
> mirrors) specified UTF-8.

It was there, until around the time when RFC 2044 got published.  
At that point UNICODE-1-1-UTF-8 disappeared without any trace from
the registry, and UTF-8 appeared instead.

"UNICODE-1-1-UTF-8" had been around for a while, and there are a
number of Web pages that use this string for labelling the document's
charset.  In contrast, I haven't yet come across any pages which use
"UTF-8".  Those who were using "UNICODE-1-1-UTF-8" suddenly find
themselves in the position of using a totally unregistered charset
parameter (not even registered as an alias).  And note that RFC 2044
has just "Informational" status.

> > and the appendix of HTTP/1.1 includes a list of 'preferred names':
> >
> >        "US-ASCII"
> >        | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
> >        | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
> >        | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
> >        | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
> >        | "SHIFT_JIS" | "EUC-KR" | "GB2312" | "BIG5" | "KOI8-R"
> >
> >        "EUC-JP" for "EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE"
> 
> Did anyone verify that all the 8859 charsets are used or useful?

If all of those have some canonical status, there is no good reason why
ISO-8859-10 is missing from that list.  (Well, I don't know of any,
except maybe to protect software which does a strncmp() instead of a
strcmp() when comparing charset parameters...)
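
To make the strncmp() hazard concrete, here is a minimal sketch in Python
rather than C.  The function names are made up for illustration, and the
comparisons ignore case on the assumption that charset parameter values
are matched case-insensitively.  It only mimics the two comparison styles;
it is not taken from any real client or server.

```python
def prefix_match(label: str, known: str) -> bool:
    """Prefix comparison, the analogue of strncmp(label, known, strlen(known)) == 0.

    Only the first len(known) characters of the label are examined.
    """
    return label[:len(known)].lower() == known.lower()


def exact_match(label: str, known: str) -> bool:
    """Whole-string comparison, the analogue of strcmp(), ignoring case."""
    return label.lower() == known.lower()


# A prefix comparison cannot tell ISO-8859-10 apart from ISO-8859-1.
print(prefix_match("ISO-8859-10", "ISO-8859-1"))
print(exact_match("ISO-8859-10", "ISO-8859-1"))
```

With the prefix comparison, a document labelled ISO-8859-10 would be
treated as ISO-8859-1; the whole-string comparison keeps them apart.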

> I notice that Unicode-1-1 (ie UCS-2) and UTF-8 are missing from the
> HTTP/1.1 list, is there a reason for this?
> 
> Is there a registration request being processed for the EUC-JP alias
> or is that yet to be done?

Is there any publicly known process in place right now, while the
Internet-Draft mentioned below is still a draft?

> > and I'm guessing the right place to fix this up for good is in the
> > final edition of:
> >
> > ftp://ftp.isi.edu/internet-drafts/draft-freed-charset-reg-01.txt
> 
> Thanks for the reference. I see that it only allows character sets
> owned by national bodies to be registered from now on. This may be a

I cannot find anything like that in the draft.  The thing that comes
closest is this paragraph (from 3.2):

"As such, only character sets defined by other processes and
standards bodies, or specific profiles of such character sets,
are eligible for registration."

But it says something quite different.  Standards bodies are only
mentioned in parallel with "other [undefined] processes", and nothing
is said about who can originate "specific profiles".  (And the word
"own" doesn't appear anywhere in the draft, which I personally prefer.)

> good idea or it may not (I recall that the early drafts of 10646 were
> essentially all the national standard character sets catted together
> with scant reference to actual practice).

While we are grepping through the character-sets file: has anybody
else noticed that it contains *two* entries for ISO-8859-1,
misspellings which contradict the referenced RFC (ISO_8859-6-E etc.),
and other inconsistencies and apparent typos?  (I wrote to IANA
about some of them several months ago and never received a reply, and
they are still there.)  Somehow this doesn't inspire much confidence
in the registry or in a working process.
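
Duplicates of this kind can be looked for mechanically.  The following is
only a rough sketch: it assumes the registry text consists of "Name:" and
"Alias:" lines whose first field is the charset label, and the sample data
is invented for the example, not copied from the real file.

```python
from collections import Counter


def duplicate_names(registry_text: str) -> list[str]:
    """Return labels that occur more than once as a Name or Alias (case ignored)."""
    counts = Counter()
    for line in registry_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("Name", "Alias"):
            fields = value.split()
            # "Alias: None" marks an entry without aliases, so skip it.
            if fields and fields[0] != "None":
                counts[fields[0].upper()] += 1
    return sorted(name for name, n in counts.items() if n > 1)


# Invented sample in the assumed layout; the labels are placeholders.
SAMPLE = """\
Name: EXAMPLE-CHARSET-A
Alias: EXAMPLE-A
Alias: None

Name: EXAMPLE-CHARSET-B
Alias: EXAMPLE-A
"""

print(duplicate_names(SAMPLE))
```

This only catches labels that are repeated exactly (apart from case); it
would not catch misspellings such as an underscore where the referenced
RFC has a hyphen.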

  Klaus Weide

