Re: Charset support (was: Accept-Charset support) from Klaus Weide on 1996-12-13 (www-international@w3.org from October to December 1996)

From: Klaus Weide <kweide@tezcat.com>
Date: Fri, 13 Dec 1996 14:07:28 -0600 (CST)
To: Francois Yergeau <yergeau@alis.com>
cc: www-international@w3.org
Message-ID: <Pine.SUN.3.95.961213114247.29454J-100000@xochi.tezcat.com>
On Fri, 13 Dec 1996, Francois Yergeau wrote:
> At 23:06 12-12-96 -0600, Klaus Weide wrote:
> >At least the Lynx code has tried to take that statement seriously, for
> >display of text without an explicit charset.
> 
> Just to make things clear, does that mean that Lynx always defaults to
> Latin-1, going through hoops (mapping, transliteration, etc.) to achieve
> that if necessary?

Speaking about the last released version (2.6, but also earlier ones):
it doesn't go through any hoops to translate *from* anything to Latin-1,
since it is not doing that at all.  But it translates *from* Latin-1
to the selected "Display Character Set".
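
To illustrate the direction of that translation, here is a minimal
Python sketch.  It is not the Lynx code; the function name to_display
and the sample charsets are assumptions chosen for the example.

```python
def to_display(body: bytes, display_charset: str) -> bytes:
    """Treat unlabelled bytes as Latin-1 and re-encode them for the terminal."""
    # Latin-1 maps every byte 0x00-0xFF to the code point of the same value.
    text = body.decode("latin-1")
    # Characters the terminal charset lacks become "?" in this sketch.
    return text.encode(display_charset, errors="replace")


sample = b"caf\xe9"                       # "café" in Latin-1
on_dos_terminal = to_display(sample, "cp437")
on_koi8_terminal = to_display(sample, "koi8_r")
```

On a cp437 terminal the e-acute is moved to that code page's own
position for it; on a KOI8-R terminal, which has no e-acute, the sketch
falls back to a question mark.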

Also (and I would like to be able to say "of course!") all numeric char
references are interpreted w.r.t. Latin-1.
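
As a rough sketch of what that means (assumed helper name
expand_latin1_refs, not anything taken from Lynx): the number in a
reference such as &#233; is looked up in Latin-1, whatever charset the
surrounding document bytes are in.

```python
import re


def expand_latin1_refs(text: str) -> str:
    """Expand decimal numeric character references relative to Latin-1."""

    def replace(match: re.Match) -> str:
        number = int(match.group(1))
        if number > 255:
            # Outside the Latin-1 range: left untouched in this sketch.
            return match.group(0)
        return bytes([number]).decode("latin-1")

    return re.sub(r"&#([0-9]+);", replace, text)


expanded = expand_latin1_refs("caf&#233;")   # the reference becomes e-acute
```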

> Or that it can be made to assume that default, but
> normally defaults to the terminal's own code page?

No, it really does default to Latin-1.  There is a "raw"
toggle/command flag, which has to be explicitly enabled each session.
When it is ON, translation from Latin-1 to Display Character Set is 
suppressed (and 0x80-0x9F may be let through), so in that case one gets 
the "remote charset assumed to be like terminal" behaviour.
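
A small Python sketch of the difference the toggle makes, under the
assumption that the server sends unlabelled KOI8-R to a KOI8-R
terminal.  The function render and its raw parameter are hypothetical
names for this example only.

```python
def render(body: bytes, display_charset: str, raw: bool = False) -> bytes:
    """Return the bytes that would be sent to the terminal."""
    if raw:
        # Raw mode: no translation, bytes (including 0x80-0x9F) pass through.
        return body
    # Normal mode: unlabelled data is assumed to be Latin-1.
    return body.decode("latin-1").encode(display_charset, errors="replace")


koi8_body = "мир".encode("koi8_r")              # unlabelled KOI8-R text
passed_through = render(koi8_body, "koi8_r", raw=True)
translated = render(koi8_body, "koi8_r")        # misread as Latin-1 first
```

With raw ON the terminal receives the original bytes and shows the
Cyrillic text; with raw OFF the bytes are first misread as accented
Latin letters, which KOI8-R cannot represent.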

For completeness I should add that if all sides lie to Lynx, it can't
do anything about it because it won't know.  For example, the user on,
say, a KOI8 terminal sets "Display Character Set" to Latin-1 *and*
the server sends KOI8-R data but labels it as iso-8859-1 (or lets that
be assumed by default).  That will seem to work until one of the two
sides becomes honest.
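
The two lies cancelling out can be shown with a short Python sketch.
The function translate is a hypothetical stand-in for the browser's
charset handling, not actual Lynx code.

```python
def translate(body: bytes, labelled_charset: str, display_charset: str) -> bytes:
    """Decode using the declared label, re-encode for the declared display."""
    return body.decode(labelled_charset).encode(display_charset, errors="replace")


body = "мир".encode("koi8_r")      # what the server really sends

# Both sides lie: label says Latin-1, display setting says Latin-1.
# Latin-1 to Latin-1 is the identity, so the KOI8 terminal gets good bytes.
both_lie = translate(body, "latin-1", "latin-1")
seen_on_terminal = both_lie.decode("koi8_r")

# The user becomes honest about the terminal; the server label still lies.
user_honest = translate(body, "latin-1", "koi8_r")
```

Once only one side tells the truth, the Latin-1 letters the label
implies have no KOI8-R equivalents and the text is lost.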

(I am adding translation *from* some other charsets as well as
decoding of UTF-8 to the code, and that may become part of the Lynx
release.  I probably shouldn't call it Unicode [or UCS] support,
especially in this forum, since there is no notion of character
classes, right-to-left, etc. that seems to imply.)

> >> Or do you mean the statement that one can assume that all HTTP clients
> >> support ISO 8859-1?  Again, this is patently false; try Lynx on a
> >> non-Latin-1 terminal.
> >
> >Lynx definitely supports ISO 8859-1, whether a Latin-1 terminal is in
> >use or not.
> 
> Great, I didn't know that.  Apologies for slandering this venerable pillar
> of the Web.  

:)

> Doesn't help, though, with all those other browsers that depend
> exclusively on the platform's code page, with no support for Latin-1 on
> non-Western systems.  The assumption of *universal* support for Latin-1 in
> HTTP/1.1 is still false, despite the nice efforts of browsers like Lynx.

Yes, it's hard to disagree here.

> And the implicit obligation of Latin-1 support it creates is still
> unjustified.  Why should a minimal browser on a Russian DOS box be forced to
> go through hoops to support 8859-1, when a German or Spanish or American
> browser is not required to support 8859-5?  

It is my understanding that the folks who would be using 8859-2 haven't
yet agreed on whether to use that or windows-1250 or cp852 or ...,
and those who might use 8859-5 are also split (KOI8-R, "alt", ...).
There are several encodings in use for Japanese.  That Russian DOS box
either already has learnt to speak several charsets, or it will not even
be able to understand its neighbors in the same region.  

> The functioning of the protocol
> doesn't require Latin-1, only ASCII.

I guess there is also nothing that makes US-ASCII inherently better
than any other 646 variant.  (Well except for writing program code...)
It has some characters that were not exactly in widespread use around
the world.  Yet in the end it won general acceptance.

Isn't it sometimes more important to agree on something, just anything,
than for everyone to continue with their own default?  If the answer to that
is Yes, then currently, *as a default character encoding for page
content*, I don't see anything that makes as much sense as iso-8859-1.

Even if it seems right now that nearly everybody around the world does
pretty much what they want (let the client guess what we are sending,
after all it works most of the time or we just don't know any better)
--- there is a history in the drafts and specs of Web protocols that
said "iso-8859-1 is default".  One would think the world joined the
World-Wide Web under those conditions...

> >(resulting from recent responses on this list) that for UTF-8 support,
> >GUI browsers will also have to resort to replacement representations
> >(unless they _really_ have access to a set of glyphs for the full BMP
> >repertoire), so they will become more like Lynx<G>.
> 
> It would be foolish to require support for the whole Unicode repertoire from
> any piece of software.  Supporting UTF-8 should mean something like decode
> it properly, don't mistake it for something else, and try your best to
> display/process/whatever your software does.

From some responses it seemed the NC in everybody's hands is right
around the corner, which would then make all-you-can-eat of fonts
appear on the screen via some Java magic (presumably with negligible
cost and delay)...  but I rather like your definition of "supporting 
UTF-8".  There's nothing wrong with displaying _U1234_ if necessary,
I suppose.
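
One way such a replacement representation could be produced is sketched
below in Python.  The handler name u_placeholder and the function
utf8_to_display are made up for this example; the _Uxxxx_ format simply
follows the one mentioned above.

```python
import codecs


def _u_placeholder(err):
    """Encoding error handler: emit _Uxxxx_ for each unencodable character."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    replacement = "".join(f"_U{ord(ch):04X}_" for ch in bad)
    return replacement, err.end


codecs.register_error("u_placeholder", _u_placeholder)


def utf8_to_display(body: bytes, display_charset: str) -> bytes:
    """Decode UTF-8 properly, then do the best the display charset allows."""
    return body.decode("utf-8").encode(display_charset, errors="u_placeholder")


shown = utf8_to_display("caf\u00e9 \u1234".encode("utf-8"), "latin-1")
```

Characters the display charset has are shown normally; anything else
appears as its code point in the _Uxxxx_ form instead of being dropped
or mistaken for another character.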

> [...]  Of course it
> is not universally disregarded, for lots of people it is the appropriate
> default.  But far from everyone.

But it is only a default, and is not _that_ hard to change by proper
content labelling.  Again, I think one default is better than a thousand
defaults.

I should clarify that above I am referring to charsets for entity bodies.
The part of the HTTP draft about charsets in Warning headers seemed,
uhmm, antiquated when I first read it (some months ago).  I can agree
that iso-8859-1 in a special rôle doesn't seem to belong there.

   Klaus
Received on Friday, 13 December 1996 15:08:32 UTC