Re: additional codesets and linemode browser

On Mon, 1 Feb 1999, Rick Kwan wrote:

> At the bottom of his message, Klaus asks:
> > 
> > Does this make sense to anyone? :)
> > 
> >     Klaus
> 
> I think I caught most of it.  But let me see if I understand
> a couple of things.
> 
>     1.	This sounds like HTML files must be encoded in UTF-8.

Actually, no; I was concentrating on the central piece of SGML(-like)
parsing, and examining (well, more or less just thinking out loud) what
it would take to make that central part deal with the various charset
situations in a clean manner.  The SGML parsing would deal only with
one kind of representation (which happens to be the one defined as the
document character set of HTML in SGML terms), but the representation
in files could be completely different.  All that UTF-8 in files would
give you is that the conversion UTF-8 -> UCS-{2,4} is simple, but there
could be an arbitrary number of file-character-encoding -> UCS converters
based on mapping tables or whatever else is needed.
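
Just to make that concrete, here is a rough sketch of what one such
file-character-encoding -> UCS converter could look like for the UTF-8
case.  The function name and calling convention are made up for this
example, and the checks for overlong forms and bad continuation bytes
are left out:

    #include <stddef.h>

    typedef unsigned long UCS4;     /* wide enough for UCS-4 */

    /* Decode one UTF-8 sequence starting at s, store the code point
     * in *out, and return the number of bytes consumed (0 on error).
     * Hypothetical helper, not an existing libwww function. */
    size_t utf8_to_ucs4(const unsigned char *s, size_t len, UCS4 *out)
    {
        if (len == 0)
            return 0;
        if (s[0] < 0x80) {                        /* 1 byte: ASCII */
            *out = s[0];
            return 1;
        }
        if ((s[0] & 0xE0) == 0xC0 && len >= 2) {  /* 2-byte sequence */
            *out = ((UCS4)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        }
        if ((s[0] & 0xF0) == 0xE0 && len >= 3) {  /* 3-byte sequence */
            *out = ((UCS4)(s[0] & 0x0F) << 12)
                 | ((UCS4)(s[1] & 0x3F) << 6)
                 |  (s[2] & 0x3F);
            return 3;
        }
        return 0;   /* longer forms and malformed input omitted here */
    }

A converter for a table-driven charset would have the same shape, just
with a lookup instead of the bit-shuffling.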

Once you have two converter pairs, charsetA <-> UCS and charsetB <->
UCS, you get the possibility to transcode charsetA <-> charsetB kind
of automatically.
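
Schematically (reusing the UCS4 type and the calling convention from
the sketch above; the typedefs here are again invented just for
illustration):

    /* Hypothetical converter signatures: each returns the number of
     * bytes consumed resp. produced, as in the UTF-8 sketch above. */
    typedef size_t (*to_ucs_fn)(const unsigned char *in, size_t len,
                                UCS4 *cp);
    typedef size_t (*from_ucs_fn)(UCS4 cp, unsigned char *out,
                                  size_t max);

    /* Transcode charsetA -> charsetB by pivoting through UCS. */
    size_t transcode(to_ucs_fn decode_a, from_ucs_fn encode_b,
                     const unsigned char *in, size_t inlen,
                     unsigned char *out, size_t outmax)
    {
        size_t ip = 0, op = 0;
        while (ip < inlen) {
            UCS4 cp;
            size_t n = decode_a(in + ip, inlen - ip, &cp);
            if (n == 0)
                break;              /* malformed input */
            ip += n;
            n = encode_b(cp, out + op, outmax - op);
            if (n == 0)
                break;              /* no mapping, or out of room */
            op += n;
        }
        return op;
    }

The nice property is that supporting an additional charsetC then only
needs one new converter pair, not a transcoder for every combination.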

Since it's (unfortunately) completely unrealistic to expect mapping
tables for a given charset to be available, or even to expect that
the charset of external data is known, there should then be a default
way to sneak data through the SGML parser without mutilating them
(a reversible transformation), and that could be done nicely with
a trivial mapping to/from some private zone.  It would remove the
need for the various states in SGML_write within `#ifdef ISO_2022_JP'
(they aren't really enough anyway - multibyte characters in attribute
values aren't handled); a '<' or <"> char that is part of a multibyte
sequence could no longer be misinterpreted as having parsing
significance, since it would already have been transformed by the time
SGML_write sees it.
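
Roughly what I mean, as a sketch only: bytes of an unknown (or only
partially understood) charset get parked in some private zone of UCS
on the way in, and recovered on the way out.  The 0xF000+byte zone and
the very crude escape handling below are arbitrary choices for the
example, and a real version would also have to regenerate the escape
sequences on output:

    /* Park a raw byte in a private zone so that a 0x3C ('<') which is
     * really half of a double-byte character can never look like
     * markup to the SGML parser downstream. */
    #define PARK(b)   (0xF000UL + (unsigned char)(b))
    #define UNPARK(c) ((unsigned char)((c) - 0xF000UL))  /* way back out */

    struct jp_state { int dbcs; int saw_esc; };

    /* Feed one input byte; emit at most one UCS code point via *out
     * and return 1 if *out was written.  Escape sequences are
     * swallowed here (crudely: '$' switches to double-byte mode,
     * '(' switches back to ASCII). */
    int jp_put_byte(struct jp_state *st, unsigned char b, UCS4 *out)
    {
        if (b == 0x1B) {                  /* start of escape sequence */
            st->saw_esc = 1;
            return 0;
        }
        if (st->saw_esc) {
            if (b == '$') st->dbcs = 1;
            else if (b == '(') st->dbcs = 0;
            else st->saw_esc = 0;         /* final byte of the sequence */
            return 0;
        }
        *out = st->dbcs ? PARK(b) : (UCS4)b;
        return 1;
    }

All the ISO-2022-JP bookkeeping then lives in this little state
machine instead of in SGML_write; the parser itself would only ever
see plain ASCII or parked code points.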

>     2.	A lot of currently single-byte routines need to be
> 	converted to handle 16-bit or 32-bit Unicode characters.

Not that many, if the goal is only to have SGML.c charset-clean:
some (registrable) streams and functions for conversion, acting as
"adapters" before and after the SGML stage, would be enough; the rest
could (but need not) remain all C strings.
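
Something along these lines - and this is a simplified stand-in, not
the real HTStream interface (which is richer than this), just an
illustration wired to the ISO-2022-JP sketch above:

    /* Simplified libwww-style stream stages, for illustration only. */

    typedef struct _UCSSink UCSSink;         /* e.g. the SGML stage */
    struct _UCSSink {
        void (*put_char)(UCSSink *self, UCS4 cp);
    };

    typedef struct _ByteAdapter ByteAdapter; /* charset adapter */
    struct _ByteAdapter {
        void (*put_byte)(ByteAdapter *self, unsigned char b);
        UCSSink        *sink;                /* next stage downstream */
        struct jp_state state;               /* converter state, see above */
    };

    /* The adapter converts raw bytes to the internal representation
     * and hands the result to the SGML stage. */
    static void adapter_put_byte(ByteAdapter *self, unsigned char b)
    {
        UCS4 cp;
        if (jp_put_byte(&self->state, b, &cp))
            self->sink->put_char(self->sink, cp);
    }

Registering a new charset would then just mean associating such an
adapter with a charset name; nothing in SGML.c itself would have to
change, and everything downstream of the adapters could stay as it is.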

> My personal comments on these matters:
> 
>     1.	UTF-8 is nice, but most Asian HTML files will be written in
> 	national codesets, e.g., KSC-5601, JIS or shift-JIS, Big-5
> 	or EUC-CNS (euc-tw), or GB.  These cannot be ignored in
> 	preference to UTF-8 because most authors won't have UTF-8
> 	tools.

Yes, I am aware of that reality.  I didn't mean to force UTF-8 or
raw UCS on anybody - this could be invisible to the user, if UCS (or
some fake-UCS) is used only internally.

>     2.	I am ambivalent about the development and performance
> 	tradeoffs between single-width Unicode vs multi-byte
> 	codesets.  I agree that you don't want to bloat statically
> 	linked code with multiple codesets; this results in
> 	re-compilation and re-link each time a new language is
> 	supported.  A dynamically loadable solution is preferred.

There is of course some performance penalty for wide characters;
but if they are only used temporarily - and SGML.c already just
streams the data thru, and has to look at each character individually -
it shouldn't be much compared to all the other processing necessary
for text.  (And if you're not dealing with text, or don't want it
parsed/rendered, you don't need to do any of it.)

I'm kind of thinking of libwww as an example framework that tries to get
it right in a flexible way, not necessarily in the most efficient and
hackerish way (except at the HTTP protocol level maybe).

>     3.	This may be obvious to many:  as far as linemode browser
> 	is concerned, there is work to be done both in SGML.c and in
> 	places like HTBrowse.c, where text presentation takes place.
> 	Visual width and text string width are not the same thing.

I know very little about that...
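
But just to illustrate the distinction for myself: counting bytes and
counting columns give different answers once double-width characters
appear.  Something like the wcwidth() lookup below (an XPG4/Single
Unix function) would be needed; this is only an illustration, not a
proposal for HTBrowse.c:

    #define _XOPEN_SOURCE 500       /* for wcwidth() on some systems */
    #include <wchar.h>

    /* strlen() on the encoded form counts bytes; a terminal cares
     * about columns, and CJK characters typically occupy two. */
    int display_columns(const wchar_t *ws)
    {
        int cols = 0;
        for (; *ws; ws++) {
            int w = wcwidth(*ws);   /* -1 for non-printable chars */
            if (w > 0)
                cols += w;
        }
        return cols;
    }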

> I've been silent about this until now because, having done
> some Unicode and multi-byte work, the stuff scares me to death!
> But, yes, I agree that multi-lingual support would be a
> nice thing to see happen.

Well, I can't read any of the scripts that need multi-byte characters,
so everything looks just like a stream of bits which can be transformed
into another stream of bits - makes it less scary. :)
I am of course avoiding all the different questions of rendering, just
talking about a part that IMHO _should_ be handled in a generic manner:
provide a way to get the text data parsed, and if possible converted
to the preferred character set, to HTML.c and HTBrowse.c (or
equivalents); the rest is then up to whoever wants to really implement
the rendering
with all the necessary script-/directionality-/font-/language-specific
extra knowledge.

Just some thoughts triggered by the question...

   Klaus

Received on Tuesday, 2 February 1999 08:54:19 UTC