Re: additional codesets and linemode browser from Rick Kwan on 1999-02-03 (www-lib@w3.org from January to March 1999)

From: Rick Kwan <kenobi@coruscant.lightsaber.com>
Date: Wed, 3 Feb 1999 12:00:16 -0800
To: www-lib@w3.org
Message-Id: <199902032000.MAA24933@coruscant.lightsaber.com>
First, in a nutshell:
  1. If all the codeset conversions can be confined to SGML.c,
     then this is not too bad.
  2. Some thoughts below about implementing multiple locales,
     including rendering issues and where to draw the line.

Now to the details...

> From: Klaus Weide <kweide@tezcat.com>
> To: Rick Kwan <kenobi@coruscant.lightsaber.com>
> cc: www-lib@w3.org
> Subject: Re: additional codesets and linemode browser
> 
> On Mon, 1 Feb 1999, Rick Kwan wrote:
> 
> > At the bottom of his message, Klaus asks:
> > > 
> > > Does this make sense to anyone? :)
> > > 
> > >     Klaus
> > 
> > I think I caught most of it.  But let me see if I understand
> > a couple of things.
> > 
> >     1.	This sounds like HTML files must be encoded in UTF-8.
> 
> Actually, no; I was concentrating on the central piece of SGML(like-)
> parsing, and examining (well, more or less just thinking out loud) what
> it would take to make that central part deal with the various charset
> situations in a clean manner.  The SGML parsing would deal only with
> one kind of representation (which happens to be the one defined as the
> Document character set of HTML in SGML terms), but the representation
> in files could be completely different.  All that UTF-8 in files would
> give you is that the conversion UTF-8 -> UCS-{2,4} is simple, but there
> could be an arbitrary number of file-character-encoding -> UCS converters
> based on mapping tables or whatever else is needed.
> 
> Once you have two converter pairs, charsetA <-> UCS and charsetB <->
> UCS, you get the possibility to transcode charsetA <-> charsetB kind of
> automatically.

Interesting.  This may give us JIS<->shift JIS and EUC/CNS<->Big-5
just by doing the conversions in SGML.c.  Actually, this may even
give us JIS<->Big-5 (Japanese<->Chinese) for Kanji in both codesets.
(If I am not mistaken, GB (GB-2312) is Chinese but has Hiragana as
well, which has even more interesting possibilities.)
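
The "two converter pairs give you a transcoder" idea can be sketched in C.
This is only an illustration under stated assumptions: the charset_conv
struct and transcode() are hypothetical names, not libwww APIs, and the two
single-byte charsets (ISO-8859-1 and ISO-8859-15) stand in for the much
larger JIS or Big-5 tables.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long ucs4_t;            /* wide enough for any UCS-4 value */

/* Hypothetical converter pair: charset <-> UCS, one byte per character. */
typedef struct {
    const char *name;
    ucs4_t (*to_ucs)(unsigned char b);
    int    (*from_ucs)(ucs4_t u);        /* returns -1 if unmappable */
} charset_conv;

/* The eight positions where ISO-8859-15 differs from ISO-8859-1. */
static const struct { unsigned char byte; ucs4_t ucs; } latin9_diff[] = {
    {0xA4, 0x20AC}, {0xA6, 0x0160}, {0xA8, 0x0161}, {0xB4, 0x017D},
    {0xB8, 0x017E}, {0xBC, 0x0152}, {0xBD, 0x0153}, {0xBE, 0x0178}
};
#define NDIFF (sizeof latin9_diff / sizeof latin9_diff[0])

static ucs4_t latin1_to_ucs(unsigned char b) { return b; }
static int latin1_from_ucs(ucs4_t u) { return u <= 0xFF ? (int)u : -1; }

static ucs4_t latin9_to_ucs(unsigned char b)
{
    size_t i;
    for (i = 0; i < NDIFF; i++)
        if (latin9_diff[i].byte == b) return latin9_diff[i].ucs;
    return b;
}

static int latin9_from_ucs(ucs4_t u)
{
    size_t i;
    for (i = 0; i < NDIFF; i++) {
        if (latin9_diff[i].ucs == u) return latin9_diff[i].byte;
        if (latin9_diff[i].byte == u) return -1;  /* replaced in Latin-9 */
    }
    return u <= 0xFF ? (int)u : -1;
}

const charset_conv latin1 = { "ISO-8859-1",  latin1_to_ucs, latin1_from_ucs };
const charset_conv latin9 = { "ISO-8859-15", latin9_to_ucs, latin9_from_ucs };

/* Transcode n bytes from one charset to another by pivoting through UCS.
   Characters missing from the target charset become `fallback`. */
size_t transcode(const charset_conv *from, const charset_conv *to,
                 const unsigned char *in, size_t n,
                 unsigned char *out, unsigned char fallback)
{
    size_t i;
    assert(from != NULL && to != NULL && in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        int c = to->from_ucs(from->to_ucs(in[i]));
        out[i] = (c < 0) ? fallback : (unsigned char)c;
    }
    return n;
}
```

Neither converter knows about the other; transcode() only needs the two
pairs. A multibyte version would need variable-length input and output,
but the pivot structure would stay the same.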

> Since it's (unfortunately) completely unrealistic to expect mapping
> tables for a given charset to be available, or even to expect that
> the charset of external data is known, there should then be a default
> way to sneak data through the SGML parser without mutilating them
> (a reversible transformation), and that could be done nicely with
> a trivial mapping to/from some private zone.  It would remove the
> need for the various states in SGML_write within `#ifdef ISO_2022_JP'
> (they aren't really enough anyway - multibyte characters in attribute
> values aren't handled), there wouldn't be a '<' or <"> char that is
> part of a multibyte being misinterpreted as having parsing significance
> since it would have been transformed when SGML_write sees it.

I think I understand that.  This is talking about, for example,
JIS->UCS->JIS without losing user-defined or other characters not
yet defined in UCS.
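
A minimal sketch of such a reversible "private zone" mapping follows. The
base value PRIVATE_BASE and all function names are assumptions made for
illustration, not anything from SGML.c. It only wraps bytes with the high
bit set, so it suits 8-bit encodings; a 7-bit encoding such as ISO-2022-JP
would presumably need a stateful pass to decide which bytes to wrap.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long ucs4_t;

/* Hypothetical base inside the BMP Private Use Area (U+E000..U+F8FF). */
#define PRIVATE_BASE 0xF000UL

/* Wrap one byte of unknown-charset data: ASCII passes through unchanged,
   high-bit bytes move to the private zone. */
ucs4_t byte_to_private(unsigned char b)
{
    return b < 0x80 ? (ucs4_t)b : PRIVATE_BASE + b;
}

/* Undo byte_to_private(); returns -1 for anything it did not produce. */
int private_to_byte(ucs4_t u)
{
    if (u < 0x80)
        return (int)u;
    if (u >= PRIVATE_BASE + 0x80 && u <= PRIVATE_BASE + 0xFF)
        return (int)(u - PRIVATE_BASE);
    return -1;
}

/* The characters a parser would treat as significant in this sketch. */
int is_markup_char(ucs4_t u)
{
    return u == '<' || u == '>' || u == '&' || u == '"';
}

/* Wrap a whole buffer before handing it to the parser stage. */
void wrap_bytes(const unsigned char *in, size_t n, ucs4_t *out)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = byte_to_private(in[i]);
}
```

Because every wrapped value maps back to exactly one byte, the data can be
passed through the parser and restored afterwards without loss, and no
wrapped byte can be mistaken for markup.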

> >     2.	A lot of currently single-byte routines need to be
> > 	converted to handle 16-bit or 32-bit Unicode characters.
> 
> Not that many, if the goal is only to have SGML.c charset-clean;
> some (registrable) streams and functions for conversion, to act as
> "adapters" before and after the SGML stage; the rest could (but need
> not) remain all C strings.

Yeah.  I'll agree it's not too bad if we stick to SGML.c.

> > My personal comments on these matters:
> > 
> >     1.	UTF-8 is nice, but most Asian HTML files will be written in
> > 	national codesets, e.g., KSC-5601, JIS or shift-JIS, Big-5
> > 	or EUC-CNS (euc-tw), or GB.  These cannot be ignored in
> > 	preference to UTF-8 because most authors won't have UTF-8
> > 	tools.
> 
> Yes, I am aware of that reality.  I didn't mean to force UTF-8 or
> raw UCS on anybody - this could be invisible to the user, if UCS (or
> some fake-UCS) is used only internally.
> 
> >     2.	I am ambivalent about the development and performance
> > 	tradeoffs between single-width Unicode vs multi-byte
> > 	codesets.  I agree that you don't want to bloat statically
> > 	linked code with multiple codesets; this results in
> > 	re-compilation and re-link each time a new language is
> > 	supported.  A dynamically loadable solution is preferred.
> 
> There is of course some performance penalty for wide characters;
> but if they are only used temporarily - and SGML.c already just
> streams the data thru, and has to look at each character individually -
> it shouldn't be much compared to all the other processing necessary
> for text.  (And if you're not dealing with text, or don't want it
> parsed/rendered, you don't need to do any of it.)
> 
> I'm kind of thinking of libwww as an example framework that tries to get
> it right in a flexible way, not necessarily in the most efficient and
> hackerish way (except at the HTTP protocol level maybe).

I agree on the "framework" observation.  I realize that my
desire for a dynamically loadable solution is going to be hard to
do portably.  It probably makes more sense to build a statically
linked model, prototyping selected locales.

Perhaps the way to handle this is as a separate library,
similarly to how PICS-client is handled.  Part of the reason is
because there are potentially large tables (various<->UCS) to be
included.  Another part is due to rendering issues -- more on this
below.

> 
> >     3.	This may be obvious to many:  as far as linemode browser
> > 	is concerned, there is work to be done both in SGML.c and in
> > 	places like HTBrowse.c, where text presentation takes place.
> > 	Visual width and text string width are not the same thing.
> 
> I know very little about that...

Well, I ...  Let's just say I did my time. :-)

The key issue I can think of is where to break a line.  In
Chinese, characters (words) are strung together in a sentence
with no intervening blanks.  Conversely, Korean does have blanks
as word separators.  (I don't know enough about Japanese.)
These locale-specific rules need to be coordinated with the output
of SGML.c.

To complicate matters a bit more, Thai (multiple characters in a
single column) and Hebrew or Arabic (bi-directional text) add
their own unique complexities.

This begins to argue for a linemode browser with callbacks for
locale-dependent rendering considerations.  This is not as bad
as X11; we can assume that X11 or a terminal is providing fonts
of known simple widths (e.g., ASCII==1, most Asian==2).
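
As a rough sketch of what such a callback could look like, assuming the
text has already been converted to UCS: cell_width(), break_rule and
fit_columns() are hypothetical names, and the width ranges are a coarse
approximation rather than a complete table.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long ucs4_t;

/* Columns occupied on a fixed-width terminal: 2 for the CJK, kana and
   Hangul ranges listed here, 1 for everything else (an approximation). */
int cell_width(ucs4_t u)
{
    if ((u >= 0x3000 && u <= 0x9FFF) ||      /* CJK symbols, kana, ideographs */
        (u >= 0xAC00 && u <= 0xD7A3) ||      /* Hangul syllables */
        (u >= 0xFF01 && u <= 0xFF60))        /* fullwidth forms */
        return 2;
    return 1;
}

/* Visual width of a string; generally not equal to its length n. */
int display_width(const ucs4_t *s, size_t n)
{
    size_t i;
    int w = 0;
    for (i = 0; i < n; i++)
        w += cell_width(s[i]);
    return w;
}

/* Locale callback: may a line break fall between prev and next? */
typedef int (*break_rule)(ucs4_t prev, ucs4_t next);

/* Break only after blanks (e.g. English, or Korean as described above). */
int break_at_blank(ucs4_t prev, ucs4_t next)
{
    (void)next;
    return prev == ' ';
}

/* Also break next to any wide character (e.g. Chinese). */
int break_cjk(ucs4_t prev, ucs4_t next)
{
    return prev == ' ' || cell_width(prev) == 2 || cell_width(next) == 2;
}

/* Number of leading characters of s to put on a line of `columns` cells.
   Falls back to a hard break if the rule allows no break in range. */
size_t fit_columns(const ucs4_t *s, size_t n, int columns, break_rule rule)
{
    size_t i, last_break = 0;
    int used = 0;
    assert(s != NULL && rule != NULL && columns >= 2);
    for (i = 0; i < n; i++) {
        int w = cell_width(s[i]);
        if (i > 0 && rule(s[i - 1], s[i]))
            last_break = i;
        if (used + w > columns)
            return last_break > 0 ? last_break : i;
        used += w;
    }
    return n;
}
```

The same string can break at different places depending on which rule the
locale supplies. Thai clustering and bi-directional text are not covered
and would need more than a width and a break callback.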

> 
> > I've been silent about this until now because, having done
> > some Unicode and multi-byte work, the stuff scares me to death!
> > But, yes, I agree that multi-lingual support would be a
> > nice thing to see happen.
> 
> Well, I can't read any of the scripts that need multi-byte characters,
> so everything looks just like a stream of bits which can be transformed
> into another stream of bits - makes it less scary. :)
> I am of course avoiding all the different questions of rendering, just
> talking about a part that IMHO _should_ be handled in a generic manner:
> provide a way to get the text data parsed, and if possible converted
> to the preferred character set, to HTML.c and HTBrowse.c (or equivalents),
> the rest is then up to whoever wants to really implement the rendering
> with all the necessary script-/directionality-/font-/language-specific
> extra knowledge.
> 
> Just some thoughts triggered by the question...
> 
>    Klaus

And before I dig myself a deep hole and jump in, my reading of
other scripts is very poor...  But the people who do read them
keep coming back to beat up on me!

--Rick Kwan
Received on Wednesday, 3 February 1999 14:47:47 UTC