Re: How to Convert Korean language from HTML to Text from Klaus Weide on 1999-01-29 (www-lib@w3.org from January to March 1999)

From: Klaus Weide <kweide@tezcat.com>
Date: Fri, 29 Jan 1999 03:09:17 -0600 (CST)
To: Henrik Frystyk Nielsen <frystyk@w3.org>
cc: www-lib@w3.org
Message-ID: <Pine.SUN.3.95.990129015729.23787D-100000@xochi.tezcat.com>

On Mon, 25 Jan 1999, Henrik Frystyk Nielsen wrote:

> Here is one for all of you - the current SGML/HTML parser is 8bit only.
> Anyone interested in expanding it to support larger charsets?

I am in part responsible for the mess that SGML.c (once based on libwww
2.x) has become in the Lynx code
(<http://sol.slcc.edu/lynx/current/lynx2-8-2/WWW/Library/Implementation/SGML.c>),
where it was extended to support character set translation.  Based on
that experience, I wouldn't recommend that the Library follow the same
path, i.e. adding more special hacks for specific charsets to SGML.c.

I think nowadays SGML.c should be based on Unicode or ISO 10646.
The SGML parser object should provide interfaces for feeding it
"characters" instead of bytes, where characters would be expressed
as 2- or 4-byte codes.  That means it couldn't be just a regular
HTStream - HTStream should be "subclassed" into an HTCharacterStream
with additional methods, at least an additional

    int (*put_unicode)(HTStream * me, int unicode);

method.  (I assume HTCharacterStream should also provide the "old"
char-based entry points for fallback, although that may not even
be necessary.)  The new SGML object would also provide its output
as "characters" (unicode values) not "bytes".
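
A minimal sketch of what such a "subclassed" stream could look like.
All names here (CharStream, CharStreamClass, the demo functions) are
hypothetical and are not the actual libwww declarations; only the
put_unicode signature follows the one given above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; not the actual libwww declarations. */
typedef struct CharStream CharStream;

typedef struct CharStreamClass {
    const char *name;
    int (*put_character)(CharStream *me, char c);     /* "old" char-based entry point */
    int (*put_unicode)(CharStream *me, int unicode);  /* the additional method */
} CharStreamClass;

struct CharStream {
    const CharStreamClass *isa;  /* method table shared by all instances */
    int last;                    /* demo state: last code received */
    size_t count;                /* demo state: number of codes received */
};

/* Demo sink: just records what it was fed. */
static int demo_put_unicode(CharStream *me, int unicode)
{
    assert(me != NULL);
    me->last = unicode;
    me->count++;
    return 0;
}

/* Fallback path: forward the byte, read as an unsigned value, to put_unicode. */
static int demo_put_character(CharStream *me, char c)
{
    assert(me != NULL);
    return me->isa->put_unicode(me, (int)(unsigned char)c);
}

static const CharStreamClass DemoCharStreamClass = {
    "DemoCharacterStream", demo_put_character, demo_put_unicode
};
```

A caller would feed it with s.isa->put_unicode(&s, 0xD55C) for a Hangul
syllable, or with s.isa->put_character(&s, 'A') through the fallback.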

In analogy to the existing converters for other dimensions (MIME type,
encodings), CharsetConverters should be added.  Actually they should
probably come in pairs (as for encoding/decoding), one HTStream
which feeds a HTCharacterStream (converting stream-of-bytes to
stream-of-(unicode-)characters), and one HTCharacterStream which feeds
a HTStream (for the opposite direction).  The Library should provide
a trivial default implementation of each, which need not be aware of
any specific charset (except probably the ASCII subset), but should
by default provide completely reversible transformations; i.e.

  stream-of-octets  -->  stream-of-characters  -->  stream-of-octets

should give exactly the original octets.  This could be achieved
by (1) keeping 7-bit characters as they are (just cast char to int), and
(2) mapping all 8-bit characters (octets with the high bit set) to a
private zone in the unicode character space, e.g.
unicode(c) = (0xF000 | (unsigned char)(c)).
A slightly more complicated converter might be aware of EUC or
ISO-2022 character encoding rules (but not specific actual to-Unicode
character mappings) and "protect" characters from misinterpretation
by the Unicode-handling SGML object the same way.
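
The trivial default pair can be sketched as follows, using the 0xF000
mapping from the text.  The function names are made up for this
illustration, and returning -1 for codes the default converter cannot
map back is an assumption.  The values 0xF080 to 0xF0FF fall inside the
Unicode Private Use Area (U+E000 to U+F8FF).

```c
#include <assert.h>

#define PRIVATE_BASE 0xF000   /* private zone used for high-bit octets */

/* Octet -> character: 7-bit values pass through, 8-bit values are parked
 * in the private zone.  Callers holding a plain char cast it first, as in
 * the expression (0xF000 | (unsigned char)(c)). */
static int octet_to_unicode(unsigned char octet)
{
    return (octet < 0x80) ? (int)octet : (PRIVATE_BASE | octet);
}

/* Character -> octet: the exact inverse.  Returns -1 for any code that
 * the default converter did not produce itself (an assumed convention). */
static int unicode_to_octet(int unicode)
{
    if (unicode >= 0 && unicode < 0x80)
        return unicode;
    if (unicode >= (PRIVATE_BASE | 0x80) && unicode <= (PRIVATE_BASE | 0xFF))
        return unicode & 0xFF;
    return -1;
}

/* Checks octets -> characters -> octets for all 256 octet values. */
static void check_roundtrip(void)
{
    int b;
    for (b = 0; b < 256; b++)
        assert(unicode_to_octet(octet_to_unicode((unsigned char)b)) == b);
}
```

Because no octet is ever interpreted, the pair stays reversible for any
charset, including multi-byte ones it knows nothing about.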

The user application could then register charset converters for
specific charsets which would override the default behavior, doing
the correct octet -> unicode mapping (and, for the reverse direction,
whatever is most appropriate for the application - for example
replacing unrecognized unicodes by '?', etc.).
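
One way the registration could work, as a sketch only: a small table
keyed by charset name, with the default mapping returned when nothing
is registered.  register_converter, find_converter and the fixed-size
table are hypothetical, not an existing Library interface.  ISO-8859-1
is used as the example because its octets coincide with the first 256
Unicode values.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef int (*OctetToUnicode)(unsigned char octet);

#define MAX_CONVERTERS 16
static struct { const char *charset; OctetToUnicode fn; } registry[MAX_CONVERTERS];
static size_t registry_len = 0;

/* Charset-unaware default: high-bit octets go to the private zone. */
static int default_to_unicode(unsigned char o) { return o < 0x80 ? o : (0xF000 | o); }

/* Application-supplied converter for ISO-8859-1. */
static int latin1_to_unicode(unsigned char o) { return o; }

/* Reverse direction for ISO-8859-1: unrecognized codes become '?'. */
static int unicode_to_latin1(int unicode)
{
    return (unicode >= 0 && unicode <= 0xFF) ? unicode : '?';
}

/* Charset names are compared without regard to case. */
static int same_name(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

static int register_converter(const char *charset, OctetToUnicode fn)
{
    assert(charset != NULL && fn != NULL);
    if (registry_len == MAX_CONVERTERS)
        return -1;                      /* table full */
    registry[registry_len].charset = charset;
    registry[registry_len].fn = fn;
    registry_len++;
    return 0;
}

/* Later registrations win; unknown charsets get the default. */
static OctetToUnicode find_converter(const char *charset)
{
    size_t i;
    for (i = registry_len; i-- > 0; )
        if (same_name(registry[i].charset, charset))
            return registry[i].fn;
    return default_to_unicode;
}
```

After register_converter("ISO-8859-1", latin1_to_unicode), a lookup for
"iso-8859-1" yields the real mapping, while a lookup for an unregistered
charset still yields the reversible default.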

The biggest change would have to be in the interface to the
HTStructured object.  Either HTStructured's methods should be
changed to take unicodes instead of chars, or, for reuse of existing
implementations, there could be an HTCharacterStructured ->
HTOldStructured back-converter.  The unicode-handling SGML object
would not actually have to feed "wide character" strings to the
next stage (HTStructured); it could do a conversion to UTF-8 on
some or all of its output.
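
For reference, the UTF-8 step for a single unicode value can be sketched
as below.  utf8_encode is a made-up name, and the sketch assumes the
four-byte limit of current UTF-8 (codes up to 0x10FFFF, surrogates
rejected) and an int of at least 32 bits.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the UTF-8 form of one code into out (room for 4 bytes needed).
 * Returns the number of bytes written, or 0 if the code is not encodable. */
static size_t utf8_encode(int unicode, unsigned char *out)
{
    assert(out != NULL);
    if (unicode < 0 || unicode > 0x10FFFF)
        return 0;
    if (unicode >= 0xD800 && unicode <= 0xDFFF)
        return 0;                                   /* surrogate range */
    if (unicode < 0x80) {                           /* 0xxxxxxx */
        out[0] = (unsigned char)unicode;
        return 1;
    }
    if (unicode < 0x800) {                          /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (unicode >> 6));
        out[1] = (unsigned char)(0x80 | (unicode & 0x3F));
        return 2;
    }
    if (unicode < 0x10000) {                        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (unicode >> 12));
        out[1] = (unsigned char)(0x80 | ((unicode >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (unicode & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (unicode >> 18));
    out[1] = (unsigned char)(0x80 | ((unicode >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((unicode >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (unicode & 0x3F));
    return 4;
}
```

ASCII values come out as single unchanged bytes, so an existing
char-based HTStructured implementation would see plain ASCII markup
exactly as before.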

Does this make sense to anyone? :)

    Klaus

Received on Friday, 29 January 1999 04:09:21 UTC