Re: additional codesets and linemode browser from Henrik Frystyk Nielsen on 1999-02-05 (www-lib@w3.org from January to March 1999)

From: Henrik Frystyk Nielsen <frystyk@w3.org>
Date: Fri, 05 Feb 1999 14:13:35 -0500
To: Klaus Weide <kweide@tezcat.com>
Cc: www-lib@w3.org
Message-Id: <3.0.5.32.19990205141335.0307f100@localhost>
At 03:09 1/29/99 -0600, Klaus Weide wrote:
>On Mon, 25 Jan 1999, Henrik Frystyk Nielsen wrote:
>
>> Here is one for all of you - the current SGML/HTML parser is 8bit only.
>> Anyone interested in expanding it to support larger charsets?
>
>I am in part responsible for the mess SGML.c (once based on libwww
>2.x) has become in the code for Lynx (<http://sol.slcc.edu/lynx/current/
>lynx2-8-2/WWW/Library/Implementation/SGML.c>), to add support for
>character set translation.  Based on that experience, I wouldn't recommend
>the Library to follow the same path, i.e. add more special hacks for
>specific charsets to SGML.c.

I fully agree that hacking any more on the current SGML module is not a good
idea.

>I think nowadays SGML.c should become based on Unicode or ISO 10646.
>The SGML parser object should provide interfaces for feeding it
>"characters" instead of bytes, where characters would be expressed 
>as 2- or 4-byte codes.  That means it couldn't be just a regular
>HTStream - HTStream should be "subclassed" to a HTCharacterStream
>with additional methods, at least an additional
>
>    int (*put_unicode)(HTStream * me, int unicode);
>
>method.  (I assume HTCharacterStream should also provide the "old"
>char-based entry points for fallback, although that may not even
>be necessary.)  The new SGML object would also provide its output
>as "characters" (unicode values) not "bytes".

Yep - although don't you think that the HTCharacterStream is really just a
new version of the HTStructured stream? Then we would have

   stream-of-octets  -->  stream-of-structured-characters --> stream-of-octets

That is, the structured parser must not throw anything away.
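
To make the proposed interface concrete, here is a minimal sketch of a
character stream with both the "old" char-based entry point and the
put_unicode method from the quoted text. Only the put_unicode signature comes
from the proposal; CharStream, char_stream_init and the fixed-size buffer are
made-up names for illustration and are not the libwww HTStream API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the libwww HTStream API. */
typedef struct CharStream CharStream;

struct CharStream {
    /* "Old" byte-based entry point, kept for fallback. */
    int (*put_character)(CharStream *me, char c);
    /* Entry point from the proposal: one character code per call. */
    int (*put_unicode)(CharStream *me, int unicode);
    int    codes[64];   /* collected output, fixed size for the sketch */
    size_t count;
};

static int collect_unicode(CharStream *me, int unicode)
{
    if (me->count >= sizeof me->codes / sizeof me->codes[0])
        return -1;                      /* buffer full */
    me->codes[me->count++] = unicode;   /* nothing is thrown away */
    return 0;
}

static int collect_character(CharStream *me, char c)
{
    /* Fallback: widen the byte without sign extension and forward it. */
    return me->put_unicode(me, (unsigned char)c);
}

void char_stream_init(CharStream *me)
{
    assert(me != NULL);
    me->put_character = collect_character;
    me->put_unicode   = collect_unicode;
    me->count = 0;
}
```

A caller would initialise the stream once and then feed it through either
entry point; both end up in the same sequence of character codes.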

Likewise, I think it would be useful to have dynamic array handling for
conversions. This could be done by adding some conversion mechanisms to
HTChunk and making its internal representation Unicode.
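
As a rough idea of what such a buffer could look like, the sketch below is a
growable array whose elements are character codes rather than bytes. UniChunk
and its functions are hypothetical names; this is not the existing HTChunk
interface, only an assumption about the shape a Unicode-based variant might
take.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable buffer of character codes (not the HTChunk API). */
typedef struct {
    int   *data;        /* character codes, one int each */
    size_t size;        /* number of codes stored */
    size_t allocated;   /* capacity in codes */
} UniChunk;

void unichunk_init(UniChunk *ch)
{
    assert(ch != NULL);
    ch->data = NULL;
    ch->size = 0;
    ch->allocated = 0;
}

/* Append one code, doubling the capacity when full. Returns 0 or -1. */
int unichunk_put(UniChunk *ch, int unicode)
{
    if (ch->size == ch->allocated) {
        size_t cap = ch->allocated ? ch->allocated * 2 : 16;
        int *p = realloc(ch->data, cap * sizeof *p);
        if (p == NULL)
            return -1;                  /* old buffer is still valid */
        ch->data = p;
        ch->allocated = cap;
    }
    ch->data[ch->size++] = unicode;
    return 0;
}

void unichunk_free(UniChunk *ch)
{
    free(ch->data);
    unichunk_init(ch);
}
```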

>In analogy to the existing converters for other dimensions (MIME type,
>encodings), CharsetConverters should be added.  Actually they should
>probably come in pairs (as for encoding/decoding), one HTStream
>which feeds a HTCharacterStream (converting stream-of-bytes to
>stream-of-(unicode-)characters), and one HTCharacterStream which feeds
>a HTStream (for the opposite direction).  The Library should provide
>a trivial default implementation of each, which need not be aware of
>any specific charset (except probably the ASCII subset), but should
>by default provide completely reversible transformations; i.e.
>
>  stream-of-octets  -->  stream-of-characters  -->  stream-of-octets

There is a simple version of such converter streams, for handling CRLF in
the MIME world, in

	http://www.w3.org/Library/src/HTNetTxt.c
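
For readers without the source at hand, the idea of a paired CRLF converter
can be sketched as below. This is not a copy of HTNetTxt.c; net_to_text and
text_to_net are made-up names, and the sketch works on whole buffers, whereas
a real stream would have to remember a trailing CR across buffer boundaries.

```c
#include <assert.h>
#include <stddef.h>

/* Network to local: collapse each CRLF to LF; a lone CR is passed through.
 * out must hold at least n bytes. Returns the number of bytes written. */
size_t net_to_text(const char *in, size_t n, char *out)
{
    size_t i, o = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (in[i] == '\r' && i + 1 < n && in[i + 1] == '\n')
            continue;   /* drop the CR; the LF is copied on the next pass */
        out[o++] = in[i];
    }
    return o;
}

/* Local to network: expand each LF to CRLF.
 * out must hold at least 2*n bytes. Returns the number of bytes written. */
size_t text_to_net(const char *in, size_t n, char *out)
{
    size_t i, o = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (in[i] == '\n')
            out[o++] = '\r';
        out[o++] = in[i];
    }
    return o;
}
```

Note that this pair is only reversible for input whose line ends are all
CRLF; a bare LF on the network side would come back as CRLF.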

>should give exactly the original octets.  This could be achieved
>by (1) keeping 7-bit characters as they are (just cast char to int),
>(2) mapping all 8-bit characters to a private zone in the unicode
>character space.  (e.g. unicode(c) = (0xF000|(unsigned char)(c)))
>A slightly more complicated converter might be aware of EUC or
>ISO-2022 character encoding rules (but not specific actual to-Unicode
>character mappings) and "protect" characters from misinterpretation
>by the Unicode-handling SGML object the same way.
>
>The user application could then register charset converters for
>specific charsets which would override the default behavior, doing
>the correct octet -> unicode mapping (and, for the reverse direction,
>whatever is more appropriate for the application - for example
>replacing unrecognized unicodes by '?' etc.)
>
>The biggest change would have to be in the interface to the
>HTStructured object.  Either HTStructured's methods should be
>changed to take unicodes instead of chars, or, for reuse of existing
>implementations, there could be a HTCharacterStructured -> 
>HTOldStructured back-converter.  The unicode-handling SGML object
>would not actually have to feed "wide character" strings to the
>next stage (HTStructured), it could do a conversion to UTF-8 on
>some or all of its output.

I think we can do whatever we like :) - I recently changed the structured
stream to not throw away information that is not listed in the DTD.
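
The reversible default mapping from the quoted text (7-bit characters
unchanged, 8-bit characters moved to 0xF000|c) and the suggested UTF-8 output
step can be sketched as follows. The function names are made up for
illustration, and returning -1 for codes outside the mapping is my own
assumption; the proposal leaves that choice to the application (for example
substituting '?').

```c
#include <assert.h>
#include <stddef.h>

#define PRIVATE_BASE 0xF000   /* private zone, as in the quoted text */

/* Octet to character code: 7-bit kept as is, 8-bit moved to the zone. */
int octet_to_unicode(char c)
{
    unsigned char u = (unsigned char)c;
    return u < 0x80 ? (int)u : (PRIVATE_BASE | u);
}

/* Inverse mapping. Returns 0..255, or -1 for a code that
 * octet_to_unicode can never produce (assumed error convention). */
int unicode_to_octet(int code)
{
    if (code >= 0 && code < 0x80)
        return code;
    if (code >= (PRIVATE_BASE | 0x80) && code <= (PRIVATE_BASE | 0xFF))
        return code & 0xFF;
    return -1;
}

/* Encode one code as UTF-8 into out[4]. Returns the byte count,
 * or 0 for values that are not encodable (negative, surrogates,
 * or above 0x10FFFF). */
size_t utf8_encode(int code, unsigned char out[4])
{
    assert(out != NULL);
    if (code < 0 || code > 0x10FFFF || (code >= 0xD800 && code <= 0xDFFF))
        return 0;
    if (code < 0x80) {
        out[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800) {
        out[0] = (unsigned char)(0xC0 | (code >> 6));
        out[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    if (code < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (code >> 12));
        out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (code & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (code >> 18));
    out[1] = (unsigned char)(0x80 | ((code >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (code & 0x3F));
    return 4;
}
```

With this default, octets -> characters -> octets gives back exactly the
original bytes, and the protected 8-bit values land in the range
0xF080..0xF0FF, which lies inside the Unicode private use area.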

What about using an already existing SGML parser?

Thanks for your comments - they are most valuable. Would you be interested
in helping to look into this work?

Henrik
--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk
Received on Friday, 5 February 1999 14:13:40 UTC