- From: Klaus Weide <kweide@tezcat.com>
- Date: Fri, 29 Jan 1999 03:09:17 -0600 (CST)
- To: Henrik Frystyk Nielsen <frystyk@w3.org>
- cc: www-lib@w3.org
On Mon, 25 Jan 1999, Henrik Frystyk Nielsen wrote:

> Here is one for all of you - the current SGML/HTML parser is 8bit only.
> Anyone interested in expanding it to support larger charsets?

I am in part responsible for the mess SGML.c (once based on libwww 2.x) has become in the code for Lynx (<http://sol.slcc.edu/lynx/current/lynx2-8-2/WWW/Library/Implementation/SGML.c>), to add support for character set translation. Based on that experience, I wouldn't recommend that the Library follow the same path, i.e. add more special hacks for specific charsets to SGML.c.

I think nowadays SGML.c should be based on Unicode or ISO 10646. The SGML parser object should provide interfaces for feeding it "characters" instead of bytes, where characters would be expressed as 2- or 4-byte codes. That means it couldn't be just a regular HTStream - HTStream should be "subclassed" to a HTCharacterStream with additional methods, at least an additional

    int (*put_unicode)(HTStream * me, int unicode);

method. (I assume HTCharacterStream should also provide the "old" char-based entry points for fallback, although that may not even be necessary.) The new SGML object would also provide its output as "characters" (unicode values), not "bytes".

In analogy to the existing converters for other dimensions (MIME type, encodings), CharsetConverters should be added. Actually they should probably come in pairs (as for encoding/decoding): one HTStream which feeds a HTCharacterStream (converting stream-of-bytes to stream-of-(unicode-)characters), and one HTCharacterStream which feeds a HTStream (for the opposite direction).

The Library should provide a trivial default implementation of each, which need not be aware of any specific charset (except probably the ASCII subset), but should by default provide completely reversible transformations; i.e. stream-of-octets --> stream-of-characters --> stream-of-octets should give exactly the original octets.

This could be achieved by
  (1) keeping 7-bit characters as they are (just cast char to int),
  (2) mapping all 8-bit characters to a private zone in the unicode character space, e.g.
        unicode(c) = (0xF000 | (unsigned char)(c))

A slightly more complicated converter might be aware of EUC or ISO-2022 character encoding rules (but not of the actual to-Unicode mappings for specific charsets) and "protect" characters from misinterpretation by the Unicode-handling SGML object the same way.

The user application could then register charset converters for specific charsets which would override the default behavior, doing the correct octet -> unicode mapping (and, for the reverse direction, whatever is more appropriate for the application - for example replacing unrecognized unicodes by '?' etc.)

The biggest change would have to be in the interface to the HTStructured object. Either HTStructured's methods should be changed to take unicodes instead of chars, or, for reuse of existing implementations, there could be a HTCharacterStructured -> HTOldStructured back-converter. The unicode-handling SGML object would not actually have to feed "wide character" strings to the next stage (HTStructured); it could do a conversion to UTF-8 on some or all of its output.

Does this make sense to anyone? :)

   Klaus
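
For illustration, the default reversible mapping proposed above (7-bit characters unchanged, 8-bit characters moved to 0xF000|c) might look roughly like the following in C. This is only a sketch: the function names octet_to_unicode, unicode_to_octet and check_default_roundtrip are made up here and are not part of the Library, and returning -1 for values outside the default mapping is an assumption about how the reverse direction could signal "not mine".

```c
#include <assert.h>

/* Hypothetical helper, not a libwww function.
 * Octet -> code value: 7-bit characters are kept as they are,
 * 8-bit characters go to the private zone 0xF080..0xF0FF. */
static int octet_to_unicode(char c)
{
    unsigned char u = (unsigned char)c;
    return (u < 0x80) ? (int)u : (0xF000 | u);
}

/* Hypothetical helper, not a libwww function.
 * Code value -> octet. Returns -1 (an assumed convention) for any
 * value that the default forward mapping could not have produced. */
static int unicode_to_octet(int unicode)
{
    if (unicode >= 0 && unicode < 0x80)
        return unicode;
    if (unicode >= 0xF080 && unicode <= 0xF0FF)
        return unicode & 0xFF;
    return -1;
}

/* Checks that octets -> characters -> octets gives back every octet. */
static void check_default_roundtrip(void)
{
    int c;
    for (c = 0; c < 256; c++)
        assert(unicode_to_octet(octet_to_unicode((char)c)) == c);
}
```

A charset-aware converter registered by the application would replace octet_to_unicode with a real table lookup, and its reverse direction could substitute '?' where this sketch returns -1.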
Received on Friday, 29 January 1999 04:09:21 UTC