- From: Henrik Frystyk Nielsen <frystyk@w3.org>
- Date: Fri, 05 Feb 1999 14:13:35 -0500
- To: Klaus Weide <kweide@tezcat.com>
- Cc: www-lib@w3.org
At 03:09 1/29/99 -0600, Klaus Weide wrote:
>On Mon, 25 Jan 1999, Henrik Frystyk Nielsen wrote:
>
>> Here is one for all of you - the current SGML/HTML parser is 8bit only.
>> Anyone interested in expanding it to support larger charsets?
>
>I am in part responsible for the mess SGML.c (once based on libwww
>2.x) has become in the code for Lynx (<http://sol.slcc.edu/lynx/current/
>lynx2-8-2/WWW/Library/Implementation/SGML.c>), to add support for
>character set translation. Based on that experience, I wouldn't recommend
>the Library to follow the same path, i.e. add more special hacks for
>specific charsets to SGML.c.

I fully agree that hacking any more on the current SGML module is not a
good idea.

>I think nowadays SGML.c should become based on Unicode or ISO 10646.
>The SGML parser object should provide interfaces for feeding it
>"characters" instead of bytes, where characters would be expressed
>as 2- or 4-byte codes. That means it couldn't be just a regular
>HTStream - HTStream should be "subclassed" to a HTCharacterStream
>with additional methods, at least an additional
>
>    int (*put_unicode)(HTStream * me, int unicode);
>
>method. (I assume HTCharacterStream should also provide the "old"
>char-based entry points for fallback, although that may not even
>be necessary.) The new SGML object would also provide its output
>as "characters" (unicode values) not "bytes".

Yep - although don't you think that the HTCharacterStream is really just a
new version of the HTStructured stream, so that we have

    stream-of-octets --> stream-of-structured-characters --> stream-of-octets

That is, the structured parser must not throw anything away. Likewise, I
think it would be useful to have a dynamic array handling of conversions.
This could be added using some conversion mechanisms to HTChunk and then
have the internal representation be unicode.

>In analogy to the existing converters for other dimensions (MIME type,
>encodings), CharsetConverters should be added. Actually they should
>probably come in pairs (as for encoding/decoding), one HTStream
>which feeds a HTCharacterStream (converting stream-of-bytes to
>stream-of-(unicode-)characters), and one HTCharacterStream which feeds
>a HTStream (for the opposite direction). The Library should provide
>a trivial default implementation of each, which need not be aware of
>any specific charset (except probably the ASCII subset), but should
>by default provide completely reversible transformations; i.e.
>
>    stream-of-octets --> stream-of-characters --> stream-of-octets

There is a simple version of such converter streams for handling CRLF in
the MIME world

    http://www.w3.org/Library/src/HTNetTxt.c

>should give exactly the original octets. This could be achieved
>by (1) keeping 7-bit characters as they are (just cast char to int),
>(2) mapping all 8-bit characters to a private zone in the unicode
>character space. (e.g. unicode(c) = (0xF000|(unsigned char)(c))
>A slightly more complicated converter might be aware of EUC or
>ISO-2022 character encoding rules (but not specific actual to-Unicode
>character mappings) and "protect" characters from misinterpretation
>by the Unicode-handling SGML object the same way.
>
>The user application could then register charset converters for
>specific charsets which would override the default behavior, doing
>the correct octet -> unicode mapping (and, for the reverse direction,
>whatever is more appropriate for the application - for example
>replacing unrecognized unicodes by '?' etc.)
>
>The biggest change would have to be in the interface to the
>HTStructured object. Either HTStructured's methods should be
>changed to take unicodes instead of chars, or, for reuse of existing
>implementations, there could be a HTCharacterStructured ->
>HTOldStructured back-converter. The unicode-handling SGML object
>would not actually have to feed "wide character" strings to the
>next stage (HTStructured), it could do a conversion to UTF-8 on
>some or all of its output.

I think we can do whatever we like :) - I recently changed the structured
stream to not throw away information that is not listed in the DTD. What
about using an already existing SGML parser?

Thanks for your comments - they are most valuable. Would you be interested
in helping to look into this work?

Henrik

--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk
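
For illustration only, the reversible default conversion Klaus describes above (7-bit characters kept as they are, 8-bit characters moved into a private zone with 0xF000 | c) might be sketched in C roughly as follows. The function names are hypothetical and are not part of the Library; the choice to return -1 for code points outside the default mapping is also an assumption, since the message leaves that to the application.

```c
#include <assert.h>

/* Hypothetical helper names for illustration; not part of libwww. */

/* Octet -> character code: 7-bit values are kept as they are, 8-bit
 * values are moved into a private zone (0xF080..0xF0FF), following
 * the unicode(c) = 0xF000 | (unsigned char)(c) idea quoted above. */
static int octet_to_unicode(unsigned char c)
{
    return (c < 0x80) ? (int) c : (0xF000 | (int) c);
}

/* Character code -> octet: the inverse of the mapping above.
 * Returns -1 for any code that the default mapping cannot have
 * produced (an assumption; an application might emit '?' instead). */
static int unicode_to_octet(int u)
{
    if (u >= 0x00 && u <= 0x7F)
        return u;
    if (u >= 0xF080 && u <= 0xF0FF)
        return u & 0xFF;
    return -1;
}

/* Checks that every one of the 256 octet values survives
 * stream-of-octets --> stream-of-characters --> stream-of-octets.
 * Returns 1 on success. */
static int default_mapping_round_trips(void)
{
    int c;
    for (c = 0; c < 256; c++) {
        int u = octet_to_unicode((unsigned char) c);
        assert(unicode_to_octet(u) == c);
    }
    return 1;
}
```

A charset-specific converter registered by the application would replace octet_to_unicode() with a real octet -> unicode table; the default above only guarantees that nothing is lost on the way through.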
Received on Friday, 5 February 1999 14:13:40 UTC