- From: Rick Kwan <kenobi@coruscant.lightsaber.com>
- Date: Mon, 1 Feb 1999 21:06:07 -0500 (EST)
- To: www-lib@w3.org

At the bottom of his message, Klaus asks:

> Does this make sense to anyone? :)
>
> Klaus

I think I caught most of it. But let me see if I understand a couple
of things.

1. This sounds like HTML files must be encoded in UTF-8.

2. A lot of currently single-byte routines need to be converted to
   handle 16-bit or 32-bit Unicode characters.

My personal comments on these matters:

1. UTF-8 is nice, but most Asian HTML files will be written in
   national codesets, e.g., KSC-5601, JIS or shift-JIS, Big-5 or
   EUC-CNS (euc-tw), or GB. These cannot be ignored in preference to
   UTF-8 because most authors won't have UTF-8 tools.

2. I am ambivalent about the development and performance tradeoffs
   between single-width Unicode vs multi-byte codesets. I agree that
   you don't want to bloat statically linked code with multiple
   codesets; this results in re-compilation and re-link each time a
   new language is supported. A dynamically loadable solution is
   preferred.

3. This may be obvious to many: as far as linemode browser is
   concerned, there is work to be done both in SGML.c and in places
   like HTBrowse.c, where text presentation takes place. Visual width
   and text string width are not the same thing.

I've been silent about this until now because, having done some
Unicode and multi-byte work, the stuff scares me to death! But, yes,
I agree that multi-lingual support would be a nice thing to see happen.

--Rick Kwan

> On Mon, 25 Jan 1999, Henrik Frystyk Nielsen wrote:
>
> > Here is one for all of you - the current SGML/HTML parser is 8bit only.
> > Anyone interested in expanding it to support larger charsets?
>
> I am in part responsible for the mess SGML.c (once based on libwww
> 2.x) has become in the code for Lynx
> (<http://sol.slcc.edu/lynx/current/lynx2-8-2/WWW/Library/Implementation/SGML.c>),
> to add support for character set translation. Based on that
> experience, I wouldn't recommend the Library to follow the same path,
> i.e. add more special hacks for specific charsets to SGML.c.
>
> I think nowadays SGML.c should become based on Unicode or ISO 10646.
> The SGML parser object should provide interfaces for feeding it
> "characters" instead of bytes, where characters would be expressed
> as 2- or 4-byte codes. That means it couldn't be just a regular
> HTStream - HTStream should be "subclassed" to a HTCharacterStream
> with additional methods, at least an additional
>
>     int (*put_unicode)(HTStream * me, int unicode);
>
> method. (I assume HTCharacterStream should also provide the "old"
> char-based entry points for fallback, although that may not even
> be necessary.) The new SGML object would also provide its output
> as "characters" (unicode values) not "bytes".
>
> In analogy to the existing converters for other dimensions (MIME type,
> encodings), CharsetConverters should be added. Actually they should
> probably come in pairs (as for encoding/decoding), one HTStream
> which feeds a HTCharacterStream (converting stream-of-bytes to
> stream-of-(unicode-)characters), and one HTCharacterStream which feeds
> a HTStream (for the opposite direction). The Library should provide
> a trivial default implementation of each, which need not be aware of
> any specific charset (except probably the ASCII subset), but should
> by default provide completely reversible transformations; i.e.
>
>     stream-of-octets --> stream-of-characters --> stream-of-octets
>
> should give exactly the original octets. This could be achieved
> by (1) keeping 7-bit characters as they are (just cast char to int),
> (2) mapping all 8-bit characters to a private zone in the unicode
> character space (e.g. unicode(c) = (0xF000|(unsigned char)(c))).
> A slightly more complicated converter might be aware of EUC or
> ISO-2022 character encoding rules (but not specific actual to-Unicode
> character mappings) and "protect" characters from misinterpretation
> by the Unicode-handling SGML object the same way.
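
[A minimal sketch of the reversible default mapping Klaus describes,
assuming the 0xF000 private-zone offset from his example. The function
names octet_to_unicode, unicode_to_octet and round_trip_ok are
hypothetical and are not part of libwww; a real converter would sit
behind the proposed put_unicode entry point rather than be called
directly.]

```c
#include <stddef.h>

/* Hypothetical helper (not libwww API): map one input octet to a
   code value, following the private-zone idea quoted above. */
int octet_to_unicode(unsigned char c)
{
    /* (1) 7-bit characters are kept as they are. */
    if (c < 0x80)
        return (int)c;
    /* (2) Octets with the high bit set land in 0xF080..0xF0FF,
       which lies inside the Unicode Private Use Area. */
    return 0xF000 | c;
}

/* Hypothetical inverse: returns the original octet, or -1 if the
   code value was not produced by octet_to_unicode(). */
int unicode_to_octet(int u)
{
    if (u >= 0 && u < 0x80)
        return u;
    if (u >= 0xF080 && u <= 0xF0FF)
        return u & 0xFF;
    return -1;
}

/* Returns 1 if every octet in buf survives
   octets -> characters -> octets unchanged, else 0. */
int round_trip_ok(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (unicode_to_octet(octet_to_unicode(buf[i])) != (int)buf[i])
            return 0;
    }
    return 1;
}
```

[Under these assumptions the octet 0xE9 becomes the code value 0xF0E9
and maps back to 0xE9, so the parser never mistakes an untranslated
8-bit octet for a real Unicode character.]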
>
> The user application could then register charset converters for
> specific charsets which would override the default behavior, doing
> the correct octet -> unicode mapping (and, for the reverse direction,
> whatever is more appropriate for the application - for example
> replacing unrecognized unicodes by '?' etc.)
>
> The biggest change would have to be in the interface to the
> HTStructured object. Either HTStructured's methods should be
> changed to take unicodes instead of chars, or, for reuse of existing
> implementations, there could be a HTCharacterStructured ->
> HTOldStructured back-converter. The unicode-handling SGML object
> would not actually have to feed "wide character" strings to the
> next stage (HTStructured), it could do a conversion to UTF-8 on
> some or all of its output.
>
> Does this make sense to anyone? :)
>
> Klaus

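
[A minimal sketch of the "conversion to UTF-8" step mentioned at the
end of Klaus's message, showing how one code value could be turned into
octets before being handed to a char-based HTStructured. The name
utf8_encode is hypothetical and is not a libwww function. The sketch
assumes the present-day UTF-8 range (up to U+10FFFF, at most 4 octets)
and rejects surrogate values.]

```c
#include <stddef.h>

/* Hypothetical helper (not libwww API): encode one code value as
   UTF-8. Writes 1 to 4 octets into out and returns how many were
   written, or 0 if the value cannot be encoded. */
size_t utf8_encode(unsigned long u, unsigned char out[4])
{
    if (u < 0x80) {                       /* 1 octet: 0xxxxxxx */
        out[0] = (unsigned char)u;
        return 1;
    }
    if (u < 0x800) {                      /* 2 octets: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (u >> 6));
        out[1] = (unsigned char)(0x80 | (u & 0x3F));
        return 2;
    }
    if (u < 0x10000) {                    /* 3 octets */
        if (u >= 0xD800 && u <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (u >> 12));
        out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (u & 0x3F));
        return 3;
    }
    if (u <= 0x10FFFF) {                  /* 4 octets */
        out[0] = (unsigned char)(0xF0 | (u >> 18));
        out[1] = (unsigned char)(0x80 | ((u >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (u & 0x3F));
        return 4;
    }
    return 0;                             /* out of range */
}
```

[For example, the code value 0xE9 encodes to the two octets C3 A9, and
0x20AC encodes to E2 82 AC. A stage built this way could keep the
existing char-based interface downstream while the parser itself works
on code values.]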
Received on Monday, 1 February 1999 21:38:57 UTC