- From: Raffaele Sena <raff@nuvomedia.com>
- Date: Thu, 28 Jan 1999 10:52:15 -0800 (PST)
- To: Henrik Frystyk Nielsen <frystyk@w3.org>
- cc: "Sangyeob. Lee" <sangyeob@lgtel.co.kr>, www-lib@w3.org
On Mon, 25 Jan 1999, Henrik Frystyk Nielsen wrote:

> Here is one for all of you - the current SGML/HTML parser is 8bit only.
> Anyone interested in expanding it to support larger charsets?
>
> Henrik

My previous product was (well, it still is; it's just me that moved to a
different company :) an Internet set-top box based on the now "venerable"
libwww 2.17. The browser now supports Korean, Chinese GB and BIG5, and
Japanese JIS, SJIS and EUC (all converted into EUC for the specific
language).

If I remember correctly, the places I touched the least were the SGML and
HTML parsers. The rendering engine took care of displaying every 2 bytes
as a single character.

A couple of changes I had to make:

* The code for had to be changed to a different value, since it fell
  inside the range used by the EUC characters (0xA1-0xA1 to 0xFE-0xFE).

* The code in GridText.c (which I used as the front end to the graphical
  rendering engine) had to be made a little more complex in order to
  break words composed of ideograms when they became too long. There
  seems to be no separation between ideograms, so any place is good to
  break them, except of course in the middle of a single 2-byte
  character.

I don't remember anything else off the top of my head. If I can find some
time this weekend, I'll try the sample page from the original message and
see if there is a "quick fix" or some simple code that would make the
parser support double-byte character sets.

-- Raffaele

---------------------------------------------
Raffaele Sena
Senior Software Engineer ( "THE" Linux Guy :)
NuvoMedia, Inc.
310 Villa Street
Mt. View, CA 94041
Main:   +1.650.314.1200
Direct: +1.650.314.1255
Fax:    +1.650.314.1201
mailto:raff@nuvomedia.com
http://www.rocket-ebook.com
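
The line-breaking rule described for GridText.c in the message above can be sketched roughly as follows. This is not the original libwww or set-top box code; the names `is_euc_byte` and `find_break` are hypothetical, and the sketch assumes plain two-byte EUC, where both bytes of a character lie in 0xA1-0xFE. It ignores the EUC-JP sequences introduced by 0x8E and 0x8F, and it measures the line in bytes rather than in pixels.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: in two-byte EUC, both bytes of a character are in 0xA1..0xFE. */
static int is_euc_byte(unsigned char c)
{
    return c >= 0xA1 && c <= 0xFE;
}

/*
 * Hypothetical helper: return how many bytes of text[0..len) should go on
 * a line that holds at most max_bytes bytes.
 *
 * A break is allowed after an ASCII space and immediately before or after
 * any double-byte character. If no such position fits, the line is cut at
 * the last character boundary that fits, so a two-byte character is never
 * split in the middle.
 */
size_t find_break(const unsigned char *text, size_t len, size_t max_bytes)
{
    size_t i = 0;          /* current character boundary */
    size_t last_break = 0; /* last boundary where a break is allowed */

    assert(max_bytes >= 2); /* room for at least one double-byte character */

    if (len <= max_bytes)
        return len;         /* everything fits, no break needed */

    while (i < len) {
        int wide = is_euc_byte(text[i]) && i + 1 < len
                   && is_euc_byte(text[i + 1]);
        size_t next = i + (wide ? 2 : 1);

        if (wide && i > 0)
            last_break = i;    /* may break just before an ideogram */
        if (next > max_bytes)
            break;             /* next character would overflow the line */
        if (wide || text[i] == ' ')
            last_break = next; /* may break after an ideogram or a space */
        i = next;
    }

    /* Fall back to a forced break at the last boundary that fits. */
    return last_break > 0 ? last_break : i;
}
```

A caller would invoke `find_break` repeatedly, emitting the returned number of bytes as one line and advancing the pointer by that amount. Walking forward from the start of the text, rather than backing up from the limit, keeps the scan aligned on character boundaries, which is what prevents a cut between the two bytes of one character.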
Received on Thursday, 28 January 1999 13:52:39 UTC