- From: Raffaele Sena <raff@nuvomedia.com>
- Date: Thu, 28 Jan 1999 10:52:15 -0800 (PST)
- To: Henrik Frystyk Nielsen <frystyk@w3.org>
- cc: "Sangyeob. Lee" <sangyeob@lgtel.co.kr>, www-lib@w3.org
On Mon, 25 Jan 1999, Henrik Frystyk Nielsen wrote:
>
> Here is one for all of you - the current SGML/HTML parser is 8bit only.
> Anyone interested in expanding it to support larger charsets?
>
> Henrik
>
My previous product was (well, it still is; I am just the one
who moved to a different company :) an Internet set-top box based
on the now "venerable" libwww 2.17.
The browser now supports Korean, Chinese GB and BIG5
and Japanese JIS, SJIS and EUC (all converted into EUC for
the specific language).
If I remember correctly, the places I touched the least were the
SGML and HTML parsers. The rendering engine took care of
displaying each 2-byte sequence as a single character.
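
To illustrate the "2 bytes as a single character" idea, here is a
minimal sketch in C. It is not the set-top box code: the function
names are hypothetical, and it assumes the plain EUC layout in which
both bytes of a 2-byte character lie in 0xA1..0xFE and everything
else is a 1-byte character (the single-shift sequences of EUC-JP
are deliberately ignored).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if b can be one byte of a 2-byte EUC
 * character (assumed range 0xA1..0xFE for both bytes). */
static int euc_is_2byte_part(unsigned char b)
{
    return b >= 0xA1 && b <= 0xFE;
}

/* Length in bytes of the character starting at s: 2 for a complete
 * EUC pair, 1 for anything else (ASCII, or a stray high byte). */
static size_t euc_char_len(const unsigned char *s)
{
    assert(s != NULL);
    if (euc_is_2byte_part(s[0]) && euc_is_2byte_part(s[1]))
        return 2;
    return 1;
}

/* Count the characters (not bytes) in a NUL-terminated EUC string. */
static size_t euc_char_count(const char *text)
{
    const unsigned char *s = (const unsigned char *)text;
    size_t count = 0;

    assert(text != NULL);
    while (*s != '\0') {
        s += euc_char_len(s);   /* step over 1 or 2 bytes */
        count++;
    }
    return count;
}
```

A renderer built this way would call something like euc_char_len()
at each position and hand 1 or 2 bytes to the font code, so the
parser itself can stay byte-oriented.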
A couple of changes I had to make:
* The code for had to be changed to a different
value, since it ended up in the range used by the EUC
characters (0xA1-0xA1 to 0xFE-0xFE).
* The code in GridText.c (which I used as the front end to
the graphical engine) had to be made a little more complex
in order to break words composed of ideograms when they
became too long (since there seems to be no separation
between them, any place is good to break them, except of
course in the middle of a single 2-byte character).
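
The line-breaking rule in the second item can be sketched roughly
as follows. This is only an illustration under stated assumptions,
not the actual GridText.c change: euc_break_offset() is a
hypothetical name, the width limit is counted in bytes, and both
bytes of a 2-byte character are assumed to lie in 0xA1..0xFE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if b can be one byte of a 2-byte EUC
 * character (assumed range 0xA1..0xFE for both bytes). */
static int euc_is_2byte_part(unsigned char b)
{
    return b >= 0xA1 && b <= 0xFE;
}

/* Return how many bytes of text fit in max_bytes without cutting a
 * 2-byte character in half. The scan starts at the beginning of the
 * run because first and second bytes share the same range, so a
 * single byte does not say which half of a pair it is. */
static size_t euc_break_offset(const char *text, size_t max_bytes)
{
    const unsigned char *s = (const unsigned char *)text;
    size_t pos = 0;

    assert(text != NULL);
    while (s[pos] != '\0') {
        size_t len = 1;

        if (euc_is_2byte_part(s[pos]) && euc_is_2byte_part(s[pos + 1]))
            len = 2;
        if (pos + len > max_bytes)
            break;              /* next character would not fit */
        pos += len;
    }
    return pos;
}
```

For a run of three 2-byte characters and a limit of 3 bytes, this
returns 2, so the break falls after the first character instead of
in the middle of the second. A real layout routine would still
prefer spaces where the text has them and fall back to this only
for over-long runs.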
I don't remember anything else off the top of my head. If I can
find some time this weekend, I'll try the sample page from the
original message and see if there is a "quick fix" or
some simple code that would make the parser support double-byte
character sets.
-- Raffaele
---------------------------------------------
Raffaele Sena
Senior Software Engineer ( "THE" Linux Guy :)
NuvoMedia, Inc.
310 Villa Street
Mt. View, CA 94041
Main: +1.650.314.1200
Direct: +1.650.314.1255
Fax: +1.650.314.1201
mailto:raff@nuvomedia.com
http://www.rocket-ebook.com
Received on Thursday, 28 January 1999 13:52:39 UTC