
Re: How to Convert Korean language from HTML to Text

From: Raffaele Sena <raff@nuvomedia.com>
Date: Thu, 28 Jan 1999 10:52:15 -0800 (PST)
To: Henrik Frystyk Nielsen <frystyk@w3.org>
cc: "Sangyeob. Lee" <sangyeob@lgtel.co.kr>, www-lib@w3.org
Message-ID: <Pine.LNX.4.04.9901281035070.20894-100000@localhost>
On Mon, 25 Jan 1999, Henrik Frystyk Nielsen wrote:

> 
> Here is one for all of you - the current SGML/HTML parser is 8bit only.
> Anyone interested in expanding it to support larger charsets?
> 
> Henrik
> 
	My previous product was (well, it still is. It's just me that
	moved to a different company :) an Internet set-top box based
	on the now "venerable" libwww 2.17.

	The browser now supports Korean, Chinese GB and BIG5
	and Japanese JIS, SJIS and EUC (all converted into EUC for
	the specific language).

	If I remember correctly, the places I touched the least were the
	SGML and HTML parsers. The rendering engine took care of
	displaying every 2 bytes as a single character.
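
	To make the "every 2 bytes as a single character" idea concrete,
	here is a minimal sketch in C. It is not the code from the
	set-top box; the function names are made up for illustration,
	and it assumes plain two-byte EUC where both bytes fall in
	0xA1-0xFE (it ignores the EUC-JP single-shift prefixes 0x8E and
	0x8F).

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if b lies in the EUC double-byte range 0xA1..0xFE. */
int euc_is_mb_byte(unsigned char b)
{
    return b >= 0xA1 && b <= 0xFE;
}

/*
 * Length in bytes (1 or 2) of the character starting at s[0].
 * n is the number of bytes remaining and must be at least 1.
 * A lone high byte at the end of the buffer is treated as 1 byte.
 */
size_t euc_char_len(const unsigned char *s, size_t n)
{
    assert(s != NULL && n > 0);
    if (n >= 2 && euc_is_mb_byte(s[0]) && euc_is_mb_byte(s[1]))
        return 2;
    return 1;
}

/* Number of displayed characters in the n bytes starting at s. */
size_t euc_count_chars(const unsigned char *s, size_t n)
{
    size_t i = 0, count = 0;

    while (i < n) {
        i += euc_char_len(s + i, n - i);  /* step over 1 or 2 bytes */
        count++;
    }
    return count;
}
```

	A renderer built this way can leave the parser alone, as long as
	the parser passes bytes with the high bit set through untouched.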

	A couple of changes I had to make: 

          * The code for &nbsp; had to be changed to a different
            value, since it was falling in the range for the EUC
            characters (0xA1-0xA1 to 0xFE-0xFE).

          * The code in GridText.c (which I used as front-end to
            the graphical front-end engine) had to be made a little
            more complex in order to break words composed of ideograms
            when they became too long (since there seems to be no
            separation between them, any place is good to break
            them, except of course in the middle of a single 2-byte
            character).
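
	The line-breaking rule in the second item can be sketched
	roughly as below. This is an illustration written for this
	message, not the GridText.c change itself: the names
	is_euc_pair and find_line_break are hypothetical, the line
	width is measured in bytes for simplicity, and a double-byte
	character is assumed to be any two consecutive bytes in
	0xA1-0xFE.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if s[0], s[1] look like one EUC double-byte character. */
int is_euc_pair(const unsigned char *s, size_t n)
{
    return n >= 2 && s[0] >= 0xA1 && s[0] <= 0xFE
                  && s[1] >= 0xA1 && s[1] <= 0xFE;
}

/*
 * Return how many of the n bytes at s go on a line of at most
 * max_bytes bytes. A break is allowed after a space and before or
 * after any double-byte character, never between its two bytes.
 * If no such position exists, break at the last character that fits.
 */
size_t find_line_break(const unsigned char *s, size_t n, size_t max_bytes)
{
    size_t i = 0;           /* bytes accepted so far */
    size_t last_break = 0;  /* most recent permitted break position */

    assert(s != NULL && max_bytes >= 2);

    if (n <= max_bytes)
        return n;           /* the whole text fits */

    while (i < n) {
        size_t len = is_euc_pair(s + i, n - i) ? 2 : 1;

        if (i + len > max_bytes)
            break;          /* the next character would overflow */
        i += len;
        if (len == 2 || s[i - 1] == ' ' || is_euc_pair(s + i, n - i))
            last_break = i;
    }
    return last_break > 0 ? last_break : i;
}
```

	A caller would emit the returned number of bytes, start a new
	line, and call the function again on the remainder. Requiring
	max_bytes >= 2 guarantees that at least one character is
	consumed on every call.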

	I don't remember anything else off the top of my head. If I can
	find some time this weekend, I'll try the sample page in the
	original message, and I'll see if there is a "quick fix" or
	some simple code that would make the parser support double-byte
	character sets.

-- Raffaele

---------------------------------------------
Raffaele Sena
Senior Software Engineer ( "THE" Linux Guy :)
NuvoMedia, Inc.
310 Villa Street
Mt. View, CA 94041
Main:   +1.650.314.1200
Direct: +1.650.314.1255
Fax:    +1.650.314.1201

mailto:raff@nuvomedia.com
http://www.rocket-ebook.com
Received on Thursday, 28 January 1999 13:52:39 GMT
