
Re: Unicode conversion using tudyHTML

From: Yvon Thoraval <yvon_thoraval@mac.com>
Date: Fri, 22 Dec 2006 07:00:21 +0100
Message-Id: <0A47F5CD-3A91-4D5E-A831-48AE80DCA4B6@mac.com>
Cc: html-tidy@w3.org
To: Tania Estébanez <fair_ithilien@yahoo.es>

On Thursday 21 Dec. 06 19:06 at 22:07, Tania Estébanez wrote:

> Can you tell us about any program  that does that?

Not a program, rather a language: Ruby.

As you certainly know, Ruby comes from Japan, where they have to deal
directly with encoding problems...

Transcoding is one thing;

guessing what the TRUE encoding is, is another.

The first task (transcoding) is really easy in Ruby (the same
applies to Perl, which I don't know as deeply as Ruby).

Now suppose you have the "code" (i.e. the true encoding of a given file);
transcoding is as easy as:

require 'iconv' # standard library up to Ruby 1.9, a separate gem since Ruby 2.0
cd = Iconv.new("UTF-8", code) # assumes you want to transcode from <code> to UTF-8

Then you read all the lines of the file <file>:

out = ""
File.foreach(file) { |line| out << cd.iconv(line) }

At this point out holds the text transcoded to UTF-8.

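You only have to flush the converter and save the result:

out << cd.iconv(nil) # flush any pending conversion state
file_new_ref.print out # file_new_ref is the output file, already opened for writing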

That's all, if and only if you're absolutely sure about the file encoding.
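
On current Ruby versions Iconv is no longer bundled (it left the standard library in Ruby 2.0), so here is a minimal sketch of the same conversion using String#encode. The helper name transcode_to_utf8 and the sample bytes are made up for the illustration, and it still assumes you really know the source encoding.

```ruby
# Transcode raw bytes from a known source encoding to UTF-8.
# `code` is the name of the true encoding of the data, e.g. "ISO-8859-1".
def transcode_to_utf8(bytes, code)
  # Label the bytes with their real encoding, then convert.
  # Raises an Encoding error if the bytes are not valid in `code`.
  bytes.dup.force_encoding(code).encode("UTF-8")
end

latin1 = "caf\xE9".b                       # the word "café" as ISO-8859-1 bytes
utf8   = transcode_to_utf8(latin1, "ISO-8859-1")
puts utf8.encoding                         # UTF-8
```

For a whole file, something like File.binread(file) would give the raw bytes to pass in.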

However, if you aren't absolutely certain of the file encoding, it's another story.

FIRST, it's better NOT to rely on the more or less fake HTML
directive:

<meta ... content="... charset=<the supposed charset>">

because this directive is too often WRONG: for example, the
server you got the file from could have transcoded the file
itself without updating the written charset...

Then you have to GUESS the encoding, which isn't easy.

I give you here only some rules to do that properly; if you're interested
in more depth, let me know, I'll write a web page for that with

FIRST, if you're lucky, detect whether the encoding is ASCII (but beware:
a lot of people talk about "ASCII encoding" when they don't mean TRUE
(7-bit) ASCII; they are talking about EXTENDED (8-bit) ASCII).

You have to start by guessing whether or not your input file is 7-bit
ASCII; if it is, you're lucky because you don't have to transcode to
UTF-8: 7-bit ASCII is a subset of UTF-8 (every 7-bit ASCII byte is
encoded identically in UTF-8).

You're also lucky because guessing 7-bit ASCII can be done with a
simple regexp, and this guess is INDUBITABLE.
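
As an illustration of that simple regexp, here is a small sketch; the method name seven_bit_ascii? is only a name chosen for the example. Ruby's built-in String#ascii_only? gives the same answer on a binary string.

```ruby
# True when every byte is in the 7-bit ASCII range 0x00..0x7F.
def seven_bit_ascii?(bytes)
  # .b returns a binary copy, so the regexp is applied byte by byte.
  bytes.b.match?(/\A[\x00-\x7F]*\z/n)
end

puts seven_bit_ascii?("<p>plain text</p>")  # true
puts seven_bit_ascii?("caf\xE9".b)          # false (0xE9 is an 8-bit byte)
```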

SECOND, if it is not 7-bit ASCII, guess whether or not the file is UTF-8;
here the regexp is more complicated BUT is also INDUBITABLE.
(For your guess to work better, strip all tags from your HTML input,
because tags are pure 7-bit ASCII; keep only the text.)
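
A rough sketch of this second step follows. Instead of a hand-written UTF-8 regexp it relies on Ruby's built-in String#valid_encoding? check, which is a swapped-in technique, not the regexp mentioned above. The names strip_tags and utf8? are assumptions for the example, and the tag stripping is deliberately naive (it would not cope with a ">" inside an attribute value).

```ruby
# Naive tag removal: replace anything between < and > with a space.
# Works on a binary copy so it never fails on unknown encodings.
def strip_tags(html)
  html.b.gsub(/<[^>]*>/n, " ")
end

# True when the bytes form a well-formed UTF-8 sequence.
def utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

page = "<p>caf\u00E9</p>"          # UTF-8 text inside a tag
puts utf8?(strip_tags(page))        # true
puts utf8?("caf\xE9".b)             # false (a lone 0xE9 is not valid UTF-8)
```

Pure 7-bit ASCII also passes utf8?, which is why the ASCII test comes first.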

Then, after ASCII and UTF-8 detection, you face the real problem ))

FIRST thing to say: here you're at a point where encoding is

but there is a workaround.

As for UTF-8, don't work over the whole content of your input file,
rather only on the text part; wipe out any tags...

What's the workaround?

Instead of detecting the encoding you'll detect the language used in the
text part of the file; this is done statistically by comparing the input
text with a reference text chosen for that purpose.

The trick here is that you can have several reference texts in different
files for the same language, the only difference between the files
being the encoding. Suppose you want to detect only pages written
in French but you don't know which encoding; then you will
have several reference text files like that:


The same for Hungarian, various Japanese, Chinese, etc...

Then, using "enca", you will find the BEST fit between the text of your
input file and the properly encoded reference texts; the output is
statistical, and generally you choose the best fit, that is to say the max
value of the fit.
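
To make the "best fit" idea concrete, here is a toy sketch in plain Ruby. It is not how enca works internally; it only illustrates comparing the input against the same reference text stored in several encodings and keeping the maximum score. The reference sentence, the two candidate encodings, and the names high_byte_profile, cosine and best_fit are all assumptions made for the example.

```ruby
# Relative frequency of each non-ASCII byte. ASCII bytes look the same in
# all candidate encodings, so they carry no information here.
def high_byte_profile(bytes)
  counts = Hash.new(0)
  bytes.each_byte { |b| counts[b] += 1 if b >= 0x80 }
  total = counts.values.sum.to_f
  counts.transform_values { |n| n / total }
end

# Cosine similarity between two profiles (0.0 when either one is empty).
def cosine(a, b)
  dot   = a.sum { |byte, f| f * b.fetch(byte, 0.0) }
  norm  = ->(p) { Math.sqrt(p.values.sum { |f| f * f }) }
  denom = norm.(a) * norm.(b)
  denom.zero? ? 0.0 : dot / denom
end

# Returns [[language, encoding], score] for the best-fitting reference.
def best_fit(input_bytes, references)
  profile = high_byte_profile(input_bytes)
  scores  = references.map do |label, ref_bytes|
    [label, cosine(profile, high_byte_profile(ref_bytes))]
  end
  scores.max_by { |_, score| score }
end

# One reference text per (language, encoding) couple; a real set would be
# much longer and cover more languages and encodings.
reference_fr = "été à la forêt, ça coûte très cher, élève français"
references = {
  ["french", "ISO-8859-1"] => reference_fr.encode("ISO-8859-1"),
  ["french", "macRoman"]   => reference_fr.encode("macRoman"),
}

input = "Un élève très pressé".encode("ISO-8859-1") # pretend the encoding is unknown
label, score = best_fit(input, references)
puts label.inspect  # ["french", "ISO-8859-1"]
```

With such short texts the score is crude; longer reference texts make the statistics more meaningful.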

Notice this can fail to retrieve the couple (language, encoding)
if the files you have don't cover the language and encoding you're

After that you have the most probable encoding of your file; then you can
apply the transcoding as given above...


Received on Friday, 22 December 2006 06:00:25 UTC
