- From: Yvon Thoraval <yvon_thoraval@mac.com>
- Date: Fri, 22 Dec 2006 07:00:21 +0100
- To: Tania Estébanez <fair_ithilien@yahoo.es>
- Cc: html-tidy@w3.org
- Message-Id: <0A47F5CD-3A91-4D5E-A831-48AE80DCA4B6@mac.com>
On Thursday 21 Dec. 06, 19:06 at 22:07, Tania Estébanez wrote:
> Can you tell us about any program that does that?
>
Not a program, rather a language: Ruby.
As you certainly know, Ruby comes from Japan, where they have to deal
with encoding problems head-on...
Transcoding is one thing;
guessing what the TRUE encoding is, is another.
The first task (transcoding) is really easy in Ruby (the same
applies to Perl, which I don't know as deeply as Ruby).
Now suppose you have the "code" (i.e. the true encoding of a given file);
transcoding is as easy as:
require 'iconv'
# Iconv.new(to, from): this assumes you want to transcode from <code> to UTF-8
cd=Iconv.new("UTF-8",code)
then, you read all the lines of the file <file> :
out=""
file="/path/to/the/input/file.html"
File.open(file).each {|line| out << cd.iconv(line) }
out now holds the text transcoded to UTF-8;
you only have to save the result:
file_new="/path/to/the/output/file.html"
out << cd.iconv(nil) # flush any pending conversion state
file_new_ref=File.open(file_new,File::WRONLY|File::CREAT|File::TRUNC)
file_new_ref.pos=0
file_new_ref.print out
file_new_ref.truncate(file_new_ref.pos)
file_new_ref.close
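Iconv is no longer part of the standard library in current Ruby versions, so here is a minimal sketch of the same read, transcode and save steps using String#encode instead (a swapped-in technique, not the Iconv calls above). The method name transcode_file and its parameters are hypothetical, and it assumes the file is small enough to read in one go.

```ruby
# Transcode the file +src+, known to be encoded in +code+, to UTF-8
# and write the result to +dst+. Returns the UTF-8 string.
def transcode_file(src, dst, code)
  raw = File.binread(src)                 # read the bytes untouched
  # Label the bytes with their true encoding, then convert.
  # encode raises if the bytes are not valid for +code+.
  utf8 = raw.force_encoding(code).encode("UTF-8")
  File.binwrite(dst, utf8)                # write the UTF-8 bytes as they are
  utf8
end
```

For example, transcode_file("/path/to/the/input/file.html", "/path/to/the/output/file.html", "ISO-8859-1") would do the same job as the lines above.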
That's all, if and only if you're absolutely sure about the file encoding.
However, if you aren't absolutely certain of the file encoding, it's another
sport...
FIRST, it's better NOT to rely on the more or less fake HTML
directive:
<meta ... content="... charset=<the supposed charset>">
because this directive is too often WRONG: for example, the
server you get the file from could have transcoded the file by itself
without updating the written charset...
So you have to GUESS the encoding, which isn't easy.
I give you here only some rules to do that properly; if you're interested
in more depth, let me know and I'll write a web page about it with
examples...
FIRST, if you're lucky, detect whether the encoding is ASCII (but beware:
a lot of people who talk about ASCII encoding don't mean TRUE
(7-bit) ASCII, they mean EXTENDED (8-bit) ASCII).
You have to start by checking whether your input file is 7-bit
ASCII or not. If it is, you're lucky: you don't have to transcode to
UTF-8, because 7-bit ASCII is a strict subset of UTF-8 (every 7-bit
ASCII file is already valid UTF-8).
You're also lucky because detecting 7-bit ASCII can be done with a
simple regexp, and this detection is INDUBITABLE.
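A minimal sketch of such a regexp check, assuming the file content has already been read into a string; the method name seven_bit_ascii? is hypothetical. It simply looks for any byte with the high bit set.

```ruby
# True when +data+ contains no byte in the range 0x80..0xFF,
# i.e. when it is pure 7-bit ASCII.
def seven_bit_ascii?(data)
  # .b gives a binary copy, so the regexp works byte by byte
  # whatever encoding label the string carries.
  !data.b.match?(/[\x80-\xFF]/n)
end
```

Ruby's built-in String#ascii_only? answers the same question on a binary string; the regexp form is shown because it matches the rule described above.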
SECOND, if it's not 7-bit ASCII, check whether the file is UTF-8 or not.
Here the regexp is more complicated BUT the result is practically
INDUBITABLE too: text in another 8-bit encoding almost never happens to
be valid UTF-8.
(To make your guess work better, strip all tags from your HTML input,
because tags are pure 7-bit ASCII; keep only the text.)
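As a hedged sketch of the UTF-8 check: rather than writing out the byte-range regexp, current Ruby can validate the byte sequence itself with String#valid_encoding?, which is what is used here instead of the regexp. The method name valid_utf8? is hypothetical.

```ruby
# True when the bytes of +data+ form a well-formed UTF-8 sequence.
def valid_utf8?(data)
  # dup so the caller's string keeps its own encoding label;
  # force_encoding only relabels the bytes, it converts nothing.
  data.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end
```

Note that a pure 7-bit ASCII input also passes this test, which is why the ASCII check comes first.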
Then, after ASCII and UTF-8 detection, you face the real problem ))
FIRST thing to say: you're now at a point where detecting the encoding is
THEORETICALLY IMPOSSIBLE...
but there is a workaround.
As for UTF-8, don't work on the whole content of your input file,
only on the text part; wipe out any tags...
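A crude sketch of wiping out the tags before the encoding is known, assuming an ASCII-compatible encoding (so that "<" and ">" are single bytes) and ignoring comments, scripts and other HTML corner cases. The method name text_only is hypothetical.

```ruby
# Replace every <...> run with a space, working on raw bytes so that
# the still-unknown encoding of the text part does not matter.
def text_only(html)
  html.b.gsub(/<[^>]*>/n, " ")
end
```

The result is a binary string holding only the text part, ready to be handed to the guessing step.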
What's the workaround?
Instead of detecting the encoding, you detect the language used in the text
part of the file. This is done statistically, by comparing the input
text with a reference text chosen for that purpose.
The trick is that you can have several reference texts in different
files for the same language, the only difference between the files
being the encoding. Suppose you want to detect only pages written
in French but you don't know which encoding; then you will
have several reference text files like these:
french_ISO-8859-1.lm
french-Mac-Roman.lm
and the same for Hungarian, various Japanese, Chinese, etc.
Then, using "enca", you find the BEST fit between the text of your
input file and the properly encoded reference texts. The output is
statistical; generally you choose the best fit, that is to say the max
value of the fit.
Notice this can fail to retrieve the right pair (language, encoding)
if the reference files you have don't cover the language and encoding
you're guessing...
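To illustrate the idea of a statistical best fit (this is not how enca works internally, only a toy sketch), the code below compares byte-bigram counts of the input with those of the same reference text encoded in two ways, and keeps the highest score. The sample text, the REFERENCES table and all method names are hypothetical; a real reference text would be much longer.

```ruby
# Hypothetical reference sample; far too short for real use.
SAMPLE = "été, café, thé, clé, à côté, très, père, mère, fête, forêt, où, déjà, âge"

# One reference per (language, encoding) pair: same text, different bytes.
REFERENCES = {
  ["french", "ISO-8859-1"] => SAMPLE.encode("ISO-8859-1").b,
  ["french", "macRoman"]   => SAMPLE.encode("macRoman").b
}

# Count every pair of adjacent bytes in +bytes+.
def bigram_profile(bytes)
  profile = Hash.new(0)
  bytes.bytes.each_cons(2) { |pair| profile[pair] += 1 }
  profile
end

# Cosine similarity between two bigram profiles (0.0 when either is empty).
def cosine(p, q)
  dot = p.sum { |gram, n| n * q.fetch(gram, 0) }
  norm = Math.sqrt(p.values.sum { |n| n * n }) *
         Math.sqrt(q.values.sum { |n| n * n })
  norm.zero? ? 0.0 : dot / norm
end

# Return [[language, encoding], score] for the best-fitting reference.
def best_fit(input_bytes, references)
  input_profile = bigram_profile(input_bytes)
  references
    .map { |label, ref| [label, cosine(input_profile, bigram_profile(ref))] }
    .max_by { |_label, score| score }
end
```

Calling best_fit(text_bytes, REFERENCES) returns the (language, encoding) pair with the max fit together with its score; as said above, it can only answer with a pair that is present in the references.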
After that you have the most probable encoding of your file, and you can
apply the transcoding as given above...
best,
Yvon
Received on Friday, 22 December 2006 06:00:25 UTC