- From: Yvon Thoraval <yvon_thoraval@mac.com>
- Date: Fri, 22 Dec 2006 07:00:21 +0100
- To: Tania Estébanez <fair_ithilien@yahoo.es>
- Cc: html-tidy@w3.org
- Message-Id: <0A47F5CD-3A91-4D5E-A831-48AE80DCA4B6@mac.com>
On Thursday 21 Dec. 06 19:06 at 22:07, Tania Estébanez wrote:

> Can you tell us about any program that does that?

Not a program, rather a language: Ruby. As you certainly know, Ruby comes from Japan, where they have to deal with encoding problems all the time...

Transcoding is one thing; guessing the TRUE encoding is another. The first task (transcoding) is really easy in Ruby (the same applies to Perl, which I don't know as deeply as Ruby).

Now suppose you have the "code" (i.e. the true encoding of a given file). Transcoding is as easy as:

    require 'iconv'
    cd = Iconv.new("UTF-8", code)  # this supposes you want to transcode <code> to UTF-8

Then you read all the lines of the file <file>:

    out = ""
    file = "/path/to/the/input/file.html"
    File.foreach(file) { |line| out << cd.iconv(line) }
    out << cd.iconv(nil)  # flush whatever the converter still holds

Right now out is transcoded to UTF-8; you only have to save the result:

    file_new = "/path/to/the/output/file.html"
    file_new_ref = File.open(file_new, File::WRONLY|File::CREAT|File::TRUNC)
    file_new_ref.pos = 0
    file_new_ref.print out
    file_new_ref.truncate(file_new_ref.pos)
    file_new_ref.close

That's all, if and only if you're absolutely sure about the file encoding.

However, if you aren't absolutely certain of the file encoding, it's another sport...

FIRST, it's better NOT to rely on the more or less fake HTML directive:

    <meta ... content="... charset=<the supposed charset>">

because this directive is too often WRONG: for example, the server you got the file from could have transcoded the file itself without updating the declared charset...

Then you have to GUESS the encoding, which isn't easy. I give you here only some rules to do that properly; if you're interested in more depth, let me know and I'll write a web page about it with examples...
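For the transcoding step described above (the case where the encoding is known), here is a minimal sketch using String#encode, which is built into Ruby 1.9 and later, rather than Iconv (Iconv was removed from the standard library in Ruby 2.0). The helper name transcode_to_utf8 and the sample bytes are assumptions made up for illustration.

```ruby
# Sketch only: transcode raw bytes from a KNOWN source encoding to UTF-8.
# `transcode_to_utf8` is a hypothetical helper name, not a library call.
def transcode_to_utf8(bytes, source_encoding)
  # Tag a copy of the raw bytes with their real encoding, then convert.
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# "é" is the single byte 0xE9 in ISO-8859-1 and two bytes in UTF-8.
latin1 = "caf\xE9".b
utf8 = transcode_to_utf8(latin1, "ISO-8859-1")
puts utf8.encoding   # prints UTF-8
puts utf8.bytesize   # prints 5
```

For a whole file, the same helper could be fed the result of File.binread(path) and its output written back with File.write. If the declared source encoding is wrong, encode either raises or silently produces the wrong characters, which is why the guessing step matters.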

FIRST, if you're lucky, detect whether the encoding is ASCII (but beware: a lot of people talk about "ASCII encoding" when they mean not TRUE ASCII (7-bit) but EXTENDED ASCII (8-bit)). You have to start by checking whether your input file is 7-bit ASCII or not. If it is, you're lucky, because you don't have to transcode to UTF-8: 7-bit ASCII is a subset of UTF-8, so a pure ASCII file is already valid UTF-8. You're also lucky because detecting 7-bit ASCII can be done with a simple regexp, and the result is UNAMBIGUOUS.

SECOND, if it's not 7-bit ASCII, check whether the file is UTF-8 or not. Here the regexp is more complicated BUT the result is also practically UNAMBIGUOUS (text in another 8-bit encoding almost never happens to form valid UTF-8 sequences). (To make your guess work better, strip all tags from your HTML input, because tags are pure 7-bit ASCII; keep only the text.)

Then, after ASCII and UTF-8 detection, you face the real problem ))

FIRST thing to say: here you're at a point where detecting the encoding is THEORETICALLY IMPOSSIBLE... but there is a workaround. As for UTF-8, don't work on the whole content of your input file, only on the text part; wipe out any tags...

What's the workaround? Instead of detecting the encoding, you detect the language used in the text part of the file. This is done statistically, by comparing the input text with a reference text chosen for that purpose. The trick is that you can have several reference texts in different files for the same language, the only difference between the files being the encoding. Suppose you want to detect only pages written in French, but you don't know which encoding; then you will have several reference text files like these:

    french_ISO-8859-1.lm
    french-Mac-Roman.lm

The same goes for Hungarian, various Japanese, Chinese, etc...

Then, using "enca", you find the BEST fit between the text of your input file and the properly encoded reference texts. The output is statistical; generally you choose the best fit, that is to say the maximum value of the fit.
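
The detection order above (7-bit ASCII, then UTF-8, then a statistical best fit against reference texts) could be sketched roughly as follows. This is an illustration under loud assumptions, not what "enca" does internally: all helper names are hypothetical, the UTF-8 check uses Ruby's built-in valid_encoding? instead of a hand-written regexp, and the "fit" is a crude overlap of high-byte frequencies standing in for a real language model.

```ruby
# Remove anything that looks like an HTML tag, keeping only the text.
# Works on raw bytes, so it is safe before the encoding is known.
def strip_tags(bytes)
  bytes.b.gsub(/<[^>]*>/n, " ")
end

# True when every byte is in the 7-bit range 0x00..0x7F.
def seven_bit_ascii?(bytes)
  bytes.b.match?(/\A[\x00-\x7F]*\z/n)
end

# True when the bytes form well-formed UTF-8 sequences.
def valid_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

# Relative frequency of each byte >= 0x80 (the bytes that differ
# from one 8-bit encoding to another).
def high_byte_profile(bytes)
  counts = Hash.new(0)
  bytes.each_byte { |b| counts[b] += 1 if b >= 0x80 }
  total = counts.values.sum.to_f
  counts.transform_values { |n| n / total }
end

# Overlap of two profiles: 0 means nothing in common, 1.0 means identical.
def fit(a, b)
  (a.keys & b.keys).sum { |k| [a[k], b[k]].min }
end

# references maps an encoding name to reference text stored in that encoding.
# Returns the name with the best fit (the first one if every fit is zero,
# so a real tool would want a threshold for "unknown").
def guess_encoding(bytes, references)
  return "US-ASCII" if seven_bit_ascii?(bytes)
  return "UTF-8" if valid_utf8?(bytes)
  profile = high_byte_profile(strip_tags(bytes))
  best, _ref = references.max_by { |_name, ref| fit(profile, high_byte_profile(ref)) }
  best
end

# Made-up French reference text, stored once per candidate encoding.
sample = "Les élèves français étudient à l'école près de la forêt."
references = {
  "ISO-8859-1" => sample.encode("ISO-8859-1"),
  "macRoman"   => sample.encode("macRoman")
}

page = "<p>Un café très réputé à côté du château</p>".encode("ISO-8859-1")
puts guess_encoding(page, references)   # prints ISO-8859-1
```

The first two checks give a firm answer; only the last step is a guess, and it can only ever return one of the encodings for which a reference text was supplied.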

Notice this can fail to retrieve the pair (language, encoding) if the reference files you have don't cover the language and encoding you're trying to guess...

After that you have the most probable encoding of your file, and you can apply the transcoding given above...

Best,
Yvon

Received on Friday, 22 December 2006 06:00:25 UTC