- From: Fred Bone <Fred.Bone@dial.pipex.com>
- Date: Tue, 12 Dec 2006 09:24:35 -0000
- To: html-tidy@w3.org
On 11 Dec 2006 at 16:47, Tania Estébanez said:

> Hello
>
> I'm using the command line version of Tidy for Windows XP [1] to convert
> HTML to XHTML. I've only used it a couple of times, so I'm new to this
> program.
>
> While converting some HTML documents, I've seen that every time I do it,
> Tidy modifies the charset attribute from the meta tag. Instead of keeping
> its original value (EUC-JP or UTF-8, for example), it always puts the
> value "US-ASCII". Why does this happen? Is there any way to prevent this
> from happening?
>
> I'm not using any config file at all, and the command I use is:
>
> tidy -asxhtml inputFile.txt > outputFile.xml

The short answer is that US-ASCII is correct for what you have asked Tidy to create.

The charset parameter specifies how to interpret "high-bit" characters, i.e. those that fall outside the basic US-ASCII character set (which is a common subset of all the valid charsets, apart from the EBCDIC-based ones that you hardly ever see).

With no overrides, Tidy will be creating "7-bit" output, and replacing all "extended" characters by their entity names or numeric equivalents. There will not be any "high-bit" bytes present, so it is not appropriate to specify how they should be interpreted.

If you really need "8-bit" output, there is a limited selection of parameters to force Tidy to use particular character sets. I don't think EUC-JP is one of them, though.
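
To illustrate the point about "7-bit" output, here is a minimal Python sketch. It is not Tidy's own code; the function name to_seven_bit and the sample string are made up for the illustration. It replaces every character outside US-ASCII with a numeric character reference, then checks that no high-bit bytes remain and that the result reads the same whichever of the charsets mentioned above is used to decode it.

```python
def to_seven_bit(text: str) -> bytes:
    """Hypothetical helper: encode text as US-ASCII, writing every
    character outside that range as a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace")


# Sample text with Japanese and accented Latin characters (made up).
sample = "日本語 café"
encoded = to_seven_bit(sample)

# No byte has its high bit set, so there is nothing for a charset
# declaration to disambiguate.
has_high_bit = any(b >= 0x80 for b in encoded)

# The same bytes decode to the same text under each of these charsets.
decoded = {name: encoded.decode(name) for name in ("ascii", "utf-8", "euc_jp")}

print(encoded)
print(has_high_bit)
print(len(set(decoded.values())) == 1)
```

Since the bytes are identical under all three encodings, labelling such output US-ASCII describes it accurately and loses nothing.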
Received on Tuesday, 12 December 2006 09:25:17 UTC