On 11 Dec 2006 at 16:47, Tania Estébanez said:

> Hello
>
> I'm using the command line version of Tidy for Windows XP [1] to convert
> HTML to XHTML. I've only used it a couple of times, so I'm new to this
> program.
>
> While converting some HTML documents, I've seen that every time I do it,
> Tidy modifies the charset attribute from the meta tag. Instead of keeping
> its original value (EUC-JP or UTF-8, for example), it always puts the
> value "US-ASCII". Why does this happen? Is there any way to prevent this
> from happening?
>
> I'm not using any config file at all, and the command I use is:
>
> tidy -asxhtml inputFile.txt > outputFile.xml

The short answer is that US-ASCII is correct for what you have asked Tidy to create.

The charset parameter specifies how to interpret "high-bit" characters, i.e. those that fall outside the basic US-ASCII character set (which is a common subset of all the valid charsets, apart from the EBCDIC-based ones that you hardly ever see).

With no overrides, Tidy will be creating "7-bit" output, and replacing all "extended" characters by their entity names or numeric equivalents. There will not be any "high-bit" bytes present, so it is not appropriate to specify how they should be interpreted.

If you really need "8-bit" output, there is a limited selection of parameters to force Tidy to use particular character sets. I don't think EUC-JP is one of them, though.

Received on Tuesday, 12 December 2006 09:25:17 GMT
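
To illustrate the point about "7-bit" output, here is a minimal Python sketch. It is not Tidy's own code, and the helper name `to_seven_bit` is made up for this example. It only shows that once every non-ASCII character has been replaced by a numeric character reference, no high-bit bytes remain, so a US-ASCII charset label describes the bytes accurately.

```python
def to_seven_bit(text: str) -> bytes:
    """Encode text as ASCII, swapping any non-ASCII character
    for a numeric character reference such as &#233;.

    Hypothetical helper for illustration; Tidy's real output may use
    named entities instead of numeric references for some characters.
    """
    return text.encode("ascii", errors="xmlcharrefreplace")


sample = "<p>日本語 café</p>"
encoded = to_seven_bit(sample)

# Show the encoded bytes, then confirm that every byte is below 0x80.
print(encoded)
print(all(byte < 0x80 for byte in encoded))
```

If 8-bit output is what you want, Tidy's command-line encoding options include `-utf8`, so something like `tidy -asxhtml -utf8 inputFile.txt > outputFile.xml` should be a reasonable starting point. Check `tidy -help` for the options your build actually supports.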