Re: HTML Tidy and charset

On 11 Dec 2006 at 16:47, Tania Estébanez said:

> 
> Hello
> 
> I'm using the command line version of Tidy for Windows XP [1] to convert
> HTML to XHTML. I've only used it a couple of times, so I'm new to this
> program.
> 
> While converting some HTML documents, I've seen that every time I do it,
> Tidy modifies the charset attribute from the meta tag. Instead of keeping
> its original value (EUC-JP or UTF-8, for example), it always puts the
> value "US-ASCII". Why does this happen? Is there any way to prevent this
> from happening?
> 
> I'm not using any config file at all, and the command I use is:
> 
> tidy -asxhtml inputFile.txt > outputFile.xml

The short answer is that US-ASCII is correct for what you have asked Tidy 
to create.

The charset parameter specifies how to interpret "high-bit" bytes, 
i.e. those that fall outside the basic US-ASCII range (US-ASCII being a 
common subset of almost all the charsets in use, apart from the 
EBCDIC-based ones that you hardly ever see).
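
To illustrate the "common subset" point, here is a small Python sketch. It 
is not anything Tidy does itself, and the sample strings are made up for 
the illustration. Pure-ASCII bytes decode to the same text under several 
charsets, whereas bytes with the high bit set do not.

```python
# Bytes below 0x80 decode identically in every ASCII-compatible charset.
ascii_bytes = b"<p>Hello</p>"
charsets = ["ascii", "utf-8", "euc_jp", "latin-1"]
decoded = {ascii_bytes.decode(cs) for cs in charsets}
print(decoded)  # {'<p>Hello</p>'}: one distinct result for all four charsets

# High-bit bytes are where the declared charset starts to matter.
high_bytes = "日本".encode("euc_jp")
as_euc = high_bytes.decode("euc_jp")      # the intended text
as_latin1 = high_bytes.decode("latin-1")  # same bytes, wrong charset
print(as_euc == as_latin1)  # False
```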

With no overrides, Tidy creates "7-bit" output, replacing all "extended" 
characters with their entity names or numeric character references. 
No "high-bit" bytes remain in the output, so there is nothing for a wider 
charset to describe, and US-ASCII is the accurate label.
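
The effect can be imitated in a few lines of Python. This is only a sketch 
of the idea, not Tidy's own code: the function name `to_seven_bit` is 
invented for the example, and it always emits numeric references, whereas 
Tidy may use a named entity where one exists.

```python
def to_seven_bit(text: str) -> bytes:
    """Encode text as US-ASCII, writing every non-ASCII character
    as a numeric character reference such as &#233;."""
    return text.encode("ascii", errors="xmlcharrefreplace")


sample = "café 日本"
out = to_seven_bit(sample)
print(out)  # b'caf&#233; &#26085;&#26412;'

# Every byte is below 0x80, so the result reads the same under any
# ASCII-compatible charset, and "US-ASCII" describes it accurately.
print(all(b < 0x80 for b in out))  # True
```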

If you really need "8-bit" output, there is a limited selection of 
parameters to force Tidy to use particular character sets. I don't think 
EUC-JP is one of them, though.
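
For the UTF-8 documents, something along these lines may be worth trying. 
It is a sketch under assumptions: it relies on the `-utf8` command-line 
switch (check `tidy -help` for the switches your build actually accepts), 
the helper names `tidy_command` and `run_tidy` are invented for the 
example, and it does nothing for EUC-JP.

```python
import os
import shutil
import subprocess


def tidy_command(input_path, encoding_flag="-utf8"):
    """Build the Tidy argument list with an explicit encoding switch."""
    return ["tidy", "-asxhtml", encoding_flag, input_path]


def run_tidy(input_path, output_path, encoding_flag="-utf8"):
    """Run Tidy, writing its standard output to output_path.

    Returns Tidy's exit code."""
    with open(output_path, "wb") as out:
        result = subprocess.run(tidy_command(input_path, encoding_flag), stdout=out)
    return result.returncode


if __name__ == "__main__":
    # File names are the ones from the original question.
    if shutil.which("tidy") is None or not os.path.exists("inputFile.txt"):
        print("tidy or inputFile.txt not found; nothing to do")
    else:
        print(run_tidy("inputFile.txt", "outputFile.xml"))
```

The same switch can simply be added to the original command line, between 
-asxhtml and the input file name.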

Received on Tuesday, 12 December 2006 09:25:17 UTC