Tidy messes up charset

Hello,

I'm working on the Hungarian translation of the FreeBSD web pages, which
are written in SGML/XSLT. After sgmlnorm normalizes the SGML files,
the output is processed by tidy with the following options:

-wrap 90 -m -raw -preserve -f /dev/null -asxml

As a result, ®, ™ and © are substituted incorrectly when I use the
iso-8859-2 character set. When using utf-8, almost all entities
(á etc.) show up weirdly. The tidy help says:

 -raw                output values above 127 without conversion to entities

But the entities are actually substituted. If I disable tidy, everything 
is fine, but I need tidy to make the sources W3C valid. There are 
specific things (like custom SGML DOCTYPE declaration) that should be 
processed by tidy for validity. Could somebody tell me what I'm doing 
wrong? I can publish the SGML sources on request or provide a log of
the processing of those files.
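
In case it helps to pin down the symptom, here is a minimal Python
sketch. It is not part of the build, and the helper names are made up
for illustration. It only assumes that the wrong output comes either
from a character that has no code point in the target charset, or from
bytes written in one charset being read back in another. I have not
confirmed that this is what tidy actually does.

```python
# Hypothetical illustration only; not part of the FreeBSD web build.
# Helper names (representable, misread) are invented for this sketch.

def representable(ch: str, encoding: str) -> bool:
    """Return True if the character can be encoded in the given charset."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def misread(ch: str, written_as: str, read_as: str) -> str:
    """Encode in one charset and decode in another to show the mix-up."""
    return ch.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    # U+00AE registered sign, U+2122 trade mark, U+00A9 copyright, U+00E1 a-acute
    for ch in ["\u00ae", "\u2122", "\u00a9", "\u00e1"]:
        print(ascii(ch), "fits in iso-8859-2:", representable(ch, "iso-8859-2"))

    # A Latin-1 byte for the registered sign, read back as Latin-2
    print(ascii(misread("\u00ae", "iso-8859-1", "iso-8859-2")))
    # UTF-8 bytes for a-acute, read back as Latin-2
    print(ascii(misread("\u00e1", "utf-8", "iso-8859-2")))
```

The point of the sketch is that iso-8859-2 has a slot for á but none
for ®, ™ or ©, so those three can only survive as entities in that
charset, while a two-byte utf-8 sequence read as iso-8859-2 turns into
two unrelated letters.
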
The version that I use:
[root@server /usr/www/en]# tidy -v
HTML Tidy for FreeBSD released on 1 September 2005
I also see the same behaviour with the current CVS version.
Thanks in advance.

-- 
Cheers,

Gabor

P.S.: Please CC me when replying, since I'm not subscribed to the list.

Received on Sunday, 1 October 2006 03:09:12 UTC