Tidy messes up charset from Gábor Kövesdán on 2006-09-30 (html-tidy@w3.org from October to December 2006)

From: Gábor Kövesdán <gabor@FreeBSD.org>
Date: Sat, 30 Sep 2006 15:45:55 +0200
To: html-tidy@w3.org
Message-ID: <451E7513.9080805@FreeBSD.org>

Hello,

I'm working on the Hungarian translation of the FreeBSD webpages, which 
are written in SGML/XSLT. After sgmlnorm normalizes the SGML files, 
the output is processed by tidy with the following options:

-wrap 90 -m -raw -preserve -f /dev/null -asxml

As a result, &reg;, &trade; and &copy; are substituted wrongly if I use 
the iso-8859-2 character set. When using utf-8, almost all entities 
(&aacute; etc.) show up weirdly. The tidy help says:

 -raw                output values above 127 without conversion to entities

But the entities are actually substituted. If I disable tidy, everything 
is fine, but I need tidy to make the output W3C valid. There are 
specific things (like a custom SGML DOCTYPE declaration) that should be 
processed by tidy for validity. Could somebody tell me what I'm doing 
wrong? I can publish the SGML sources on request or provide a log of 
the processing of those files.
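
A minimal Python sketch of one way such symptoms could arise. It assumes 
(unconfirmed, not taken from tidy's source) that the entities are expanded 
to characters and written out as ISO-8859-1 or UTF-8 bytes, and that those 
bytes are then displayed as ISO-8859-2. The helper name reinterpret is 
hypothetical and exists only for this illustration.

```python
# Illustration only: this does not call tidy and makes no claim about
# how tidy handles -raw internally.

def reinterpret(char, written_as, read_as):
    """Encode char in one charset, then decode the same bytes in another."""
    data = char.encode(written_as, errors="replace")
    return data.decode(read_as, errors="replace")

# Entities mentioned in the report, as Unicode characters.
ENTITIES = {
    "reg": "\u00ae",     # registered sign
    "copy": "\u00a9",    # copyright sign
    "aacute": "\u00e1",  # a with acute
}

if __name__ == "__main__":
    for name, char in ENTITIES.items():
        latin1 = reinterpret(char, "iso-8859-1", "iso-8859-2")
        utf8 = reinterpret(char, "utf-8", "iso-8859-2")
        print(f"&{name}; Latin-1 bytes read as Latin-2: {latin1!r}, "
              f"UTF-8 bytes read as Latin-2: {utf8!r}")
```

Under that assumption, &aacute; survives a Latin-1/Latin-2 mix-up because 
both charsets place it at byte 0xE1, while &reg; and &copy; land on 
different letters. This would match the iso-8859-2 symptom, but it is only 
one possible explanation.
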
The version that I use:
[root@server /usr/www/en]# tidy -v
HTML Tidy for FreeBSD released on 1 September 2005
I also experienced the same with the current CVS version.
Thanks in advance.

-- 
Cheers,

Gabor

P.S.: Please CC me when replying, since I'm not subscribed to the list.

Received on Sunday, 1 October 2006 03:09:12 UTC