Re: persistent xml-decl vs. char-encoding

From: Charles Reitzel <creitzel@rcn.com>
Date: Sun, 02 Feb 2003 17:52:04 -0500
To: Piotr Banski <bansp@venus.ci.uw.edu.pl>
Cc: html-tidy@w3.org

Hi Piotr,

First, I would recommend that you replace the entire XML declaration with 
sed.  Yes, we might fix the bug (not respecting --xml-decl no w/ RAW 
encoding), but probably not in time to meet your needs.  The format is 
highly regular and shouldn't be a problem w/ sed or awk in a shell script.
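
For instance, a minimal sketch of the sed approach, assuming Tidy writes the 
declaration as the first line of its output.  The names DECL and 
fix_xml_decl are made up for illustration, and this has not been run against 
your files:

```shell
#!/bin/sh
# Sketch only: rewrite the XML declaration on line 1 of a stream.
# DECL and fix_xml_decl are illustrative names, not part of Tidy.
DECL='<?xml version="1.0" encoding="iso-8859-2"?>'

fix_xml_decl() {
    # LC_ALL=C makes sed work on bytes, so Latin-2 characters pass through
    # unchanged.  In a basic regular expression "?" is a literal character,
    # so the pattern matches "<?xml", anything up to "?>", and replaces it.
    # Lines other than the first are never touched.
    LC_ALL=C sed "1s/^<?xml[^>]*?>/$DECL/"
}

# Possible use in your script (assumes tidy is on the PATH):
#   tidy --output-xml yes --char-encoding raw "$1" | fix_xml_decl > "$1.xml"
```

If the first line carries no declaration at all, the filter leaves the 
input as it is, so you would still need to prepend the line yourself in 
that case.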

You are fighting an uphill battle using RAW in the first place.  The next 
major item on Tidy's agenda is to support pluggable character encodings a 
la Expat or LibXml.  That said, have a look at the recent changes to 
support ISO-8859-15.  It might be easier all around to do a patch of your own 
to support 8859-2 along the same lines.  Just thinking out loud here.

Second, about the segfault, I found and fixed one in the new diagnostics 
code.  If you are using a Compile Farm executable, it should be there 
tomorrow.  If you are using Windows, I thought I put it up after I fixed that 
problem, but let me know, and I'll make sure to put up a fresh build.  If 
the problem remains, please send a sample config and input file.  Thanks.

take it easy,

At 11:02 PM 2/2/2003 +0100, Piotr Banski wrote:

>I'm trying to prevent Tidy from outputting the xml declaration, because I 
>want it to read <?xml version="1.0" encoding="iso-8859-2"?>, and as far as 
>I can see, Tidy won't let me specify this encoding, so I supply the whole 
>line from a shell script. And, of course, setting add-xml-decl to "no" 
>does the job *if* I don't also specify char-encoding as "raw". (I specify 
>it as "raw", to prevent Tidy from mangling Latin-2 characters in the files 
>I process.)
>So, if I use cmdline arguments, I can suppress the declaration when I do e.g.:
>tidy --output-xml yes --add-xml-decl no --tidy-mark no $1 >> $1.xml
>but it stops working if I do:
>tidy --output-xml yes --add-xml-decl no --char-encoding raw $1 >> $1.xml
>To make things even more interesting, let me add that if I specify
>char-encoding as "ascii", it works as it should...
>I get the same behaviour for the versions of 1 Jan and 1 Feb.
>Additionally, the Jan version won't read my config file, apparently, and 
>the Feb version segfaults on the files I need to process (bug report 
>already posted), so I'm somewhat stuck and will gratefully accept some 
>advice :-) I mean, if I have to, I will transcode my files before feeding 
>them to Tidy, but maybe there's something about config options that I've 
>missed, or some upcoming fix only days (hours? ;-) ) away?
>    Piotr
Received on Sunday, 2 February 2003 17:44:03 UTC

