Re: Ignore processing instructions

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Wed, 06 Nov 2013 12:41:48 +0100
To: Christophe Chenon <christophe.chenon@fr.ibm.com>
Cc: html-tidy@w3.org
Message-ID: <li9k795onfsp447ot9ecc3tqm6dgfso3sm@hive.bjoern.hoehrmann.de>
* Christophe Chenon wrote:
>At some point, my near-XML file contains the following:  
>
><econtext><?idd:break> blah blah... </econtext>
>
>Tidy replaces this with 
>
><econtext></econtext> 
>
>and the whole file is gracefully ended with all necessary closing tags, 
>ignoring the full bulk of interesting data below this point. 

XML processing instructions are closed by `?>` but the code above lacks
the closing question mark. If there is no `?>` anywhere in the file, I
guess Tidy will read to the end of the file and, say, discard the
unrecognised content.

You could remove such markup in a pre-processing step: use a regex to
replace <![CDATA[...]]>, <!--...-->, <?...?> and <?...> with nothing and
then let Tidy process the document further. That assumes you do not need
the contents of these constructs, or that the file does not contain
them. You could also remove only <?...>, but then you might hit instances
like <!--<?example>-->, where the match inside the comment would be a
false positive.
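
A minimal sketch of that pre-processing step in Python, to make the idea
concrete. The function name `strip_markup` is made up for illustration,
and the pattern assumes PI bodies never contain `>`, which real XML PIs
may. A single alternation is used so that a comment or CDATA section is
consumed as a whole before the PI pattern can match inside it.

```python
import re

# Hypothetical helper, not part of Tidy. One alternation, scanned left to
# right: a comment or CDATA section is matched as a unit, so a "<?...>"
# nested inside it is never matched on its own.
_STRIP = re.compile(
    r"<!\[CDATA\[.*?\]\]>"   # CDATA sections
    r"|<!--.*?-->"           # comments
    r"|<\?[^>]*>",           # PIs, with or without the closing '?'
                             # (assumes no '>' inside the PI body)
    re.DOTALL,
)


def strip_markup(text: str) -> str:
    """Return text with CDATA sections, comments and PIs removed."""
    return _STRIP.sub("", text)


if __name__ == "__main__":
    sample = "<econtext><?idd:break> blah blah... </econtext>"
    print(strip_markup(sample))
```

The result could then be handed to Tidy as usual. Note that this throws
away the contents of CDATA sections and comments too, so it only fits if
you do not need them.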

>Ideally, this processing instruction ( <?idd:break> ) would be ignored or 
>even suppressed. I don't need it at all. 
>
>Can a new option be created ? I can envision something like heed-procins : 
>Yes/No/Suppress

I could imagine an option to recognise SGML-ish PI syntax, which would
then convert `<?idd:break>` into `<?idd:break?>`. In your case the result
would still not be namespace-well-formed because of the colon in the PI
target, so many XML processors would choke on it. An option to drop PIs
does not seem generally useful.
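
Tidy has no such option, but the conversion itself could be approximated
outside Tidy. The following is only a sketch: `close_pis` is a made-up
name, the pattern again assumes PI bodies contain no `>`, and the colon
in the target is left as it is, so the namespace problem remains.

```python
import re

# Hypothetical helper, not a Tidy feature. Comments and CDATA sections are
# matched first and passed through unchanged; only PIs are rewritten.
_TOKEN = re.compile(
    r"<!\[CDATA\[.*?\]\]>|<!--.*?-->"   # left untouched
    r"|<\?([^>]*?)\??>",                # PI body, optional closing '?'
    re.DOTALL,
)


def close_pis(text: str) -> str:
    """Rewrite SGML-style '<?target>' as XML-style '<?target?>'."""
    def fix(match: re.Match) -> str:
        if match.group(1) is None:      # a comment or CDATA section matched
            return match.group(0)
        return "<?" + match.group(1) + "?>"
    return _TOKEN.sub(fix, text)


if __name__ == "__main__":
    print(close_pis("<econtext><?idd:break> blah blah... </econtext>"))
```

PIs that already end in `?>` come out unchanged, so running this over a
mixed file should be harmless under the stated assumption.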

>Another option would be to escape whatever coding is contained in some 
>pre-declared elements. Here  any coding in the <econtext> element can be 
>escaped. 

This might be a more sensible option, but you might have to wait a very
long time for it.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Wednesday, 6 November 2013 11:42:16 UTC
