- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Wed, 06 Nov 2013 12:41:48 +0100
- To: Christophe Chenon <christophe.chenon@fr.ibm.com>
- Cc: html-tidy@w3.org
* Christophe Chenon wrote:
>At some point, my near-XML file contains the following:
>
><econtext><?idd:break> blah blah... </econtext>
>
>Tidy replaces this with
>
><econtext></econtext>
>
>and the whole file is gracefully ended with all necessary closing tags,
>ignoring the full bulk of interesting data below this point.

XML processing instructions are closed by `?>` but the code above lacks
the question mark. If there is no `?>` in the file, I guess Tidy will
read to the end of the file and, say, discard the unrecognised content.

You could remove such markup as a pre-processing step, using a regex to
replace <![CDATA[...]]>, <!--...-->, <?...?> and <?...> with nothing,
and then let Tidy process the document further. That assumes you do not
need to access the contents of these constructs, or that the file does
not contain them. You could also remove only <?...> but then you might
hit instances like <!--<?example>--> which would be a false positive.

>Ideally, this processing instruction ( <?idd:break> ) would be ignored or
>even suppressed. I don't need it at all.
>
>Can a new option be created ? I can envision something like heed-procins :
>Yes/No/Suppress

I could imagine an option to recognise SGML-ish PI syntax which would
then convert `<?idd:break>` into `<?idd:break?>` but in your case that
would still not be namespace well-formed due to the colon in the PI
target, and as such many XML processors would choke on it. An option to
drop PIs does not seem generally useful.

>Another option would be to escape whatever coding is contained in some
>pre-declared elements. Here any coding in the <econtext> element can be
>escaped.

This might be a more sensible option, but you might have to wait a very
long time for it.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
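
A minimal sketch of the pre-processing step suggested in the message
above, in Python. The function name `strip_constructs` is made up for
illustration. The sketch assumes that no processing instruction contains
a `>` character, so a single pattern covers both <?...?> and <?...>; it
also assumes the contents of the removed constructs are not needed.
Putting all patterns into one alternation means a <?...> inside a
comment or CDATA section is consumed together with its enclosing
construct, which avoids the <!--<?example>--> false positive.

```python
import re

# Alternatives are tried left to right at each position, so comments and
# CDATA sections are removed whole, including any "<?...>" inside them.
_STRIP = re.compile(
    r"<!\[CDATA\[.*?\]\]>"  # CDATA sections
    r"|<!--.*?-->"          # comments
    r"|<\?[^>]*>",          # PIs, XML-style <?...?> or SGML-style <?...>
    re.DOTALL,              # let "." span line breaks
)


def strip_constructs(text: str) -> str:
    """Return text with CDATA sections, comments and PIs removed.

    Assumption: no processing instruction contains a '>' character.
    """
    return _STRIP.sub("", text)


if __name__ == "__main__":
    sample = "<econtext><?idd:break> blah blah... </econtext>"
    print(strip_constructs(sample))
```

The result could then be handed to Tidy as usual.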
Received on Wednesday, 6 November 2013 11:42:16 UTC