
Re: Ignore processing instructions

From: Christophe Chenon <christophe.chenon@fr.ibm.com>
Date: Wed, 18 Dec 2013 11:59:58 +0100
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: html-tidy@w3.org
Message-ID: <OF5410E5AC.13EB0217-ONC1257C45.003AB564-C1257C45.003C70F1@fr.ibm.com>
Hi Bjoern, 

Regarding the last sentence of your answer below:

> This might be a more sensible option, but you might have to wait a very
> long time for it.

Can you tell me more?

I believe this option, "Escape all XML-reserved characters within
pre-declared elements listed in the config file", would be very helpful
to me and probably to many html-tidy users.

Is there any way to make it happen?

Cordialement / Best regards


Christophe Chenon, PhD
Innovation · Terminology · Quality
French Translation Services Center
IBM Corporate Multilingual Terminology
1, Place Jean-Baptiste Clément · Noisy-le-grand · 93881 · France
+33-1-4941-7283
christophe.chenon@fr.ibm.com

From:   Bjoern Hoehrmann <derhoermi@gmx.net>
To:     Christophe Chenon/France/IBM@IBMFR, 
Cc:     html-tidy@w3.org
Date:   06/11/2013 12:42
Subject:        Re: Ignore processing instructions



* Christophe Chenon wrote:
>At some point, my near-XML file contains the following: 
>
><econtext><?idd:break> blah blah... </econtext>
>
>Tidy replaces this with 
>
><econtext></econtext> 
>
>and the whole file is gracefully ended with all necessary closing tags, 
>ignoring the full bulk of interesting data below this point. 

XML processing instructions are closed by `?>` but the code above lacks
the question mark. If there is no `?>` in the file, I guess Tidy will
read to the end of the file and, say, discard the unrecognised content.

You could remove such markup as a pre-processing step: use a regex to
replace <![CDATA[...]]>, <!--...-->, <?...?> and <?...> with nothing and
then let Tidy process the document further, assuming you do not need to
access the contents of these constructs or the file does not contain
them. You could also remove only <?...>, but then you might hit instances
like <!--<?example>-->, which would be a false positive.
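
A minimal Python sketch of the pre-processing step described above. It
removes comments, CDATA sections and processing instructions in a single
pass, so a `<?...>` inside a comment is consumed together with the comment
rather than matched on its own. The name `strip_markup` is hypothetical,
and the sketch assumes that no processing instruction contains a `>`
before its real end, which lets one pattern cover both `<?...?>` and
`<?...>`.

```python
import re

# Alternation order matters: comments and CDATA sections are tried first,
# so a "<?...>" nested inside one of them is removed with its container.
MARKUP = re.compile(
    r"<!--.*?-->"            # comments
    r"|<!\[CDATA\[.*?\]\]>"  # CDATA sections
    r"|<\?[^>]*>",           # PIs, closed by "?>" or by a bare ">"
    re.DOTALL,
)


def strip_markup(text: str) -> str:
    """Return text with comments, CDATA sections and PIs removed."""
    return MARKUP.sub("", text)


if __name__ == "__main__":
    sample = "<econtext><?idd:break> blah blah... </econtext>"
    print(strip_markup(sample))
```

The cleaned text can then be passed to Tidy as usual. Note that this
discards the contents of the removed constructs, so it only fits when
those contents are not needed.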

>Ideally, this processing instruction ( <?idd:break> ) would be ignored or
>even suppressed. I don't need it at all.
>
>Can a new option be created? I can envision something like heed-procins:
>Yes/No/Suppress

I could imagine an option to recognise SGML-ish PI syntax which would
then convert `<?idd:break>` into `<?idd:break?>` but in your case that
would still not be namespace well-formed due to the colon and as such
many XML processors would choke on it. An option to drop PIs does not
seem generally useful.
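
No such option exists in Tidy as far as this thread shows, but the
conversion itself can be approximated outside Tidy. The following Python
sketch rewrites `<?name>` as `<?name?>` and leaves comments, CDATA
sections and already closed PIs untouched. The name `close_pis` is
hypothetical, and the sketch assumes a PI never contains `>` before its
end. It does not address the colon in the PI target.

```python
import re

# Comments and CDATA sections are matched first and returned unchanged;
# only the third alternative captures a PI body in group 1.
TOKEN = re.compile(
    r"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?([^>]*)>",
    re.DOTALL,
)


def close_pis(text: str) -> str:
    """Turn SGML-style PIs (<?x>) into XML-style PIs (<?x?>)."""

    def fix(match: re.Match) -> str:
        body = match.group(1)
        # body is None for comments and CDATA; a trailing "?" means the
        # PI is already closed by "?>".
        if body is None or body.endswith("?"):
            return match.group(0)
        return "<?" + body + "?>"

    return TOKEN.sub(fix, text)


if __name__ == "__main__":
    print(close_pis("<econtext><?idd:break> blah blah... </econtext>"))
```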

>Another option would be to escape whatever coding is contained in some 
>pre-declared elements. Here  any coding in the <econtext> element can be 
>escaped. 

This might be a more sensible option, but you might have to wait a very
long time for it.
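
Until such an option exists, the escaping can be approximated as a
pre-processing step. This Python sketch escapes `&`, `<` and `>` inside
every element whose name appears in a given list. The name
`escape_elements` is hypothetical, and the sketch assumes the listed
elements are not nested inside elements of the same name and that their
content is not already escaped.

```python
import re
from xml.sax.saxutils import escape


def escape_elements(text: str, names: list[str]) -> str:
    """Escape &, < and > in the content of the named elements."""
    if not names:
        return text
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(
        # group 1: start tag, group 2: element name,
        # group 3: raw content, group 4: matching end tag
        r"(<(" + alternatives + r")(?:\s[^>]*)?>)(.*?)(</\2\s*>)",
        re.DOTALL,
    )
    return pattern.sub(
        lambda m: m.group(1) + escape(m.group(3)) + m.group(4),
        text,
    )


if __name__ == "__main__":
    sample = "<econtext><?idd:break> blah blah... </econtext>"
    print(escape_elements(sample, ["econtext"]))
```

Content that already contains entity references such as `&amp;` would be
escaped a second time, so this only suits input whose listed elements
hold raw, unescaped text.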
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 






Received on Wednesday, 18 December 2013 11:00:44 UTC
