RE: XML Tidy? from Ignacio Vazquez-Abrams on 2001-06-19 (html-tidy@w3.org from April to June 2001)

From: Ignacio Vazquez-Abrams <ignacio@openservices.net>
Date: Tue, 19 Jun 2001 17:45:20 -0400 (EDT)
To: <html-tidy@w3.org>
Message-ID: <Pine.LNX.4.33.0106191733450.26671-200000@terbidium.openservices.net>

On Tue, 19 Jun 2001, Reitzel, Charlie wrote:

> I have the same type of thing in mind.  It is the kind of thing you should
> be able to accomplish with a library version of Tidy.
>
> In the meantime, if you preserve the original input, you should be able to
> simply subtract N lines from the _reported_ line numbers  - that is, I am
> assuming, the N lines of header you put in front of the user's input.
>
> take it easy,
> Charlie
>
> -----Original Message-----
> From: Ignacio Vazquez-Abrams [mailto:ignacio@openservices.net]
> Sent: Tuesday, June 19, 2001 10:02 AM
> To: html-tidy@w3.org
> Subject: Re: XML Tidy?
>
>
> On Mon, 18 Jun 2001, Klaus Johannes Rusch wrote:
>
> > In <Pine.LNX.4.33.0106181025230.30759-100000@terbidium.openservices.net>,
> > Ignacio Vazquez-Abrams <ignacio@openservices.net> writes:
> > > I was wondering if there exists any version or variant
> > > or configuration of Tidy which could deal with an XML/HTML
> > > hybrid? More specifically I need to just deal with the
> > > stuff that would appear inside the BODY tag, without adding
> > > the HTML, HEAD, and TITLE tags. I have tried a lot of
> > > configuration options for HTML Tidy, but have had no
> > > success so far.
> >
> > > You can either use the -xml option to process the
> > > fragment only as an XML fragment; however, this will not do
> > > any of the usual HTML cleanup.
>
> The problem is that I need to do the HTML cleanup; I need to clean up a
> pseudoHTML document entered by the user, and this document will only contain
> a piece of an HTML page.
>
> > Or, run the fragment through tidy using the -asxml
> > option, then extract everything between <body> and </body>.
>
> While that works for the output stage (Oh no! select="html/body"! The
> horror!:P ), I would also like to provide entry-time verification and
> cleanup of code. Having to search for /line ([0-9]+) / and subtracting when
> displaying errors to the user, while not difficult, is something I'd like to
> avoid.
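
The workaround discussed in the quoted thread (wrap the fragment in a header,
pull out what ends up between <body> and </body>, and subtract the header
length from the reported line numbers) could be sketched roughly as below.
This is only an illustration, not anything shipped with Tidy: the HEADER and
FOOTER strings and the three function names are hypothetical, the sample
message format is assumed, and the regular expression is the
/line ([0-9]+) / pattern mentioned above. Running Tidy itself is left out.

```python
import re

# Hypothetical wrapper placed around the user's fragment before tidying.
HEADER = "<html>\n<head>\n<title>fragment</title>\n</head>\n<body>\n"
FOOTER = "\n</body>\n</html>\n"
HEADER_LINES = HEADER.count("\n")  # the N lines put in front of the input

LINE_RE = re.compile(r"line ([0-9]+) ")
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)


def wrap_fragment(fragment: str) -> str:
    """Turn a body-only fragment into a full document for the cleanup pass."""
    return HEADER + fragment + FOOTER


def extract_body(document: str) -> str:
    """Return everything between <body> and </body> in the cleaned output."""
    match = BODY_RE.search(document)
    if match is None:
        raise ValueError("no <body> element found")
    return match.group(1).strip()


def shift_line_numbers(report: str, header_lines: int = HEADER_LINES) -> str:
    """Subtract the header length from every 'line N ' in a report."""
    def fix(match: re.Match) -> str:
        adjusted = int(match.group(1)) - header_lines
        # Messages about the header itself are clamped to line 1
        # (an arbitrary choice made for this sketch).
        return f"line {max(adjusted, 1)} "
    return LINE_RE.sub(fix, report)
```

With the five-line header above, a message reported against line 7 of the
wrapped document refers to line 2 of what the user actually typed.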

Attached for refinement is a patch that does almost exactly what I originally
asked for. It defines another configuration option ("body-only") that makes
parser.c skip generation of the <html> node and all header tags.

The only problem seems to be that if Tidy actually encounters <html> and/or
header tags in the input, well, it dies a horrible death. If somebody could
take a look at the patch and clean it up, that would be much appreciated.

Me take it easy? Riiight... :)

-- 
Ignacio Vazquez-Abrams  <ignacio@openservices.net>

Attachments

TEXT/PLAIN attachment: tidy4aug00-bodyonly.patch

Received on Tuesday, 19 June 2001 18:14:52 UTC