RE: XML Tidy?

Great idea.  I'd try a somewhat different tack, tho.  Rather than mess w/
the parser _or_ your input, I'd just feed the prospective <body> contents
to Tidy as is.  Let the parser merrily add
<head><title></title></head><body> your stuff! </body>.  Then, if the
--body-only option is set, call a new PPrintContent() function in pprint.c.
This function would call PPrintTree() for each member of the body->content
node list.
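
Roughly what I have in mind, as a sketch only.  The Node struct and the
toy PPrintTree() below are stand-ins I made up so the example compiles on
its own; they are not Tidy's real definitions, and FindElement() is a
hypothetical helper.  The only new piece is PPrintContent(), which walks
the children of a node and prints each one without printing the node's
own tags.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for Tidy's Node (assumed layout, only what the sketch needs). */
typedef struct Node {
    const char *element;    /* tag name, or NULL for a text node */
    const char *text;       /* text content for text nodes */
    struct Node *content;   /* first child */
    struct Node *next;      /* next sibling */
} Node;

/* Toy printer standing in for the real PPrintTree() in pprint.c. */
void PPrintTree(FILE *out, const Node *node)
{
    const Node *child;

    if (node->element == NULL) {
        fputs(node->text ? node->text : "", out);
        return;
    }
    fprintf(out, "<%s>", node->element);
    for (child = node->content; child != NULL; child = child->next)
        PPrintTree(out, child);
    fprintf(out, "</%s>", node->element);
}

/* Print each child of `parent`, but not the parent's own start/end tags. */
void PPrintContent(FILE *out, const Node *parent)
{
    const Node *child;

    assert(parent != NULL);
    for (child = parent->content; child != NULL; child = child->next)
        PPrintTree(out, child);
}

/* Hypothetical helper: depth-first search for the first element `name`. */
const Node *FindElement(const Node *node, const char *name)
{
    const Node *found;

    for (; node != NULL; node = node->next) {
        if (node->element != NULL && strcmp(node->element, name) == 0)
            return node;
        found = FindElement(node->content, name);
        if (found != NULL)
            return found;
    }
    return NULL;
}
```

With --body-only set, the output stage would then boil down to something
like PPrintContent(out, FindElement(document, "body")) in place of the
usual whole-tree print.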

The important point here is you can safely add functionality without
messing w/ the inner workings of the parser.  Even better, Tidy will report
the line numbers correctly for any errors/warnings it emits.

take it as easy as you can stand it (which is qualified by noting that I am
an uptight east coaster.  It's one of those "do as I say, not as I do" kind
of things),

Charlie


-----Original Message-----
From: Ignacio Vazquez-Abrams [mailto:ignacio@openservices.net]
Sent: Tuesday, June 19, 2001 5:45 PM
To: html-tidy@w3.org
Subject: RE: XML Tidy?


On Tue, 19 Jun 2001, Reitzel, Charlie wrote:

> I have the same type of thing in mind.  It is the kind 
> of thing you should be able to accomplish with a
> library version of Tidy.
>
> In the meantime, if you preserve the original input,
> you should be able to simply subtract N lines from the
> _reported_ line numbers  - that is, I am assuming, the 
> N lines of header you put in front of the user's
> input.
>
> take it easy,
> Charlie
>
> -----Original Message-----
> From: Ignacio Vazquez-Abrams [mailto:ignacio@openservices.net]
> Sent: Tuesday, June 19, 2001 10:02 AM
> To: html-tidy@w3.org
> Subject: Re: XML Tidy?
>
>
> On Mon, 18 Jun 2001, Klaus Johannes Rusch wrote:
>
> > In
> > <Pine.LNX.4.33.0106181025230.30759-100000@terbidium.openservices.net>,
> > Ignacio Vazquez-Abrams <ignacio@openservices.net> writes:
> > > I was wondering if there exists any version or variant
> > > or configuration of Tidy which could deal with an 
> > > XML/HTML hybrid? More specifically I need to just 
> > > deal with the stuff that would appear inside the 
> > > BODY tag, without adding the HTML, HEAD, and TITLE
> > > tags. I have tried a lot of configuration options 
> > > for HTML Tidy, but have had no success so far.
> >
> > You can either use the -xml option to process the
> > fragment as an XML fragment only; however, this will
> > not do any of the usual HTML cleanup.
>
> The problem is that I need to do the HTML cleanup; I 
> need to clean up a pseudoHTML document entered by the 
> user, and this document will only contain a piece of 
> an HTML page.
>
> > Or, run the fragment through tidy using the -asxml
> > option, then extract everything between <body> and
> > </body>.
>
> While that works for the output stage (Oh no! 
> select="html/body"! The horror!:P ), I would also 
> like to provide entry-time verification and
> cleanup of code. Having to search for /line ([0-9]+)
> / and subtracting when displaying errors to the user,
> while not difficult, is something I'd like to avoid.
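
For reference, the subtract-N workaround discussed above might look
something like the sketch below.  It assumes the message contains the text
"line " followed by a number, per the /line ([0-9]+)/ pattern; the
function name AdjustLineNumber and the sample message are made up for
illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy msg into out, rewriting the first "line N" as "line N-offset".
 * Returns 1 if a line number was rewritten, 0 if msg was copied unchanged.
 */
int AdjustLineNumber(const char *msg, long offset, char *out, size_t outsize)
{
    const char *scan = msg;
    const char *hit;

    assert(msg != NULL && out != NULL && outsize > 0);

    while ((hit = strstr(scan, "line ")) != NULL) {
        const char *digits = hit + 5;   /* first char after "line " */

        if (isdigit((unsigned char)*digits)) {
            char *end;
            long n = strtol(digits, &end, 10);

            /* prefix up to the number, adjusted number, then the rest */
            snprintf(out, outsize, "%.*s%ld%s",
                     (int)(digits - msg), msg, n - offset, end);
            return 1;
        }
        scan = digits;                  /* not a number; keep looking */
    }

    snprintf(out, outsize, "%s", msg);
    return 0;
}
```

Here offset would be the N lines of header put in front of the user's
input before handing it to Tidy.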

Submitted for refinement is a patch that does almost exactly what the
original requirements asked for. It defines another configuration option
("body-only") that skips generation of the <html> node and all header tags
in parser.c.

The only problem seems to be that if it actually encounters <html> and/or
header tags, well, it dies a horrible death. If somebody could take a look
at the patch and clean it up that would be much appreciated.

Me take it easy? Riiight... :)

-- 
Ignacio Vazquez-Abrams  <ignacio@openservices.net>

Received on Tuesday, 19 June 2001 19:58:10 UTC