Re: Converting HTML fragments to XML

In <F991D4265D6AD4119A1900508BC98E572FD60C@NTEXCL01>, William Bagby <williamb@adone.com> writes:
> I have a block of text which has HTML markup in it.  It is possible that it
> is not strictly valid HTML due to non-escaped special characters such as <,
> >, &, etc.  I would like to make it well-formed XML.
> ...
> 
> Question is, is there a simple way, either from the command-line or within a
> configuration file, to tell Tidy *not* to insert the extra tags?  Or do I
> need to modify the source code to accomplish this?

The easiest way is probably to run the markup through Tidy, then strip
everything up to and including the <body> tag, and everything from the
</body> tag onward.

Note that this will still leave you with a <p> tag. Depending on your
fragments, you may be able to simply discard it, or you could place some
"marker" tag in the input to denote the start of your content.
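
As a rough illustration of the stripping step, here is a minimal Python sketch.
It assumes Tidy's output has already been read into a string; the function
names extract_body and drop_single_p are made up for this example and are not
part of Tidy. The regular expressions are a simplification that assumes the
fragment itself contains no literal "</body>" text.

```python
import re


def extract_body(tidied: str) -> str:
    """Return the markup between <body ...> and </body>.

    If no body element is found, the input is returned (whitespace-trimmed).
    """
    match = re.search(
        r"<body\b[^>]*>(.*?)</body\s*>",
        tidied,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return tidied.strip()
    return match.group(1).strip()


def drop_single_p(fragment: str) -> str:
    """Remove one wrapping <p>...</p>, but only if it is the sole paragraph.

    Fragments with several paragraphs are returned unchanged, since the
    <p> tags there carry real structure.
    """
    match = re.fullmatch(
        r"<p\b[^>]*>(.*)</p\s*>",
        fragment,
        re.IGNORECASE | re.DOTALL,
    )
    if match and not re.search(r"<p\b", match.group(1), re.IGNORECASE):
        return match.group(1).strip()
    return fragment


if __name__ == "__main__":
    # Hypothetical Tidy output for a one-line fragment.
    sample = (
        "<html><head><title></title></head>\n"
        "<body>\n<p>a &lt; b</p>\n</body>\n</html>"
    )
    print(drop_single_p(extract_body(sample)))
```

Whether dropping the <p> is safe depends on your fragments, so the second
function deliberately does nothing when more than one paragraph is present.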

-- 
Klaus Johannes Rusch
KlausRusch@atmedia.net
http://www.atmedia.net/KlausRusch/

Received on Sunday, 6 May 2001 07:55:27 UTC