Removing selected tags

I'm looking for the version of tidy (or the patch for it) that is mentioned
in the pending tasks:

  Mark Modrall has extended Tidy to support selectively stripping 
  out listed tags and attributes, see his email of March 14th.

This would help me solve the problem below because it would allow me to
remove all <script> elements and their contents.

I use tidy to convert retrieved HTML files into XHTML files, to which I then
apply some XSLT transformations to extract data.  My problem is that some of
these files contain <script> elements whose content is not commented out.
Here is a simple example that shows the problem (bomb.html):

<html><body>
<script>
  document.write("<b>bomb</b>");
</script>
</body></html>

After running:
  tidy --add-xml-decl yes --output-xhtml yes --doctype loose --wrap 0 \
    --show-warnings no --quiet yes bomb.html

tidy generates:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title></title>
</head>
<body>
<script type="text/javascript">
  document.write("<b>bomb<\/b>");
</script>

<b>bomb</b>
</body>
</html>

Now when I load this into IE (with MSXML 3.0 installed in overwrite mode), I
get the following error:

  A name was started with an invalid character. 

  at line 10, character 27
    document.write("<b>bomb<\/b>");

The error points at the "\" in the closing "b" tag.
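
The failure does not look specific to MSXML.  Below is a minimal Python
sketch that checks the <script> element from the tidy output for
well-formedness, and then checks the same element with its body wrapped in a
CDATA section.  Python's xml.etree.ElementTree stands in for MSXML here,
which is an assumption on my part, and is_well_formed is just a helper name
made up for this example:

```python
import xml.etree.ElementTree as ET

# The script element as tidy emitted it (raw strings keep the backslash).
RAW = r'<script type="text/javascript">document.write("<b>bomb<\/b>");</script>'

# The same element with its body wrapped in a CDATA section, so that "<"
# inside the script is treated as character data rather than markup.
WRAPPED = (
    r'<script type="text/javascript"><![CDATA['
    r'document.write("<b>bomb<\/b>");'
    r']]></script>'
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(RAW))      # False: "<\" cannot start a tag name
print(is_well_formed(WRAPPED))  # True
```

So any XML parser should reject the unwrapped script body, whether or not
the "\" is there, since "<b>" inside the string is read as markup.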

I'm not even sure why the "\" was added to the closing "b" tag in the first
place, but it is causing the XML parser to choke.  Does anyone know why the
"\" is added?

If I cannot find the version of tidy requested above, what else could I do
to make sure that I generate valid XML?
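
The only fallback I can think of so far is to strip the <script> elements
myself before tidy sees the file.  Here is a minimal sketch in Python.  The
strip_scripts name is made up for this example, and the regular expression
is an assumption that will miss awkward cases (a ">" inside an attribute
value, or a literal "</script>" inside a script string), so it is no
substitute for the patched tidy:

```python
import re

# Matches an opening <script ...> tag, its content, and the first closing
# </script> tag after it, case-insensitively and across line breaks.
SCRIPT_RE = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts(html: str) -> str:
    """Return html with every <script> element and its content removed."""
    return SCRIPT_RE.sub("", html)


# The bomb.html example from this message.
BOMB = (
    "<html><body>\n"
    "<script>\n"
    '  document.write("<b>bomb</b>");\n'
    "</script>\n"
    "</body></html>\n"
)

print(strip_scripts(BOMB))
```

The stripped text could then be given to tidy in place of the original
file, using the same options as above.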

-- 
Kipp E. Howard - Sr. Software Engineer @ CourtLink
kipp.howard@courtlink.com   
(425) 372-1837 or (800) 774-7317 ext 1837 

Received on Friday, 15 June 2001 13:13:54 UTC