- From: Kipp Howard <khoward@courtlink.com>
- Date: Fri, 15 Jun 2001 10:13:16 -0700
- To: "'html-tidy@w3.org'" <html-tidy@w3.org>
I'm looking for the version of (or patch for) tidy that is mentioned in the pending tasks:

    Mark Modrall has extended Tidy to support selectively stripping out
    listed tags and attributes, see his email of March 14th.

This would help me solve the problem below, because it would allow me to remove all <script> elements and their contents.

I use tidy to convert retrieved HTML files into XHTML files, to which I then apply some XSLT transformations to extract data. My problem is that I have encountered some files containing <script> tags whose content is not commented out.

Here is a simple example that shows my problem (bomb.html):

    <html><body>
    <script>
    document.write("<b>bomb</b>");
    </script>
    </body></html>

After running:

    tidy --add-xml-decl yes --output-xhtml yes --doctype loose --wrap 0 --show-warnings no --quiet yes bomb.html

tidy generates:

    <?xml version="1.0"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="generator" content="HTML Tidy, see www.w3.org" />
    <title></title>
    </head>
    <body>
    <script type="text/javascript">
    document.write("<b>bomb<\/b>");
    </script>
    <b>bomb</b>
    </body>
    </html>

Now when I load this into IE (with MSXML 3.0 installed in overwrite mode), I get the following error:

    A name was started with an invalid character.
    at line 10, character 27
    document.write("<b>bomb<\/b>");

The error points at the "\" in the closing "b" tag.

I'm not even sure why the "\" was added to the closing "b" tag in the first place, but it is causing the XML parser to choke. Does anyone know why the "\" is added?

If I cannot find the version of tidy requested above, what else could I do to make sure that I generate valid XML?

--
Kipp E. Howard - Sr. Software Engineer @ CourtLink
kipp.howard@courtlink.com
(425) 372-1837 or (800) 774-7317 ext 1837
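For illustration, one possible workaround is to strip the <script> elements before the file ever reaches tidy, so that no script text ends up in the XHTML. The sketch below is only that, a sketch: it assumes Python and its standard html.parser module are available, and the names ScriptStripper and strip_scripts are hypothetical, not part of tidy or of Mark Modrall's patch. It re-emits everything outside <script> elements as found and makes no attempt to repair the markup, which is left to tidy.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Hypothetical helper: copies the input through, minus <script> elements."""

    def __init__(self):
        # Keep entity and character references as written in the source.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            # Re-emit the start tag exactly as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <script/> is dropped.
        if tag != "script" and not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Script bodies arrive here as plain data and are discarded.
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.in_script:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html):
    """Return html with every <script> element and its contents removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    bomb = '<html><body> <script> document.write("<b>bomb</b>"); </script> </body></html>'
    print(strip_scripts(bomb))
```

The filtered text could then be passed to tidy with the same options as above. Inline event-handler attributes such as onclick are not touched by this sketch.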
Received on Friday, 15 June 2001 13:13:54 UTC