- From: Lee Passey <lee@novomail.net>
- Date: Mon, 21 Apr 2003 09:48:57 -0600
- To: html-tidy@w3.org
- Cc: Sailesh Panchang <sailesh.panchang@deque.com>
Sailesh Panchang wrote:

> Hello List,
> I just downloaded Tidy and am trying to check how it cleans up htm
> file saved from WORD 2000. I am using a simple text file with one
> ordered list and one data table in it. But when I try (from DOS prompt):
>
>     tidy -m test1.htm,
>
> it only displays a bunch of warnings and errors and does not modify
> the file. The errors state that Tidy failed to recognize many of the
> tags inserted by WORD.
> I tried a second file created by WORD 2K that has a few hyperlinks and
> a couple of images. That too did not convert and I got a list of errors.

The problem is that the files generated by M$Word are not really HTML;
they are a Micro$oft-proprietary XML, which just happens to be a
superset of XHTML. These files look OK in a browser only because most
browsers have been specifically designed to ignore unknown elements and
attributes, rather than failing when they are encountered.

Tidy has a mode specifically designed to clean M$Word XML files. From
the command prompt type:

    tidy --word-2000 yes [input.htm] > [output.html]

Tidy is an extraordinarily flexible program, which means that it has a
plethora of command line options. You should carefully review the list
of options at http://tidy.sourceforge.net/docs/quickref.html before
concluding that Tidy will not do what you want.

> The documentation states that in case Tidy encounters errors, the
> conversion is unpredictable. So does it mean it is not going to work?

There are a number of common HTML coding errors that are simply too
ambiguous to be fixed automagically; these errors must be fixed by a
human who presumably knows what was intended. When tidy encounters one
of these errors it prints an error message identifying the line number
where the error occurred, so that a human can look at the problem, but
it normally does not produce _any_ output in these cases.
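For anyone scripting this, the difference between warnings and errors can also be read from tidy's exit status (0 for a clean run, 1 for warnings, 2 for errors). The following is only a minimal sketch: the function name describe_tidy_status and the file names input.htm and output.html are assumptions made for illustration, not anything Tidy defines.

```shell
#!/bin/sh
# Hypothetical helper: turn tidy's exit status into a short message.
# Status values: 0 = no problems, 1 = warnings only, 2 = errors.
describe_tidy_status() {
    case "$1" in
        0) echo "ok: no warnings or errors" ;;
        1) echo "warnings: output was written" ;;
        2) echo "errors: no output unless --force-output yes is given" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Only run tidy if it is installed and the assumed input file exists.
if command -v tidy >/dev/null 2>&1 && [ -f input.htm ]; then
    status=0
    tidy --word-2000 yes input.htm > output.html || status=$?
    describe_tidy_status "$status"
fi
```

Note that the shell creates output.html before tidy runs, so an empty output file after a status of 2 is expected rather than a sign of a second problem.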
Tidy can be forced to produce output even when it cannot fix the errors
by specifying the "--force-output yes" option, but the output will
probably not be correct HTML.

> Is using the WORD 2.0 filter a more reliable option?

No. While part of the problem with Word 2000 output is the non-standard
elements and attributes, Word 2000 is also known for producing "bloated"
XML. This is due at least in part to the fact that M$Word insists on
adding font and class specifications to _every_ paragraph in a file,
even when they are all identical. One of the functions of the
"--word-2000 yes" option is to strip this potentially irrelevant
material from the file. The Word 2000 filter leaves this stuff in the
file (although much of this badness can be removed by running tidy with
the "--drop-font-tags yes" option after using the Word 2000 filter).

> Thanks,
>
> Sailesh
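As a rough illustration of how the options mentioned above might be combined in a script, here is a small sketch. The helper name tidy_options, its scenario names, and the file names are hypothetical; the option names are the ones given in this message, and other Tidy releases may name or support them differently.

```shell
#!/bin/sh
# Hypothetical helper: print the tidy options for a named scenario.
tidy_options() {
    case "$1" in
        word)     echo "--word-2000 yes" ;;                     # clean Word 2000 output
        forced)   echo "--word-2000 yes --force-output yes" ;;  # write output despite errors
        filtered) echo "--drop-font-tags yes" ;;                # file already run through the Word filter
        *)        echo "unknown scenario: $1" >&2; return 1 ;;
    esac
}

# Only run tidy if it is installed and the assumed input file exists.
if command -v tidy >/dev/null 2>&1 && [ -f input.htm ]; then
    opts=$(tidy_options forced)
    status=0
    # $opts is deliberately unquoted so it splits into separate arguments.
    tidy $opts input.htm > output.html || status=$?
    echo "tidy exited with status $status"
fi
```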
Received on Monday, 21 April 2003 11:50:51 UTC