Re: Tidy on WORD-2k htm files

Sailesh Panchang wrote:

> Hello List,
> I just downloaded Tidy and am trying to check how it cleans up an htm 
> file saved from WORD 2000. I am using a simple text file with one 
> ordered list and one data table in it. But when I try (from DOS prompt):
> 
> tidy -m test1.htm,
> it only displays a bunch of warnings and errors and does not modify 
> the file. The errors state that Tidy failed to recognize many of the 
> tags inserted by WORD.
> I tried a second file created by WORD 2K that has a few hyperlinks and 
> a couple of images. That too did not convert and I got a list of errors.


The problem is that the files generated by M$Word are not really HTML; they 
are a Micro$oft-proprietary XML, which just happens to be a superset of XHTML. 
These files look OK in a browser only because most browsers have been 
specifically designed to ignore unknown elements and attributes, rather than 
failing when they are encountered.

Tidy has a mode specifically designed to clean M$Word XML files. From the 
command prompt type:

tidy --word-2000 yes [input.htm] > [output.html]
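
If you would rather have a file for the messages too, the same command can be 
wrapped in a small shell function. This is only a sketch: the function name 
clean_word_html and the ".log" suffix are made up for illustration, and it 
assumes your Tidy build accepts -o (output file) and -f (error file). It 
writes to a new file instead of using -m, so the original Word file is left 
untouched.

```shell
#!/bin/sh
# Sketch only: clean_word_html is a hypothetical helper name.
clean_word_html() {
    src=$1    # file saved from Word 2000
    dst=$2    # where the cleaned markup should go
    # --word-2000 yes : strip the Word-specific markup
    # -o FILE         : write the cleaned markup to FILE instead of stdout
    # -f FILE         : write warnings and errors to FILE instead of stderr
    tidy --word-2000 yes -o "$dst" -f "$dst.log" "$src"
}

# Only run when Tidy is installed and the placeholder input file exists.
if command -v tidy >/dev/null 2>&1 && [ -f input.htm ]; then
    clean_word_html input.htm output.html ||
        echo "Tidy reported problems, see output.html.log" >&2
fi
```

Afterwards output.html.log holds whatever Tidy complained about, which is 
easier to read than a screenful of messages scrolling past at the prompt.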

Tidy is an extraordinarily flexible program, which means that it has a 
plethora of command line options. You should carefully review the list of 
options at http://tidy.sourceforge.net/docs/quickref.html before concluding 
that Tidy will not do what you want.


> The documentation states that  in case Tidy encounters errors, the 
> conversion is unpredictable. So does it mean it is not going to work?


There are a number of common HTML coding errors that are simply too ambiguous 
to be fixed automagically; these errors must be fixed by a human who 
presumably knows what was intended. When Tidy encounters one of these errors 
it prints an error message identifying the line number where the error 
occurred, so a human can look at the problem, but normally does not produce 
_any_ output in these cases.

Tidy can be forced to produce output even when it cannot fix the errors by 
specifying the "--force-output yes" option, but the output will probably not 
be correct HTML.
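
A rough sketch of how the two runs might be combined in a script: try a 
normal run first, and fall back to "--force-output yes" only when Tidy 
reports errors. The helper name tidy_or_force is made up, and the sketch 
assumes Tidy's usual exit status convention (0 for a clean run, 1 for 
warnings, 2 for errors), so check that against your version before relying 
on it.

```shell
#!/bin/sh
# Sketch only: tidy_or_force is a hypothetical helper name.
# Assumes Tidy exits with 0 = clean, 1 = warnings, 2 = errors.
tidy_or_force() {
    src=$1
    dst=$2
    rc=0
    # First attempt: a normal run. Capture the exit status rather than abort.
    tidy --word-2000 yes -o "$dst" "$src" || rc=$?
    if [ "$rc" -ge 2 ]; then
        echo "Tidy reported errors in $src; forcing output to $dst" >&2
        # Second attempt: produce output anyway. The result still needs a
        # human to check it, so the status of the first run is what we return.
        tidy --word-2000 yes --force-output yes -o "$dst" "$src" || true
    fi
    return "$rc"
}
```

Called as "tidy_or_force input.htm output.html", it returns the status of the 
first run, so a calling script can still tell that the output was forced and 
needs to be reviewed by hand.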


> Is using the WORD 2.0 filter  a more reliable option?


No. While part of the flaw in Word 2000 output is the non-standard elements 
and attributes, Word 2000 is also known for producing "bloated" XML. This is 
due at least in part to the fact that M$Word insists on adding font and class 
specifications to _every_ paragraph in a file, even when they are all 
identical. One of the functions of the "--word-2000 yes" option is to strip 
this potentially irrelevant material from the file. The filter, by contrast, 
leaves this stuff in the file (although much of this badness can be removed 
by running Tidy with the "--drop-font-tags yes" option after using the 
filter).
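
If you end up using both options regularly, they can live in a Tidy 
configuration file instead of on the command line. The sketch below assumes 
the file name word-cleanup.cfg (made up for illustration) and a Tidy build 
that still accepts drop-font-tags; later releases may have deprecated or 
renamed that option, so check the quick reference for your version.

```shell
#!/bin/sh
# Sketch only: word-cleanup.cfg is a hypothetical file name.
# Config files take one "option: value" pair per line, without the
# leading "--" used on the command line.
cat > word-cleanup.cfg <<'EOF'
word-2000: yes
drop-font-tags: yes
EOF

# Only run when Tidy is installed and the placeholder input file exists.
if command -v tidy >/dev/null 2>&1 && [ -f input.htm ]; then
    tidy -config word-cleanup.cfg -o output.html input.htm ||
        echo "Tidy reported warnings or errors (exit status $?)" >&2
fi
```

Keeping the options in a file also makes it easy to try them one at a time 
and see which one is responsible for a given change in the output.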


> Thanks,
> 
> Sailesh

Received on Monday, 21 April 2003 11:50:51 UTC