Re: tidy is not happy tidying up html generated by Microsoft word

Erwin - to use the power of Tidy against the putrid HTML (if you 
can call it that) that Word produces, you have to give Tidy a 
config file.  I am about 95% successful in cleaning it up 
with a config file that looks like this:

word-2000: yes
literal-attributes: no
indent: auto
indent-spaces: 2
tidy-mark: no
wrap: 72
markup: yes
output-xml: no
input-xml: no
doctype: omit
show-warnings: yes
numeric-entities: yes
uppercase-tags: no
uppercase-attributes: no
char-encoding: latin1
clean: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: yes
break-before-br: no
drop-empty-paras: yes
drop-font-tags: yes
enclose-text: yes
write-back: yes
new-empty-tags: o, style, head
error-file: err.txt
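
For anyone who wants to script this, here is a rough sketch of running Tidy with a config file like the one above from Python. The file names (word.cfg, report.htm) and the helper names are made up for illustration. The only Tidy feature it relies on is the "-config <file>" command-line switch. Since the config sets write-back: yes, Tidy will overwrite the input file, so it is probably wise to run this on a copy.

```python
import shutil
import subprocess


def build_tidy_command(config_path, html_path, tidy_exe="tidy"):
    """Return the argument list for running Tidy with a config file."""
    # "-config <file>" tells Tidy to read its options from that file.
    return [tidy_exe, "-config", str(config_path), str(html_path)]


def run_tidy(config_path, html_path, tidy_exe="tidy"):
    """Run Tidy on html_path.

    Returns Tidy's exit code, or None if the executable is not on PATH.
    """
    if shutil.which(tidy_exe) is None:
        return None
    cmd = build_tidy_command(config_path, html_path, tidy_exe)
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    return subprocess.run(cmd).returncode
```

Usage would be something like run_tidy("word.cfg", "copy_of_report.htm"), then a look at err.txt for whatever Tidy complained about.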


The only thing I have found that this will not do is remove 
the "smart tags" that Office XP puts in there (I assume this is 
because the word-2000 option was written before the marvelous 
invention of these dumb smart tags). Also, my config file removes 
the "<head>" tags and everything between them.
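
One possible workaround for the smart tags is to strip them before (or after) running Tidy. The sketch below is only that, a sketch: it assumes the smart tags show up as elements with an "st1:" namespace prefix (for example <st1:City>), which is what I would expect from Word but is not guaranteed, and it uses a regular expression rather than a real parser. The function name and the prefix parameter are my own invention.

```python
import re


def strip_smart_tags(html, prefix="st1"):
    """Remove <prefix:...> and </prefix:...> tags, keeping the text inside them."""
    # Matches opening, closing and self-closing tags in the given namespace
    # prefix. ASSUMPTION: smart tags use the "st1" prefix; pass another
    # prefix if your document declares a different one.
    pattern = re.compile(
        r"</?" + re.escape(prefix) + r":[A-Za-z][\w.-]*(?:\s[^<>]*)?/?>"
    )
    return pattern.sub("", html)
```

Only the tags themselves are dropped; the text they wrap is left in place. An attribute value containing a literal ">" would confuse the pattern, so treat it as a quick filter, not a general solution.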

In addition, if you end up downloading the Microsoft clean-up tool, 
it is OK, but it is 100% incompatible with Office XP and will not 
install on anything other than Office 2000.  However, you can extract 
a file called "filter.exe" from the install files and run it from the 
command line.

I am anxious to try the "force-output" and "bare" options myself.

If you have trouble getting the filter.exe file, let me know and I 
can email it to you.

peace,
dude in oregon


> 
> Hi Erwin,
> 
> It turns out Microsoft Word produces HTML that is very
> sub-standard in many ways.  But there are some configurable
> options in Tidy that may help you out; take a look at these.
> 
> word-2000
> http://tidy.sourceforge.net/docs/quickref.html#word-2000
> force-output
> http://tidy.sourceforge.net/docs/quickref.html#force-output
> bare
> http://tidy.sourceforge.net/docs/quickref.html#bare
> 
> Also look at Microsoft's own cleaning app
> http://office.microsoft.com/downloads/2000/Msohtmf2.aspx
> 
> 
> I have been working on a custom app to convert Word output to
> XHTML and learning a lot about what it takes to clean up the junk
> and leave behind useful info.
> 
> Cheers
> Fred
> 
> On Fri, 10 Jan 2003, Erwin Rollauer wrote:
> 
>> 
>> I am currently evaluating Ultraedit and noticed the TIDY that
>> came with it. I tried it against a simple Microsoft Word 2002
>> "save as html" file and got lots of errors. This is just a heads-up
>> notice on the chance that you have not tried it against
>> Microsoft-generated code.
>> 
>> 
>> Erwin Rollauer
>> Senior Systems Analyst
>> Information Systems Resources
>> McGill University
>> 688 Sherbrooke St. West, Suite 500
>> Montreal, QC   H3A 3R1
>> Tel:   514 398-5023 ex 00626
>> Fax:   514 398-8252
>> Email: erwin.rollauer@mcgill.ca
>> 
>> 


Received on Friday, 10 January 2003 15:17:51 UTC