- From: dude <dude@fastmail.ca>
- Date: Fri, 10 Jan 2003 14:00:46 -0500 (EST)
- To: html-tidy@w3.org
- Message-Id: <3E1F185E.00000B.54860@ns.interchange.ca>
Erwin - to utilize the power of Tidy against the putrid HTML (if you can
call it that) of Word, you have to use a config file in association with
Tidy. I am about 95% successful in cleaning it up with a config file that
looks like this:

word-2000: yes
new-empty-tags: o, head
literal-attributes: no
indent: auto
indent-spaces: 2
tidy-mark: no
wrap: 72
markup: yes
output-xml: no
input-xml: no
doctype: omit
show-warnings: yes
numeric-entities: yes
uppercase-tags: no
uppercase-attributes: no
char-encoding: latin1
clean: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: yes
break-before-br: no
drop-empty-paras: yes
drop-font-tags: yes
enclose-text: yes
write-back: yes
new-empty-tags: o, style, head
markup: yes
error-file: err.txt

The only thing that I have found that this will not do is to remove the
"smart-tags" that Windows XP puts in there (I assume this is because the
word-2000 option was written before the marvelous invention of these dumb
smart-tags). Also, my config file removes the "<head>" tags and everything
between them.

In addition, if you end up downloading the Microsoft clean-up tool, it is
OK, but it is 100% incompatible with Office XP and will not install on
anything other than Office 2000. However, you can extract a file called
"filter.exe" from the install files and use it on the command line.

I am anxious to try the "force-output" and "bare" options myself.

If you have trouble getting the filter.exe file, let me know and I can
email it to you.

peace,
dude in oregon

>
> Hi Erwin,
>
> Turns out Microsoft Word produces html that is sub-standard, very
> sub-standard in many ways. But there are some configurable
> options in Tidy that may help you out, take a look at these.
>
> Word-2000
> http://tidy.sourceforge.net/docs/quickref.html#word-2000
> force-output
> http://tidy.sourceforge.net/docs/quickref.html#force-output
> bare
> http://tidy.sourceforge.net/docs/quickref.html#bare
>
> Also look at Microsoft's own cleaning app
> http://office.microsoft.com/downloads/2000/Msohtmf2.aspx
>
>
> I have been working on a custom app to convert Word output to
> XHTML and learning a lot about what it takes to clean up the junk
> and leave behind useful info.
>
> Cheers
> Fred
>
> On Fri, 10 Jan 2003, Erwin Rollauer wrote:
>
>>
>> I am currently evaluating Ultraedit and noticed the TIDY that
>> came with it. I tried it against a simple Microsoft Word 2002
>> "save as html" file and got lots of errors. This is just a heads-up
>> notice on the chance that you have not tried it against
>> Microsoft generated code.
>>
>>
>> Erwin Rollauer
>> Senior Systems Analyst
>> Information Systems Resources
>> McGill University
>> 688 Sherbrooke St. West, Suite 500
>> Montreal, QC H3A 3R1
>> Tel: 514 398-5023 ex 00626
>> Fax: 514 398-8252
>> Email: erwin.rollauer@mcgill.ca
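
For the smart-tags that the config file above leaves behind, one possible
workaround is a small pre-pass that strips them before Tidy runs. The
Python sketch below is only an illustration and is not part of Tidy or
of the config above. It assumes the smart-tag elements carry a namespace
prefix such as "st1:" (for example <st1:City>...</st1:City>); the name
strip_smart_tags is hypothetical, and the pattern may need adjusting for
your files.

```python
import re

# Assumption: smart-tag elements use a prefix like "st1:", "st2:", ...
# This pattern matches both the opening and the closing tag.
SMART_TAG = re.compile(r"</?st\d+:[^>]*>", re.IGNORECASE)


def strip_smart_tags(html: str) -> str:
    """Remove smart-tag start and end tags, keeping the text between them."""
    return SMART_TAG.sub("", html)


if __name__ == "__main__":
    sample = '<p>Visit <st1:City w:st="on">Portland</st1:City> soon.</p>'
    print(strip_smart_tags(sample))
```

The cleaned text could then be saved and passed to Tidy with the config
file as usual. A regular expression is a blunt tool for markup, so it is
worth checking the result on a copy of the file first.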
Received on Friday, 10 January 2003 15:17:51 UTC