- From: dude <dude@fastmail.ca>
- Date: Fri, 10 Jan 2003 14:00:46 -0500 (EST)
- To: html-tidy@w3.org
- Message-Id: <3E1F185E.00000B.54860@ns.interchange.ca>
Erwin - to use the power of Tidy against the putrid HTML (if you
can call it that) of Word, you have to use a config file with
Tidy. I am about 95% successful in cleaning it up
with a config file that looks like this:
word-2000: yes
new-empty-tags: o, head
literal-attributes: no
indent: auto
indent-spaces: 2
tidy-mark: no
wrap: 72
markup: yes
output-xml: no
input-xml: no
doctype: omit
show-warnings: yes
numeric-entities: yes
uppercase-tags: no
uppercase-attributes: no
char-encoding: latin1
clean: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: yes
break-before-br: no
drop-empty-paras: yes
drop-font-tags: yes
enclose-text: yes
write-back: yes
new-empty-tags: o, style, head
markup: yes
error-file: err.txt
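
A minimal sketch of running Tidy with a config like the one above from a script. It assumes the settings are saved as "word.cfg" (a made-up file name) and that a "tidy" executable is on the PATH. With write-back: yes, Tidy overwrites the input file, so work on a copy.

```python
import subprocess


def tidy_command(html_path, config_path="word.cfg"):
    """Build the Tidy command line; word.cfg is a hypothetical file name."""
    return ["tidy", "-config", config_path, html_path]


def clean_word_html(html_path, config_path="word.cfg"):
    """Run Tidy on html_path and return its exit status.

    Tidy exits with 0 for a clean run, 1 when there were warnings
    and 2 when there were errors. With error-file set in the config,
    the diagnostics end up in that file (err.txt above).
    """
    result = subprocess.run(tidy_command(html_path, config_path))
    return result.returncode
```

For example, clean_word_html("copy-of-report.html") would tidy that file in place and return Tidy's exit status.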
The only thing that I have found that this will not do is remove
the "smart tags" that Windows XP puts in there (I assume this is
because the word-2000 option was written before the marvelous
invention of these dumb smart tags). Also, my config file removes
the "<head>" tags and everything between them.
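
One possible workaround for the smart tags is a small pre- or post-processing pass. The sketch below is not part of Tidy. It assumes the smart tags show up as elements with an "st1:" style prefix (for example <st1:City>), which is a guess about Word's output that should be checked against your own files. It removes the tags and keeps the text inside them.

```python
import re

# Assumption: smart tags are written as <st1:Name ...> ... </st1:Name>,
# possibly with other digits after "st". Adjust the pattern if your
# files use a different prefix.
SMART_TAG = re.compile(r"</?st\d+:[^>]*>", re.IGNORECASE)


def strip_smart_tags(html):
    """Delete smart-tag start and end tags, keeping the enclosed text."""
    return SMART_TAG.sub("", html)
```

A regular expression is crude next to a real parser, but it is enough for tags this regular.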
In addition, if you end up downloading the Microsoft clean-up tool,
it is OK, but it is 100% incompatible with Office XP and will not
install on anything other than Office 2000. However, you can extract
a file called "filter.exe" from the install files and use it from the
command line.
I am anxious to try the "force-output" and "bare" options myself.
If you have trouble getting the filter.exe file, let me know and I
can email it to you.
peace,
dude in oregon
>
> Hi Erwin,
>
> Turns out Microsoft Word produces HTML that is sub-standard, very
> sub-standard in many ways. But there are some configurable
> options in Tidy that may help you out. Take a look at these:
>
> Word-2000
> http://tidy.sourceforge.net/docs/quickref.html#word-2000
> force-output
> http://tidy.sourceforge.net/docs/quickref.html#force-output
> bare
> http://tidy.sourceforge.net/docs/quickref.html#bare
>
> Also look at Microsoft's own cleaning app:
> http://office.microsoft.com/downloads/2000/Msohtmf2.aspx
>
>
> I have been working on a custom app to convert Word output to
> XHTML and learning a lot about what it takes to clean up the junk
> and leave behind useful info.
>
> Cheers
> Fred
>
> On Fri, 10 Jan 2003, Erwin Rollauer wrote:
>
>>
>> I am currently evaluating UltraEdit and noticed the TIDY that
>> came with it. I tried it against a simple Microsoft Word 2002
>> "save as html" file and got lots of errors. This is just a heads-up
>> notice on the chance that you have not tried it against
>> Microsoft-generated code.
>>
>>
>> Erwin Rollauer
>> Senior Systems Analyst
>> Information Systems Resources
>> McGill University
>> 688 Sherbrooke St. West, Suite 500
>> Montreal, QC H3A 3R1
>> Tel: 514 398-5023 ex 00626
>> Fax: 514 398-8252
>> Email: erwin.rollauer@mcgill.ca
>>
>>
Received on Friday, 10 January 2003 15:17:51 UTC