W3C home > Mailing lists > Public > html-tidy@w3.org > January to March 2000

Re: Best options for converting Word 2000 => X/HTML => XML?

From: Dave Raggett <dsr@w3.org>
Date: Fri, 24 Mar 2000 11:45:55 -0600
To: Stuart Hungerford <stuart.hungerford@webone.com.au>
Cc: html-tidy@w3.org
Message-ID: <OF2C709BF8.FA6C7362-ON86256888.006EC23B@rfdinc.com>

On Thu, 17 Feb 2000, Stuart Hungerford wrote:

> Hi all,
> I've been experimenting with the output of Word 2000, when using the
> "export to compact HTML" and "save as web page" features.
> What I'd like is to end up with well-formed XML, but the tidy options
> I've been using don't always give me what I'd expect.
> Tidy makes a heroic effort on the giant mess Word produces, but I need
> all attributes to be quoted and no repeated attributes.  For example,
> Word
> seems to produce a lot of :
>         <p class=foo1 ... class=foo2> ... </p>
> Which I need as:
>         <p class="foo1" class2="foo2"> ... </p>
> Has anybody else had any experiences they could share?

Tidy's word-2000 option is draconian and strips out the class, lang
and style attributes, see PurgeAttributes(). It also strips out
width attributes from th and td. This was based upon an inspection
of the markup produced by the save as web page export filter from
Word2000. I figured it would be more cost effective to strip these
out and to later add back in class attributes manually.

I would be interested to get suggestions for improvements.


-- Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
tel/fax: +44 122 578 3011 (or 2521) +44 385 320 444 (mobile)
World Wide Web Consortium (on assignment from HP Labs)
Received on Friday, 24 March 2000 12:47:34 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 21:38:47 UTC