Re: Best options for converting Word 2000 => X/HTML => XML?

On Thu, 17 Feb 2000, Stuart Hungerford wrote:

> Hi all,
> 
> I've been experimenting with the output of Word 2000, when using the
> "export to compact HTML" and "save as web page" features.
> 
> What I'd like is to end up with well-formed XML, but the tidy options
> I've been using don't always give me what I'd expect.
> 
> Tidy makes a heroic effort on the giant mess Word produces, but I need
> all attributes to be quoted and no repeated attributes.  For example,
> Word seems to produce a lot of:
> 
>         <p class=foo1 ... class=foo2> ... </p>
> 
> Which I need as:
> 
>         <p class="foo1" class2="foo2"> ... </p>
> 
> Has anybody else had any experiences they could share?

Tidy's word-2000 option is draconian: it strips out the class, lang
and style attributes (see PurgeAttributes()), and it also strips the
width attribute from th and td. This was based on an inspection of the
markup produced by Word 2000's "save as web page" export filter. I
figured it would be more cost-effective to strip these out and add
class attributes back in manually later.
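
As a rough illustration of the purge described above, here is a minimal
Python sketch. It is not Tidy's actual PurgeAttributes() code, which is
written in C; the names AttributePurger and purge_attributes are
hypothetical. It assumes that only start tags need rewriting, that the
first value of a repeated attribute is the one to keep, and that
comments, doctypes and Word's conditional sections can simply be
dropped. Every attribute that survives is written out quoted.

```python
from html import escape
from html.parser import HTMLParser

# Attributes dropped everywhere, and those dropped only on table cells.
PURGE_ALWAYS = {"class", "lang", "style"}
PURGE_ON_CELLS = {"width"}
CELL_TAGS = {"th", "td"}


class AttributePurger(HTMLParser):
    """Re-emit markup with purged attributes removed and the rest quoted."""

    def __init__(self):
        # Keep entity and character references as written in the text.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _start(self, tag, attrs, closing=">"):
        kept, seen = [], set()
        for name, value in attrs:
            if name in PURGE_ALWAYS:
                continue
            if tag in CELL_TAGS and name in PURGE_ON_CELLS:
                continue
            if name in seen:
                # Assumption: keep only the first of a repeated attribute.
                continue
            seen.add(name)
            # A valueless attribute is written out as name="name".
            value = name if value is None else value
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s%s" % (tag, "".join(kept), closing))

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def purge_attributes(markup):
    """Return markup with class/lang/style (and cell widths) stripped."""
    parser = AttributePurger()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Called as purge_attributes('<td width=120 valign=top>x</td>'), the
sketch drops the width and quotes what is left. Note that it removes
the repeated class attributes outright and does not rename the second
one to class2.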

I would be interested to get suggestions for improvements.

Regards,

-- Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
tel/fax: +44 122 578 3011 (or 2521) +44 385 320 444 (mobile)
World Wide Web Consortium (on assignment from HP Labs)

Received on Thursday, 17 February 2000 15:09:10 UTC