RE: XHTML from Microsoft Word

Bob Boiko (author of the Content Management Bible) taught a Content
Management course where we managed content in Word documents and published
it to XML, HTML, and other Word documents.  The technique he uses to
create XML from Word documents is to write custom conversions as Word
macros in VBA.  This works well for simple Word docs and is relatively
easy to do.  However, things get trickier if the document has embedded
images or complex tables in it.

Other products to look at would be those by Stellent, which claims to own
over 90% of the conversion technology on the market.  One of their
products that I have used, Stellent Site Builder (which is part of their
Content Publisher package), is designed to let individuals convert from
many different file formats, including Word, Excel, PowerPoint, Notes,
and others, into HTML.  The output formatting is somewhat customizable,
but I'm not certain that it can produce valid XHTML.  Nonetheless, it
might be worth a look.  (Note: the Site Builder product uses their
proprietary Outside In conversion technology, and it may be possible to
obtain just the conversion engine.  Publisher and Site Builder are
intended to complement their content management system, but can be used
just for conversion if desired.)  www.stellent.com

Unfortunately, both of these solutions require MS Word to be present on
the machine doing the conversion, which rules out any *nix OS.
One alternative, though I doubt it would be very practical, would be to
save the Word docs as RTF and use something like Perl to convert them to
HTML.
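
To give a rough idea of what the RTF route involves, here is a minimal
sketch, written in Python rather than Perl purely for illustration.  It
assumes very simple RTF (plain paragraphs, no tables or images), throws
away all character formatting, and guesses that hex-escaped characters
are Windows-1252.  The function name rtf_to_html and the list of skipped
groups are my own choices, not part of any existing tool.

```python
import re
from html import escape

# Groups that hold metadata rather than body text (an assumed, partial list).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"   # control word, optional number, optional space
    r"|\\'([0-9a-fA-F]{2})"      # hex-escaped character such as \'e9
    r"|\\([^a-zA-Z])"            # control symbol such as \{ or \*
    r"|([{}])"                   # group open / close
    r"|([^\\{}]+)"               # plain text
)


def rtf_to_html(rtf: str) -> str:
    """Turn very simple RTF into a sequence of <p> elements."""
    paragraphs, current = [], []
    depth = 0
    skip_depth = None  # depth of the group currently being skipped, if any

    for match in TOKEN.finditer(rtf):
        word, _param, hexchar, symbol, brace, text = match.groups()
        if brace == "{":
            depth += 1
        elif brace == "}":
            if skip_depth is not None and depth == skip_depth:
                skip_depth = None  # the skipped group has ended
            depth -= 1
        elif skip_depth is not None:
            continue  # inside a metadata group: ignore everything
        elif word in SKIP_DESTINATIONS or symbol == "*":
            skip_depth = depth  # skip the rest of this group
        elif word == "par":
            paragraphs.append("".join(current))
            current = []
        elif hexchar is not None:
            byte = bytes([int(hexchar, 16)])
            current.append(byte.decode("cp1252", errors="replace"))
        elif symbol in ("{", "}", "\\"):
            current.append(symbol)  # escaped literal character
        elif text is not None:
            # Raw line breaks in RTF source are not significant.
            current.append(text.replace("\r", "").replace("\n", ""))
        # Every other control word (fonts, sizes, bold, ...) is dropped.

    if current:
        paragraphs.append("".join(current))

    return "\n".join(
        f"<p>{escape(p.strip())}</p>" for p in paragraphs if p.strip()
    )


if __name__ == "__main__":
    sample = (
        r"{\rtf1\ansi{\fonttbl{\f0 Times New Roman;}}"
        r"\f0 Agenda & minutes\par Item 1: caf\'e9 <budget>\par}"
    )
    print(rtf_to_html(sample))
```

Real Word-generated RTF is far messier than this (tables, lists, fields,
Unicode escapes), which is a large part of why I doubt the approach is
practical for anything beyond simple documents.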

On a brighter note, the next version of Word is supposed to fully
integrate XML, meaning that you could save to XML, which is much easier to
work with.  That ought to be a lifesaver for many people!

Philip Lanier
Senior, Informatics
University of Washington

On Tue, 27 May 2003, Jon Hanna wrote:

>
> > Is anyone aware of a tool (preferably something that will run under a
> > Unix-ish operating system) that can take the HTML created by Microsoft
> > Word and turn it into clean, Accessible XHTML?
> >
> > My application is for a forum where agendae and minutes of meetings are
> > recorded and posted to an otherwise-Accessible site.
>
> I put a challenge before another list some time ago where I promised to
> donate to the charity of choice of the person who managed to get valid HTML
> out of Word. Nobody claimed the bet (although the most recent Mac version of
> Word came pretty close) but some were able to get very good results by
> putting their output through HTMLTidy
> <http://www.w3.org/People/Raggett/tidy/>. If the original Word doc is pretty
> simple then there should be little or no accessibility problems remaining to
> deal with by hand.
>
>

Received on Tuesday, 27 May 2003 12:04:58 UTC