
Re: Microsoft Word from Office 2000 `HTML' fails to validate

From: John Murdie <john@cs.york.ac.uk>
Date: Wed, 22 May 2002 17:34:13 +0100 (BST)
To: www-validator@w3.org
cc: John Murdie <john@cs.york.ac.uk>
Message-Id: <E17AZ47-0005P0-00@minster.cs.york.ac.uk>

On 22 May, Thanasis Kinias wrote:
> scripsit John Murdie:
>> Surely this is a FAQ, but I've just found that the `HTML' output of
>> Microsoft Word doesn't validate with either the W3C or WDG validators:
> You are correct.  Microsoft Word does not output valid HTML, nor does
> any Microsoft product of which I am aware.
> There used to be a program called the "demoronizer" which would clean up
> MSHTML to create something approximating valid HTML, but I don't know if
> it has kept up with recent versions of MS Office.  The best way to get
> valid HTML from MS Word files is to save as plain text (ASCII or
> Unicode) and add the markup by hand.

Thanks, Thanasis. Yes, I'd already found the `Demoroniser'
(http://www.fourmilab.ch/webtools/demoroniser/) but haven't yet tried it
out; its web page mentions several small-scale fixes it applies to
Microsoft `HTML', but does it also cope with the apparent non-conformity
of the document declarations? After all, such files commence:

<html xmlns:o="urn:schemas-microsoft-com:office:office"

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">

which isn't anything I recognise: there is no DOCTYPE declaration, and
the root element declares a Microsoft-specific XML namespace.
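
For anyone wanting to spot such files before sending them to a validator,
here is a minimal sketch using only Python's standard html.parser module.
It reports whether a DOCTYPE is present, which Microsoft namespaces the
root element declares, and what the Generator and ProgId meta elements
say. The names WordHeaderSniffer and sniff are my own invention, not part
of any tool mentioned in this thread, and the "schemas-microsoft-com"
substring test is only a heuristic based on the header quoted above.

```python
from html.parser import HTMLParser


class WordHeaderSniffer(HTMLParser):
    """Collect the DOCTYPE, xmlns attributes on <html>, and named <meta> values."""

    def __init__(self):
        super().__init__()
        self.doctype = None      # text of the DOCTYPE declaration, if one is seen
        self.namespaces = {}     # xmlns attributes found on the <html> start tag
        self.meta = {}           # lower-cased meta name -> content

    def handle_decl(self, decl):
        # Called for <!...> declarations; we only care about DOCTYPE.
        if decl.lower().startswith("doctype"):
            self.doctype = decl

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            for name, value in attrs.items():
                if name.startswith("xmlns"):
                    self.namespaces[name] = value
        elif tag == "meta":
            name = attrs.get("name")
            if name:
                self.meta[name.lower()] = attrs.get("content") or ""


def sniff(html_text):
    """Return a small report on the head of an HTML document (hypothetical helper)."""
    parser = WordHeaderSniffer()
    parser.feed(html_text)
    parser.close()
    return {
        "has_doctype": parser.doctype is not None,
        # Heuristic: namespace URNs seen in the Word header quoted above.
        "office_namespaces": sorted(
            name for name, value in parser.namespaces.items()
            if value and "schemas-microsoft-com" in value
        ),
        "generator": parser.meta.get("generator"),
        "prog_id": parser.meta.get("progid"),
    }


if __name__ == "__main__":
    # The <html> tag is closed here only so that the sample parses;
    # the rest of the real start tag is not reproduced.
    sample = (
        '<html xmlns:o="urn:schemas-microsoft-com:office:office">\n'
        '<meta http-equiv=Content-Type content="text/html; charset=windows-1252">\n'
        '<meta name=ProgId content=Word.Document>\n'
        '<meta name=Generator content="Microsoft Word 9">\n'
    )
    print(sniff(sample))
```

This only detects the symptoms; it repairs nothing, and a file that passes
this check may still fail validation for other reasons.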

John A. Murdie
Experimental Officer (Software)
Department of Computer Science
University of York
Received on Wednesday, 22 May 2002 12:38:00 UTC
