Re: Microsoft Word from Office 2000 `HTML' fails to validate

On 22 May, Thanasis Kinias wrote:
> scripsit John Murdie:
>> Surely this is a FAQ, but I've just found that the `HTML' output of
>> Microsoft Word doesn't validate with either the W3C or WDG validators:
> 
> You are correct.  Microsoft Word does not output valid HTML, nor does
> any Microsoft product of which I am aware.
> 
> There used to be a program called the "demoronizer" which would clean up
> MSHTML to create something approximating valid HTML, but I don't know if
> it has kept up with recent versions of MS Office.  The best way to get
> valid HTML from MS Word files is to save as plain text (ASCII or
> Unicode) and add the markup by hand.
> 

Thanks, Thanasis. Yes, I'd already found the `Demoroniser'
(http://www.fourmilab.ch/webtools/demoroniser/) but haven't yet tried it
out. Its web page mentions several small-scale fixes that it applies to
Microsoft `HTML', but does it also cope with the apparent non-conformity
of the declarations at the top of the document? After all, such files
commence:

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
...

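which isn't anything I recognise: there is no DOCTYPE declaration, and
the xmlns attributes on the html element are unfamiliar to me.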
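
The excerpt above opens directly with the html start tag, so it has no
DOCTYPE declaration, and that tag carries three xmlns attributes. As a
rough illustration only, a small Python sketch along the following lines
could flag both points before a file is sent to a validator. The function
name inspect_prologue and the regular expressions are assumptions made
for this example; they are not taken from the Demoroniser or from either
validator, and a real HTML parser would be more robust.

```python
import re

def inspect_prologue(html_text):
    """Report whether the text opens with a DOCTYPE declaration and
    which xmlns attributes appear on the <html> start tag.

    Hypothetical helper for illustration; regex-based, not a parser.
    """
    # A conforming HTML 4 document begins with a DOCTYPE declaration.
    has_doctype = re.match(r"\s*<!DOCTYPE\s", html_text, re.IGNORECASE) is not None
    # Find the <html ...> start tag, which may span several lines.
    tag = re.search(r"<html\b([^>]*)>", html_text, re.IGNORECASE)
    attrs = tag.group(1) if tag else ""
    # Collect xmlns and xmlns:prefix attributes with double-quoted values.
    namespaces = re.findall(r'\b(xmlns(?::\w+)?)\s*=\s*"([^"]*)"', attrs)
    return has_doctype, dict(namespaces)

# The first lines of the Word 9 output quoted above.
sample = '''<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
'''

has_doctype, namespaces = inspect_prologue(sample)
print("DOCTYPE present:", has_doctype)
for name, uri in namespaces.items():
    print(name, "=", uri)
```

This only reports what is in the prologue; it makes no attempt to repair
the file.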

--  
John A. Murdie
Experimental Officer (Software)
Department of Computer Science
University of York
England

Received on Wednesday, 22 May 2002 12:38:00 UTC