Re: Configuration question (repost)

From: Lee Passey <lee@novomail.net>
Date: Thu, 14 Apr 2005 14:34:45 -0600
Message-ID: <425ED3E5.3010301@novomail.net>
To: html-tidy@w3.org

Huw Wyn Jones wrote:

> I quote from the documentation. Word-2000 - "This option specifies if 
> Tidy should go to great pains to strip out all the surplus stuff 
> Microsoft Word 2000 inserts when you save Word documents as "Web pages".
> I reckon most people would consider <p style="margin: 0cm 0cm 0pt 
> 36pt;" class="MsoNormal"> to be surplus !! I was hoping that Tidy 
> would get rid of all that stuff.
> My original question was concerning configuration - is there anything 
> more I can do with Tidy to get rid of the surplus Word stuff ? Am I 
> making a mistake with my config file ? Have I missed something ?

Sort of, although you can hardly be blamed for missing it.

The changes Tidy makes to "fix" Microsoft Word's almost-HTML output are 
quite severe -- for example, stripping _all_ <span> elements -- and 
could do serious damage to non-MSWord files (in fact, they are somewhat 
too aggressive even for MSWord files). For this reason, Tidy won't 
apply MSWord cleanup to files that aren't clearly MSWord files.

To determine whether it's an MSWord file, Tidy first looks for an 
"xmlns:o" attribute on the <html> element. If there is one, the file is 
considered to be an MSWord file. If there is not, Tidy looks for a 
<meta> tag that has "generator" as its name and the word "microsoft" 
in its content. If it finds neither of these two indicators, it won't 
apply the MSWord cleanup, even if you've used the 'word-2000: yes' option.
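
To check a file against those two indicators before running Tidy, 
something like the following Python sketch might help. It is only a 
restatement of the test as described above, not Tidy's actual code; the 
names WordDetector and looks_like_word_html are made up for the example, 
and matching case-insensitively is an assumption on my part.

```python
from html.parser import HTMLParser


class WordDetector(HTMLParser):
    """Hypothetical helper: looks for the two MSWord indicators."""

    def __init__(self):
        super().__init__()
        self.looks_like_word = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before we see them.
        attrs = dict(attrs)
        if tag == "html" and "xmlns:o" in attrs:
            # Indicator 1: an xmlns:o attribute on the <html> element.
            self.looks_like_word = True
        elif tag == "meta":
            # Indicator 2: <meta name="generator"> mentioning "microsoft".
            # Case-insensitive comparison is an assumption of this sketch.
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").lower()
            if name == "generator" and "microsoft" in content:
                self.looks_like_word = True


def looks_like_word_html(markup):
    """Return True if either indicator is present in the markup."""
    detector = WordDetector()
    detector.feed(markup)
    detector.close()
    return detector.looks_like_word
```

If this returns False for your input, the 'word-2000: yes' option will 
most likely have no visible effect, for the reason given above.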

I see two possible solutions to your problem. First, because Tidy is 
guaranteed to produce valid XHTML if you have selected that option, you 
could use XSLT to remove the MSWord markup you don't like _after_ you 
have Tidy'ed the input. Second, you could add the attribute 
"xmlns:o='urn:schemas-microsoft-com:office:office'" to the <html> tag 
that surrounds the pasted text, although this _may_ cause markup that 
you would rather preserve to be lost.
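
For the second approach, the attribute could be added by a small 
preprocessing step before the text is handed to Tidy. This is a rough 
Python sketch, not anything Tidy provides; add_office_namespace is a 
made-up name, and the regular expression assumes the <html> start tag 
has no ">" inside an attribute value.

```python
import re

# The attribute Tidy looks for, as described above.
OFFICE_NS = 'xmlns:o="urn:schemas-microsoft-com:office:office"'


def add_office_namespace(markup):
    """Insert xmlns:o into the first <html> start tag if it is missing."""
    match = re.search(r"<html\b[^>]*>", markup, flags=re.IGNORECASE)
    if match is None:
        # No <html> tag at all: leave the input untouched.
        return markup
    if "xmlns:o" in match.group(0).lower():
        # Already marked as an MSWord document.
        return markup
    # Insert the attribute immediately after the tag name.
    insert_at = match.start() + len("<html")
    return markup[:insert_at] + " " + OFFICE_NS + markup[insert_at:]
```

The result would then be run through Tidy with 'word-2000: yes' as 
before, keeping in mind the caveat about losing markup.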

> Huw
Received on Thursday, 14 April 2005 20:36:47 UTC
