Re: Strange tag: <o:p> from Lee Passey on 2002-10-15 (html-tidy@w3.org from October to December 2002)

From: Lee Passey <lee@www.dysfunctionals.org>
Date: Tue, 15 Oct 2002 03:52:24 -0600
To: html-tidy@w3.org
Message-ID: <3DABE558.1000304@www.dysfunctionals.org>
Bjoern Hoehrmann wrote:

> * Charles Reitzel wrote:
> 
>>>>The "o:p" stand for the p tag inside the o namespace.  Is a namespace by
>>>>that name defined anywhere in the file?  You can recognize it by the xmlns
>>>>prefix.
>>>>Most probably it's a left over from a faulty XSL transformation.
>>>>Without seeing the file it's a bit difficult to tell you why Tidy does not
>>>>mark all of them as errors.  Probably it's the order of the tags that makes
>>>>them faulty or not.  ex. <o:p> inside <body> can be excepted, but outside
>>>><body> not.
>>>>
> 
>>Word2000 will use the o: namespace prefix (o for Office).  Using 
>>--word-2000 yes should also fix the problem.  This option will also clean 
>>up a bunch of the cruft Office puts in HTML files.  Often used with --clean 
>>and --bare.
>>
> 
> Hmm.
> 
>   % tidylib --word-2000 yes
>   <o:p>...</o:p>
>   line 1 column 1 - Warning: <o:p> is not approved by W3C
>   line 1 column 1 - Warning: inserting missing 'title' element
>   ^Z
>   Info: Document content looks like HTML proprietary
>   2 warnings, 0 errors were found!
>   
>   <html>
>   <head>
>     <meta content=
>     "HTML Tidy for Windows (vers 1st October 2002), see www.w3.org"
>     name="generator">
>   
>     <title></title>
>   </head>
>   
>   <body>
>     <o:p>...</o:p>
>   </body>
>   </html>
> 
> Is this really intended? Does it need a proper xmlns:o to work? Or am I
> encountering a bug? Without --word-2000 I get
> 
>   line 1 column 1 - Error: <o:p> is not recognized!
>   line 1 column 1 - Warning: discarding unexpected <o:p>
>   line 1 column 1 - Warning: plain text isn't allowed in <head> elements
>   line 1 column 1 - Warning: inserting missing 'title' element
>   line 1 column 9 - Warning: discarding unexpected </o:p>
> 
> That's even worse. And another one, using --word-2000 I also get
> 
>   Info: Doctype given is "-//W3C//DTD HTML 3.2//EN"
>   Info: Document content looks like HTML 3.2
> 
> Neither is true; the first reminds me to a bug I really thought I had
> fixed and my May 2002 standalone build does not tell me about any given
> document type declaration; Did you miss that bug fix when building the
> library code?
> 
> regards.
> 


Well, I'm not sure if what you're seeing is a bug or not.  It _is_ 
working as implemented.

The problem is that M$Word used 4 new namespaces in its documents: v, o, 
  w, and dt.  Of these four, the only tag I have seen outside an HTML 
comment block is <o:p>.

When Tidy first encounters the word-2000 option during startup, it adds 
<o:p> to its list of acceptable tags so the parser won't puke (you'll 
now get the "not approved" warning).

But when you tell Tidy that the document is a Word document, it doesn't 
really believe you.  Instead, when it gets to the cleanup phase, it 
checks to make sure it really _is_ a word document before applying the 
drastic cleanup measures.  It does this by looking for the "xmlns:o" 
attribute on the <html> tag.  If it cannot find it, it checks for a 
<meta> tag where the name is "generator" and the value contains the 
string "Microsoft".  If at least one of these conditions is not met, the 
Word-2000 cleanup is not performed, and your <o:p> tags are not removed.

I'm not sure this checking within Tidy for a real M$ file is such a good 
idea.  Micro$oft has developed a pluging for Word that produces what 
they call "filtered" html.  It doesn't contain any of the proprietary 
tags that Word otherwise uses, but it is still pretty bloated, with font 
specifications for every paragraph.  It is also lacking the Word 2000 
signature so using the --word-2000 option doesn't provide the drastic 
kind of cleanup the file still needs.
Received on Tuesday, 15 October 2002 06:05:13 UTC