
Re: Help with tidy?

From: Lee Passey <lee@novomail.net>
Date: Wed, 07 Jul 2004 14:49:35 -0600
Message-ID: <40EC61DF.70107@novomail.net>
To: Paul Reger <paulr@olivetree.com>
Cc: html-tidy@w3.org

Paul Reger wrote:

>I am a new user of Tidy.  I wish to use it as the basis for a parser of HTML documents.  The parser will be part of a conversion tool to convert from HTML to another markup language that is proprietary to our company.
>I have some questions, and any help lent would be most appreciated.  If you could point me at documents or other code, that would be most helpful.
>Tidy is reporting errors in a sample file that I am feeding it.  When I use the -xml switch, tidy reports the document with 4 errors and w/o the -xml switch, tidy reports the document has 1,481 errors.
>When I do not include the -xml switch, tidy reports this one error (several times):
>line 1275 column 7 - Error: <o:p> is not recognized!
You are apparently trying to fix up the bloated, proprietary XHTML 
variant produced by Micro$oft Word (the <o:p> tag is one of Micro$oft's 
proprietary additions). Try adding the option "word-2000: yes" to your 
config file, or adding "--word-2000 yes" to the command line. Be aware 
that Tidy is quite aggressive when cleaning up after M$Word, so you may 
lose some markup that you wanted to keep.
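
If the conversion tool ends up driving Tidy from a script, the command 
line suggested above could be assembled roughly as follows. This is only 
a sketch: the helper name and the file names are made up for 
illustration, and writing the result with "-o" is my assumption about 
how you would want to capture the output.

```python
import shlex


def tidy_word_command(input_path, output_path):
    """Build an argv list that runs Tidy with Word 2000 cleanup enabled.

    Hypothetical helper; it only builds the command and does not run it.
    """
    return [
        "tidy",
        "--word-2000", "yes",  # the option named in the reply above
        "-o", output_path,     # assumption: write the cleaned markup to a file
        input_path,
    ]


if __name__ == "__main__":
    # Placeholder file names, not taken from the original message.
    print(shlex.join(tidy_word_command("sample.html", "sample-clean.html")))
```

The list can be handed to subprocess.run() unchanged. Compare the 
output against the input before trusting it, since the Word cleanup may 
drop markup you care about.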

>When I do include the -xml switch, tidy reports the following 4 errors:
>line 1268 column 1 - Error: unexpected </head> in <link>
>line 15717 column 1 - Error: unexpected </div> in <hr>
>line 15719 column 1 - Error: unexpected </body> in <hr>
>line 15721 column 1 - Error: unexpected </html> in <hr>
The -xml switch tells Tidy that the input is well-formed XML. XML 
requires that all tags be closed, even those which HTML defines as 
'empty' tags, which includes <link> and <hr>. To make Tidy accept your 
input you must either omit the -xml switch, or edit the input file to 
close the 'empty' tags; for example <hr /> or  <link type="text/css" 
rel="stylesheet" href="styles.css" />. The space before the slash 
character is not required by XML, but it does no harm and some older 
browsers get confused if it is not there.
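
If you go the route of editing the input, the change can be scripted. 
The following is a rough sketch, not a general HTML-to-XML converter: it 
handles only the two elements named in your error messages, the function 
name is invented for illustration, and it will be fooled by a ">" inside 
an attribute value.

```python
import re

# Assumption: only the two 'empty' elements from the error report are handled.
EMPTY_TAGS = ("hr", "link")

# Match an opening tag for one of the elements above, whether or not it is
# already closed with a trailing slash. Closing tags such as </hr> are ignored.
_EMPTY_TAG_RE = re.compile(
    r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(EMPTY_TAGS),
    re.IGNORECASE,
)


def close_empty_tags(markup):
    """Rewrite <hr> and <link ...> as <hr /> and <link ... />."""

    def _close(match):
        name = match.group(1)
        attrs = match.group(2).rstrip()
        # Keep the space before the slash, as recommended above.
        return "<%s%s />" % (name, attrs)

    return _EMPTY_TAG_RE.sub(_close, markup)


if __name__ == "__main__":
    print(close_empty_tags('<link type="text/css" rel="stylesheet" href="styles.css">'))
    print(close_empty_tags("<hr>"))
```

Tags that are already closed come out in the same "<hr />" form, so 
running it twice does no harm.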

>Thanks in advance for any help,
Received on Wednesday, 7 July 2004 16:50:36 UTC
