- From: Lee Passey <lee@novomail.net>
- Date: Wed, 07 Jul 2004 14:49:35 -0600
- To: Paul Reger <paulr@olivetree.com>
- Cc: html-tidy@w3.org
Paul Reger wrote:

>Hi,
>
>I am a new user of Tidy. I wish to use it as the basis for a parser of HTML documents. The parser will be part of a conversion tool to convert from HTML to another markup language that is proprietary to our company.
>
>I have some questions, and any help lent would be most appreciated. If you could point me at documents or other code, that would be most helpful.
>
>Tidy is reporting errors in a sample file that I am feeding it. When I use the -xml switch, tidy reports the document with 4 errors and w/o the -xml switch, tidy reports the document has 1,481 errors.
>
>When I do not include the -xml switch, tidy reports this one error (several times):
>
>line 1275 column 7 - Error: <o:p> is not recognized!
>

You are apparently trying to fix up the bloated, proprietary XHTML variant produced by Micro$oft Word (the <o:p> tag is one of Micro$oft's proprietary additions). Try adding the option "word-2000: yes" to your config file, or adding "--word-2000 yes" to the command line. Be aware that Tidy is quite aggressive when cleaning up after M$Word, so you may lose some markup that you wanted to keep.

>When I do include the -xml switch, tidy reports the following 4 errors:
>
>line 1268 column 1 - Error: unexpected </head> in <link>
>line 15717 column 1 - Error: unexpected </div> in <hr>
>line 15719 column 1 - Error: unexpected </body> in <hr>
>line 15721 column 1 - Error: unexpected </html> in <hr>
>

The -xml switch tells Tidy that the input is well-formed XML. XML requires that every tag be closed, even those which HTML defines as 'empty' tags, such as <link> and <hr>. Because your file leaves them open, Tidy treats each following end tag as unexpected. To make Tidy accept your input you must either omit the -xml switch, or edit the input file to close the 'empty' tags; for example <hr /> or <link type="text/css" rel="stylesheet" href="styles.css" />. The space before the slash character is not required by XML, but it does no harm and some older browsers get confused if it is not there.

>
>Thanks in advance for any help,
>

HTH.
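
P.S. If it helps to see the well-formedness rule in action outside of Tidy, here is a minimal sketch in Python. It uses Python's standard xml.etree.ElementTree parser, not Tidy, so it only illustrates the XML rule itself; the helper name is_well_formed and the sample strings are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML-style 'empty' tag left open: legal HTML, but not well-formed XML.
html_style = "<div><hr></div>"

# The same tag closed XML-style, with and without the optional space.
xml_style = "<div><hr /></div>"
xml_no_space = "<div><hr/></div>"

# The <link> example from the message, closed XML-style.
link_closed = '<head><link type="text/css" rel="stylesheet" href="styles.css" /></head>'

for sample in (html_style, xml_style, xml_no_space, link_closed):
    print(is_well_formed(sample), sample)
```

Only the first sample should be rejected: the parser sees </div> while <hr> is still open, which is the same kind of mismatch behind the "unexpected </div> in <hr>" messages quoted above.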
Received on Wednesday, 7 July 2004 16:50:36 UTC