Re: extra tags in output from Jany Quintard on 2000-01-20 (html-tidy@w3.org from January to March 2000)

From: Jany Quintard <quintard.j@cgi.fr>
Date: Thu, 20 Jan 2000 17:57:52 +0100 (CET)
To: html-tidy@w3.org
Message-ID: <Pine.LNX.3.96.1000120175720.6688N-100000@pc555zh74.adtools.fr.ibm.com>

On Wed, 19 Jan 2000, Peter Levine wrote:

> Hi,
> 
> When I set output-xml: yes why does the output include <html>, <head>,
> <title> and <body> tags when my original file doesn't include these
> tags?
> 
> I'm using tidy as a last cleanup step after stripping those tags from an
> HTML file. The idea is to get my 'almost' XML' file cleaned up by tidy
> before presenting it to an  XML parser.
> 
> TIA,
> Pete
> 
XML files are SGML files which use a special SGML declaration.
In this declaration, you have the following code :
     FEATURES
         MINIMIZE
             DATATAG NO
             OMITTAG NO

So you are not allowed to omit tags (and elements). Actually, in a SGML
file, are many elements that you do not see, because of OMITTAG sttings.
Anyway, they are present and when a parser builds a tree from your
document, those things are there.
In XML, all must be explicite.
This is why in an XML DTD, you never see the - -, - O, O - that you can
encounter in a more loose SGML DTD. Compare :

SGML : (http://www.w3.org/TR/REC-html40/loose.dtd)
<!ELEMENT OL - - (LI)+                 -- ordered list -->

XML (http://www.w3.org/TR/xhtml1/DTD/transitional.dtd)
<!-- Ordered (numbered) list -->
<!ELEMENT ol (li)+>

The two DTD describe the HTML transitional version 4 the two forms.
You can notice that the use of cases and comments is more strict in the
XML version.

So, if you strip your XML file, I guess the will say it is not valid, even
if it is well formed. Depends on what you intend to do.

Jany.

Received on Thursday, 20 January 2000 11:58:27 UTC