RE: Are the public HTML DTDs valid XML? from Christian Wolfgang Hujer on 2001-12-07 (www-html@w3.org from December 2001)

From: Christian Wolfgang Hujer <Christian.Hujer@itcqis.com>
Date: Fri, 7 Dec 2001 12:27:11 +0100
To: "Shinichi Matsui" <matsui@isl.mei.co.jp>, "Ken Klose" <ken.klose@imedium.com>, <www-html@w3.org>
Message-ID: <000201c17f12$1cd2cb00$d02750d9@andromedacwh>

Hi Shinichi,

> -----Original Message-----
> Hi Christian,
>
> Thank you very much for your kind and detailed tutorial. Although
> there are some points I cannot fully agree with, I think it is very
> useful as a whole.
>
> One thing I want to point out is that W3C Recommendation
> "XHTML Basic" does not have a version number. Therefore,
> "XHTML Basic 1.0" is not correct, though the DOCTYPE
> declaration you wrote (which has "1.0" and "10") is not wrong.
>
> <http://www.w3.org/TR/xhtml-basic/>
>
> I (as one of the XHTML Basic editors) appreciate that you prefer
> XHTML Basic. :-)
I am sorry.

The reason why I wrote the wrong name "XHTML Basic 1.0" is that I regularly
write that DOCTYPE declaration, but only rarely read the recommendation
again. But XHTML Basic really is the true name. It's enough to read the
heading of the recommendation to know that. Shame on me.

I am now interested in the reason *why* it is "XHTML Basic" and not "XHTML
Basic 1.0". I guess it is because this quite device-independent "subset" of
HTML is, on the one hand, based on XHTML Mod and therefore on (parts of) HTML
4.01, so it includes about 10 years of experience, and, on the other hand, is
not going to be extended in the next few years anyway, except for custom
module extensions?

I would also like to know which other points you cannot fully agree with.
You wrote *some* points, so I guess there is more than one.

But I hope you won't list the points that are specific to XHTML Basic, like
<hr />, which doesn't exist in XHTML Basic, since my intention was to write
about XHTML in general, not specifically XHTML Basic ;)

Perhaps I should rewrite the "tutorial" in a more precise way and tell more
about the differences between XHTML Basic, XHTML 1.0 and XHTML 1.1. What do
you think?


Okay, perhaps I should have written "do not use them *yet*" in section 9
about CDATA sections.
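
To illustrate what a CDATA section means to an XML parser, here is a minimal
sketch in Python (chosen for illustration only). It assumes the standard
library's xml.etree.ElementTree as a stand-in for "an XML parser"; the helper
names text_of and parses and the sample script are hypothetical. For such a
parser a CDATA section is just character data, so < and & need no escaping
inside it.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, "<" and "&" are ordinary characters.
WITH_CDATA = (
    '<script type="text/javascript"><![CDATA[\n'
    'if (a < b && b < c) { done(); }\n'
    ']]></script>'
)

# The same script without the CDATA section is not well-formed XML.
WITHOUT_CDATA = (
    '<script type="text/javascript">\n'
    'if (a < b && b < c) { done(); }\n'
    '</script>'
)


def text_of(markup: str) -> str:
    """Parse a single element and return its character data."""
    return ET.fromstring(markup).text


def parses(markup: str) -> bool:
    """Return True if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(repr(text_of(WITH_CDATA)))
    print(parses(WITHOUT_CDATA))
```

So an XML parser has no problem with CDATA sections; the caveat in section 9
concerns browsers, not XML itself.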

But as the disclaimer said:
> > No warranty for anything.


Regards and thanks for your correction
(it's good to be good but it's better to be better)

Christian


> Regards,
>
> Shinichi Matsui
> matsui@isl.mei.co.jp
>
> "Christian Wolfgang Hujer" <Christian.Hujer@itcqis.com> wrote:
>
>
> > Hello Ken,
> >
> >
> > as Arnold already said, HTML is an SGML application, but not XML.
> >
> >
> > If you want to validate your pages using XML parsers, you need to use
> > XHTML instead of HTML. HTML itself won't be developed any further, anyway,
> > except for possible bug-fixes in the latest non-XML HTML version, which is
> > HTML 4.01, the one you used.
> >
> >
> > To migrate from HTML to XHTML, the future of HTML, follow these simple
> > rules:
> >
> > 0. Terms
> > element: something like <body>...</body> or <p>...</p>
> > attribute: something like src="..." in <img src="..." />
> > tag: start tag, end tag or empty element tag
> > start tag: <body> or <table border="border">
> > end tag: </body>
> > empty element tag: <hr /> or <br clear="all" />
> >
> > 1. Never omit tags.
> > The following is valid HTML, but not valid XHTML:
> > <title>My first HTML document</title>
> > Hello world!
> >
> > 2. All names of elements and attributes are lowercase. Write <html>
> > instead of <HTML>. This does not apply to <!DOCTYPE, since that is not an
> > HTML element but an SGML / XML declaration.
> >
> > 3. Always use quotes for attributes.
> > Do not write <body bgcolor=white>, write <body bgcolor="white">
> >
> > 4. Always close elements
> > Do not write <p>First paragraph<p>Second paragraph, write <p>First
> > paragraph</p><p>Second paragraph</p>
> >
> > 5. Even close empty elements
> > Do not write <br>, write empty element tags like this: <br />. Formally
> > you could also write <br></br>, but most browsers will then render two
> > line breaks, and you could also write <br/>, but Netscape Navigator then
> > won't render any line break. So write <br />, <hr />, <img src="..." />
> > and so on.
> >
> > 6. Attributes always have values
> > Even "boolean" attributes. Write <hr noshade="noshade" /> or <td
> > nowrap="nowrap"> if you use them at all.
> >
> > 7. Use the appropriate doctype declaration
> > Use one of the following:
> >
> > For XHTML 1.0 Strict, the XML version of HTML 4.01 Strict:
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> >
> > For XHTML 1.0 Transitional, the XML version of HTML 4.01 Transitional:
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> >
> > For XHTML 1.0 Frameset, the XML version of HTML 4.01 Frameset:
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
> >
> > For XHTML Basic 1.0, which is a quite device-independent version of HTML,
> > use:
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
> > "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
> >
> > For XHTML 1.1, which is the successor of XHTML 1.0 Strict (Frameset and
> > Transitional are not supported anymore):
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
> > "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
> >
> > I personally prefer XHTML Basic 1.0 for most sites, sometimes I use XHTML
> > 1.1, rarely XHTML 1.0.
> >
> >
> > 8. Use the appropriate character encoding declaration
> > If you only use ASCII characters (those with character numbers less than
> > or equal to 127) you are not required to declare anything.
> > If you use some legacy encoding, you have to declare it for XML and, if it
> > is not ISO-8859-1, for HTML, like this:
> > <?xml version="1.0" encoding="iso-8859-2"?>
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
> > "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
> > <html xml:lang="pl" xmlns="http://www.w3.org/1999/xhtml">
> > <head>
> > <title>Polski Dokument</title>
> > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" />
> > </head>
> > <body>
> > <p>Polski Dokument using Polish characters encoded not with character
> > entities but in ISO-8859-2 ("Latin-2").</p>
> > </body>
> > </html>
> >
> > If you use UTF-8, declare it for old browsers:
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
> > "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
> > <html xml:lang="de" xmlns="http://www.w3.org/1999/xhtml">
> > <head>
> > <title>Deutsches Dokument</title>
> > <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> > </head>
> > <body>
> > <p>Deutsches Dokument using German characters encoded not in character
> > entities but in UTF-8.</p>
> > </body>
> > </html>
> >
> > I recommend using ASCII only and encoding all Unicode characters with a
> > character number greater than 159 (128 to 159 are of no interest, they are
> > control characters that should not appear in HTML or XHTML documents
> > anyway) using the corresponding entity or numeric character references,
> > e.g. &uuml; for the German u umlaut or &#260; for the Polish A with
> > "ogonek".
> >
> >
> > 9. Do not use CDATA-sections
> >
> > They are the new form for encoding scripts and style sheets,
> > <style type="text/css"><![CDATA[
> > body {background:white;color:black;}
> > ]]></style>
> > is what that looks like, but do not use it, since most browsers have big
> > problems with CDATA sections.
> >
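
Rules 1 and 3 to 6 above are essentially XML well-formedness constraints, so
they can be checked mechanically. A minimal sketch in Python, assuming the
standard library's xml.etree.ElementTree as the parser; is_well_formed and
the SAMPLES snippets are hypothetical names made up for illustration. Note
that this checks well-formedness only, not validity against a DTD, so rule 2
(lowercase names) and rule 7 (doctype) are not covered.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Each snippet breaks exactly one rule, except the last, which follows them.
SAMPLES = {
    "rule 1, omitted tags":
        "<title>My first HTML document</title>\nHello world!",
    "rule 3, unquoted attribute":
        "<body bgcolor=white></body>",
    "rule 4, unclosed elements":
        "<p>First paragraph<p>Second paragraph",
    "rule 5, unclosed empty element":
        "<p>line one<br>line two</p>",
    "rule 6, attribute without value":
        "<hr noshade />",
    "corrected":
        '<body bgcolor="white"><p>First paragraph</p>'
        '<hr noshade="noshade" /><p>line one<br />line two</p></body>',
}

if __name__ == "__main__":
    for label, markup in SAMPLES.items():
        print(f"{label}: {is_well_formed(markup)}")
```

Only the "corrected" snippet is accepted; an HTML (SGML) parser would accept
the HTML forms as well, which is exactly the difference between the two.
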
> >
> > I hope that was helpful.
> >
> >
> > I also recommend the technical reports / recommendations on XML, XHTML
> > 1.0, Modularization of XHTML, XHTML 1.1, XHTML Basic 1.0 and the Ruby
> > Module:
> > XML: http://www.w3.org/TR/REC-xml
> > XHTML 1.0: http://www.w3.org/TR/xhtml1
> > XHTML Mod: http://www.w3.org/TR/xhtml-modularization/
> > XHTML Basic 1.0: http://www.w3.org/TR/xhtml-basic
> > XHTML Ruby: http://www.w3.org/TR/ruby/
> > XHTML 1.1: http://www.w3.org/TR/xhtml11/
> >
> > Explanation:
> > XML is what XHTML is based on.
> > XHTML 1.0 is the first XML-based version of HTML. The recommendation also
> > describes ways to migrate.
> > XHTML Mod is a framework for building new versions of HTML. XHTML has been
> > split up into several modules which can easily be plugged together to
> > create individual versions of HTML.
> > XHTML Basic 1.0 is the first module-based version of HTML; it consists of
> > all core modules and some small modules like basic tables and basic forms.
> > It is ideal for creating device-independent HTML.
> > XHTML Ruby is the first extension module. It is for ruby annotations,
> > something slightly similar to tables for annotating text, especially, but
> > not restricted to, Asian writing systems.
> > XHTML 1.1 is the successor of XHTML 1.0 Strict; it is based on XHTML Mod
> > and includes Ruby.
> >
> >
> > If you have further questions, you know whom to ask: just write to the
> > list or me.
> >
> >
> > Greetings
> >
> > Christian Hujer
> >
> > P.S.:
> > If you get parse errors on XHTML DTD external subsets when using XHTML
> > Modularization, it's your parser, not the documents. These are known bugs
> > in some XML parsers. But as far as I know, Xerces should be okay and work
> > fine.
> >
> > P.P.S.: Disclaimer
> > No warranty for anything.
> >
> > > -----Original Message-----
> > > From: www-html-request@w3.org [mailto:www-html-request@w3.org] On Behalf
> > > Of Ken Klose
> > > Sent: Thursday, December 06, 2001 9:09 PM
> > > To: www-html@w3.org
> > > Subject: Are the public HTML DTDs valid XML?
> > >
> > >
> > > I'm trying to use Xerces (Java) to parse the simple HTML document below.
> > > I've tried both versions 1.4.4 and 2.0.0b3.
> > >
> > > <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
> > >    "http://www.w3.org/TR/html4/strict.dtd">
> > > <HTML>
> > >    <HEAD>
> > >       <TITLE>My first HTML document</TITLE>
> > >    </HEAD>
> > >    <BODY>
> > >       Hello world!
> > >    </BODY>
> > > </HTML>
> > >
> > > Both offer a similar error: "[Fatal Error] strict.dtd:81:5: The
> > > declaration for the entity "ContentType" must end with '>'".  Looking at
> > > the referenced DTDs http://www.w3.org/TR/html4/strict.dtd and
> > > http://www.w3.org/TR/html4/HTMLlat1.ent I see numerous ENTITY declarations
> > > with comments intermingled such as:
> > >
> > > <!ENTITY % ContentType "CDATA"
> > >     -- media type, as per [RFC2045]
> > >     -->
> > >
> > > Is this intermingling valid?  If so why would Xerces barf on it?  The XML
> > > 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006) mentions in section
> > > 2.5 Comments that "[comments] may appear within the document type
> > > declaration at places allowed by the grammar" but the grammar for entity
> > > declarations defined in 4.2 does not include comments between the
> > > opening <! and closing >.
> > >
> > > Any thoughts?
> > >
> > > Thanks,
> > > Ken Klose
> > >
> >
> >
>
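
Ken's original question at the bottom of the quote can be reproduced without
Xerces. The following is only a sketch: it assumes Python's standard
xml.etree.ElementTree (which uses the expat parser) as the XML parser, and
the two tiny documents are made-up test cases with an internal DTD subset,
not the real strict.dtd. The helper name parses is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# SGML allows "-- comment --" inside a markup declaration, as in strict.dtd.
SGML_STYLE = (
    '<!DOCTYPE html [\n'
    '<!ENTITY % ContentType "CDATA"\n'
    '    -- media type, as per [RFC2045]\n'
    '    -->\n'
    ']>\n'
    '<html/>'
)

# XML only knows comments as separate <!-- ... --> constructs.
XML_STYLE = (
    '<!DOCTYPE html [\n'
    '<!ENTITY % ContentType "CDATA">\n'
    '<!-- media type, as per [RFC2045] -->\n'
    ']>\n'
    '<html/>'
)


def parses(document: str) -> bool:
    """Return True if an XML parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("SGML-style comment:", parses(SGML_STYLE))
    print("XML-style comment:", parses(XML_STYLE))
```

The SGML-style declaration is rejected and the XML-style one is accepted,
which matches the Xerces error: the HTML 4.01 DTDs are SGML DTDs, and an XML
parser is not expected to read them.
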
Received on Friday, 7 December 2001 06:36:42 UTC