Re: Are the public HTML DTDs valid XML? from Shinichi Matsui on 2001-12-07 (www-html@w3.org from December 2001)

From: Shinichi Matsui <matsui@isl.mei.co.jp>
Date: Fri, 7 Dec 2001 11:13:10 +0900
To: "Christian Wolfgang Hujer" <Christian.Hujer@itcqis.com>, "Ken Klose" <ken.klose@imedium.com>, <www-html@w3.org>
Message-ID: <005b01c17ec4$b9ff7160$be0cb684@isl.mei.co.jp>
Hi Christian,

Thank you very much for your kind and detailed tutorial. Although
there are some points I cannot fully agree with, I think it is very
useful as a whole.

One thing I want to point out is that the W3C Recommendation
"XHTML Basic" does not have a version number. Therefore,
"XHTML Basic 1.0" is not correct, though the DOCTYPE
declaration you wrote (which contains "1.0" and "10") is not wrong.

<http://www.w3.org/TR/xhtml-basic/>

I (as one of the XHTML Basic editors) appreciate that you prefer
XHTML Basic. :-)

Regards,

Shinichi Matsui
matsui@isl.mei.co.jp

"Christian Wolfgang Hujer" <Christian.Hujer@itcqis.com> wrote:


> Hello Ken,
>
>
> as Arnold already said, HTML is an application of SGML, not of XML.
>
>
> If you want to validate your pages using XML parsers, you need to use XHTML
> instead of HTML. HTML itself won't be developed any further anyway, except
> for possible bug fixes in the latest non-XML HTML version, which is HTML
> 4.01, the one you used.
>
>
> To migrate from HTML to XHTML, the future of HTML, follow these simple
> rules:
>
> 0. Terms
> element: something like <body>...</body> or <p>...</p>
> attribute: something like src="..." in <img src="..." />
> tag: start tag, end tag or empty element tag
> start tag: <body> or <table border="border">
> end tag: </body>
> empty element tag: <hr /> or <br clear="all" />
>
> 1. Never omit tags.
> The following is valid HTML, but not valid XHTML:
> <title>My first HTML document</title>
> Hello world!
>
> 2. All names of elements and attributes are lowercase. Write <html> instead
> of <HTML>. This does not apply to <!DOCTYPE, since that is not an HTML
> element but an SGML / XML declaration.
>
> 3. Always use quotes for attributes.
> Do not write <body bgcolor=white>, write <body bgcolor="white">
>
> 4. Always close elements
> Do not write <p>First paragraph<p>Second paragraph, write <p>First
> paragraph</p><p>Second paragraph</p>
>
> 5. Close even empty elements
> Do not write <br>; write empty element tags like this: <br />. Formally you
> could also write <br></br>, but most browsers will then render two line
> breaks, and you could also write <br/>, but Netscape Navigator then won't
> render any line break. So write <br />, <hr />, <img src="..." /> and so on.
>
> 6. Attributes always have values
> Even "boolean" attributes. Write <hr noshade="noshade" /> or <td
> nowrap="nowrap" /> if you use them at all.
>
> 7. Use the appropriate doctype declaration
> Use one of the following:
>
> For XHTML 1.0 Strict, the XML version of HTML 4.01 Strict:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>
> For XHTML 1.0 Transitional, the XML version of HTML 4.01 Transitional:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>
> For XHTML 1.0 Frameset, the XML version of HTML 4.01 Frameset:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
>
> For XHTML Basic 1.0, which is a fairly device-independent version of HTML,
> use:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
> "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
>
> For XHTML 1.1, which is the successor of XHTML 1.0 Strict (Frameset and
> Transitional are not supported anymore):
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
> "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
>
> I personally prefer XHTML Basic 1.0 for most sites, sometimes I use XHTML
> 1.1, rarely XHTML 1.0.
>
>
> 8. Use the appropriate character encoding declaration
> If you only use ASCII characters (those with code points less than or
> equal to 127) you are not required to declare anything.
> If you use some legacy encoding, you have to declare it for XML and, if it
> is not ISO-8859-1, for HTML, like this:
> <?xml version="1.0" encoding="iso-8859-2"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
> "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
> <html xml:lang="pl" xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title>Polski Dokument</title>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" />
> </head>
> <body>
> <p>Polski Dokument using Polish characters encoded not with character
> entities but in iso-8859-2 ("Latin 2").</p>
> </body>
> </html>
>
> If you use UTF-8, declare it for old browsers:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
> "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
> <html xml:lang="de" xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title>Deutsches Dokument</title>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> </head>
> <body>
> <p>Deutsches Dokument using German characters encoded not in character
> entities but in UTF-8.</p>
> </body>
> </html>
>
> I recommend the use of ASCII only and encoding all Unicode characters with
> a character number greater than 159 (128 to 159 are of no interest: they
> are control characters, which HTML does not allow and XML discourages)
> using their corresponding entity or character references, e.g. &uuml; for
> the German u umlaut or &#260; for the Polish A with ogonek.
>
>
> 9. Do not use CDATA-sections
>
> They are the new way of embedding scripts and style sheets:
> <style type="text/css"><![CDATA[
> body {background:white;color:black;}
> ]]></style>
> Do not use this form, though, since most browsers have big problems with
> CDATA sections.
>
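The sample under rule 1 and the "do not write" forms in rules 3 to 6 are all well-formedness errors, so any XML parser can catch them. A minimal sketch, assuming Python's standard-library xml.etree.ElementTree as the parser (it is not used anywhere in this thread); the helper name is_well_formed is hypothetical. It checks well-formedness only: the lowercase rule (2) and the DOCTYPE (7) concern validity against the DTD and are not covered.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as a well-formed XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Pairs taken from the rules above: the HTML habit, then the XHTML form.
samples = {
    "omitted tags (rule 1)": "<title>My first HTML document</title>\nHello world!",
    "unquoted attribute (rule 3)": "<body bgcolor=white></body>",
    "quoted attribute": '<body bgcolor="white"></body>',
    "unclosed p (rule 4)": "<div><p>First paragraph<p>Second paragraph</div>",
    "closed p": "<div><p>First paragraph</p><p>Second paragraph</p></div>",
    "bare br (rule 5)": "<p>line<br>break</p>",
    "empty element tag": "<p>line<br />break</p>",
    "minimized attribute (rule 6)": "<hr noshade />",
    "attribute with value": '<hr noshade="noshade" />',
}

for label, fragment in samples.items():
    print(f"{label}: {is_well_formed(fragment)}")
```

Each HTML-style sample is rejected and each XHTML-style counterpart is accepted.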
>
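The ASCII-only recommendation under rule 8 can be automated. A minimal sketch, assuming Python and its built-in "xmlcharrefreplace" error handler; the function name to_ascii_xml is hypothetical. It emits numeric character references such as &#260;, which need no DTD, rather than named entities such as &uuml;.

```python
def to_ascii_xml(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    # "xmlcharrefreplace" turns each unencodable character into &#NNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# U+0104 is the Polish A with ogonek; U+00FC is the German u umlaut.
print(to_ascii_xml("\u0104"))
print(to_ascii_xml("Gr\u00fc\u00dfe"))
```

Plain ASCII input passes through unchanged, so the function is safe to apply to text that already follows the recommendation.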
> I hope that was helpful.
>
>
> I also recommend the technical reports / recommendations on XML, XHTML 1.0,
> Modularization of XHTML, XHTML 1.1, XHTML Basic 1.0 and the Ruby Module:
> XML: http://www.w3.org/TR/REC-xml
> XHTML 1.0: http://www.w3.org/TR/xhtml1
> XHTML Mod: http://www.w3.org/TR/xhtml-modularization/
> XHTML Basic 1.0: http://www.w3.org/TR/xhtml-basic
> XHTML Ruby: http://www.w3.org/TR/ruby/
> XHTML 1.1: http://www.w3.org/TR/xhtml11/
>
> Explanation:
> XML is what XHTML is based on.
> XHTML 1.0 is the first XML-based version of HTML. The recommendation also
> describes ways to migrate.
> XHTML Mod is a framework for building new versions of HTML. XHTML has been
> split up into several modules which can easily be plugged together to create
> individual versions of HTML.
> XHTML Basic 1.0 is the first module-based version of HTML; it consists of
> all core modules and some small modules like basic tables and basic forms.
> It is ideal for creating device-independent HTML.
> XHTML Ruby is the first extension module. It is for ruby annotations,
> something slightly similar to tables for annotating text, especially, but
> not only, in Asian scripts.
> XHTML 1.1 is the successor of XHTML 1.0 Strict; it is based on XHTML Mod and
> includes Ruby.
>
>
> If you have further questions, you know whom to ask: just write to the list
> or to me.
>
>
> Greetings
>
> Christian Hujer
>
> P.S.:
> If you get parse errors on the XHTML DTD external subsets when using XHTML
> Modularization, it's your parser, not the documents. These are known bugs in
> some XML parsers. But as far as I know, Xerces should be okay and work fine.
>
> P.P.S.: Disclaimer
> No warranty for anything.
>
> > -----Original Message-----
> > From: www-html-request@w3.org [mailto:www-html-request@w3.org]On Behalf
> > Of Ken Klose
> > Sent: Thursday, December 06, 2001 9:09 PM
> > To: www-html@w3.org
> > Subject: Are the public HTML DTDs valid XML?
> >
> >
> > I'm trying to use Xerces (java) to parse the simple HTML document below.
> > I've tried both versions 1.4.4 and 2.0.0b3.
> >
> > <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
> >    "http://www.w3.org/TR/html4/strict.dtd">
> > <HTML>
> >    <HEAD>
> >       <TITLE>My first HTML document</TITLE>
> >    </HEAD>
> >    <BODY>
> >       Hello world!
> >    </BODY>
> > </HTML>
> >
> > Both offer a similar error: "[Fatal Error] strict.dtd:81:5: The declaration
> > for the entity "ContentType" must end with '>'".  Looking at the referenced
> > DTDs http://www.w3.org/TR/html4/strict.dtd and
> > http://www.w3.org/TR/html4/HTMLlat1.ent I see numerous ENTITY declarations
> > with comments intermingled such as:
> >
> > <!ENTITY % ContentType "CDATA"
> >     -- media type, as per [RFC2045]
> >     -->
> >
> > Is this intermingling valid?  If so why would Xerces barf on it?  The XML
> > 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006) mentions in section
> > 2.5 Comments that "[comments] may appear within the document type
> > declaration at places allowed by the grammar" but the grammar for entity
> > declarations defined in 4.2 does not include comments between the
> > opening <! and closing >.
> >
> > Any thoughts?
> >
> > Thanks,
> > Ken Klose
> >
>
>
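Ken's observation about the entity declaration can be reproduced with any XML parser: "-- ... --" comments inside a markup declaration are SGML syntax, which the HTML 4.01 DTDs use, while XML only allows <!-- ... --> comments between declarations. A minimal sketch, assuming Python's standard-library xml.etree.ElementTree rather than the Xerces versions Ken used; the two test documents and the helper name parses are made up for illustration, with the declaration moved into an internal subset.

```python
import xml.etree.ElementTree as ET

# SGML style, as in strict.dtd: the comment sits inside the declaration.
SGML_STYLE = """<!DOCTYPE html [
<!ENTITY % ContentType "CDATA"
    -- media type, as per [RFC2045]
    -->
]>
<html/>"""

# XML style: the comment is a separate <!-- ... --> before the declaration.
XML_STYLE = """<!DOCTYPE html [
<!-- media type, as per [RFC2045] -->
<!ENTITY % ContentType "CDATA">
]>
<html/>"""


def parses(document: str) -> bool:
    """Return True if an XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print("SGML-style comment:", parses(SGML_STYLE))
print("XML-style comment:", parses(XML_STYLE))
```

The parser rejects the first document and accepts the second, which matches the error Ken quotes.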
Received on Thursday, 6 December 2001 21:13:31 UTC