RE: Are the public HTML DTDs valid XML? from Christian Wolfgang Hujer on 2001-12-07 (www-html@w3.org from December 2001)

From: Christian Wolfgang Hujer <Christian.Hujer@itcqis.com>
Date: Fri, 7 Dec 2001 01:20:40 +0100
To: "Ken Klose" <ken.klose@imedium.com>, <www-html@w3.org>
Message-ID: <000001c17eb5$00823c40$0895e23e@andromedacwh>
Hello Ken,


as Arnold already said, HTML is an SGML application, not XML. Its DTDs are
written in SGML syntax, which is why an XML parser such as Xerces rejects
them.


If you want to validate your pages using XML parsers, you need to use XHTML
instead of HTML. HTML itself won't be developed any further, anyway, except
for possible bug-fixes in the latest non-XML HTML version, which is HTML
4.01, the one you used.


To migrate from HTML to XHTML, the future of HTML, follow these simple
rules:

0. Terms
element: something like <body>...</body> or <p>...</p>
attribute: something like src="..." in <img src="..." />
tag: start tag, end tag or empty element tag
start tag: <body> or <table border="1">
end tag: </body>
empty element tag: <hr /> or <br clear="all" />

1. Never omit tags.
The following is valid HTML, but not valid XHTML:
<title>My first HTML document</title>
Hello world!

2. All names of elements and attributes are lowercase. Write <html> instead
of <HTML>. This does not apply to <!DOCTYPE, since that is not an HTML
element but an SGML / XML declaration.

3. Always use quotes for attributes.
Do not write <body bgcolor=white>, write <body bgcolor="white">

4. Always close elements
Do not write <p>First paragraph<p>Second paragraph, write <p>First
paragraph</p><p>Second paragraph</p>

5. Even close empty elements
Do not write <br>, write empty element tags like this: <br />. Formally you
could also write <br></br>, but most browsers then render two line breaks.
You could also write <br/>, but Netscape Navigator then renders no line
break at all. So write <br />, <hr />, <img src="..." /> and so on.

6. Attributes always have values
Even "boolean" attributes. Write <hr noshade="noshade" /> or <td
nowrap="nowrap"> if you use them at all.
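
Rules 3 to 6 are all well-formedness constraints, so any XML parser can
check them for you. A minimal sketch, assuming Python's standard
xml.etree.ElementTree module as the parser; the helper name is_well_formed
and the <div> and <p> wrappers around the fragments are made up for this
example:

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup as a document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Each pair is (HTML-style markup, XHTML-style markup) for one rule.
examples = [
    ('<body bgcolor=white></body>', '<body bgcolor="white"></body>'),  # rule 3
    ('<div><p>First paragraph<p>Second paragraph</div>',
     '<div><p>First paragraph</p><p>Second paragraph</p></div>'),      # rule 4
    ('<p>one<br>two</p>', '<p>one<br />two</p>'),                      # rule 5
    ('<hr noshade />', '<hr noshade="noshade" />'),                    # rule 6
]

for html_style, xhtml_style in examples:
    print(is_well_formed(html_style), is_well_formed(xhtml_style))
```

Each line should print "False True". Note that this only tests
well-formedness; rule 2 (lowercase names) is a matter of validity against
the DTD and is not caught by this check.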

7. Use the appropriate doctype declaration
Use one of the following:

For XHTML 1.0 Strict, the XML version of HTML 4.01 Strict:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

For XHTML 1.0 Transitional, the XML version of HTML 4.01 Transitional:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

For XHTML 1.0 Frameset, the XML version of HTML 4.01 Frameset:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

For XHTML Basic 1.0, which is a fairly device-independent version of HTML,
use:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
"http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">

For XHTML 1.1, which is the successor of XHTML 1.0 Strict (Frameset and
Transitional are not supported anymore):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

I personally prefer XHTML Basic 1.0 for most sites; sometimes I use XHTML
1.1, and only rarely XHTML 1.0.
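
If you generate pages from a program, the five declarations above fit
into a small lookup table. A sketch only: the names DOCTYPES and
doctype_declaration are made up for this example, while the identifiers
are copied from the list above.

```python
# Public and system identifiers copied from the declarations listed above.
DOCTYPES = {
    "XHTML 1.0 Strict": (
        "-//W3C//DTD XHTML 1.0 Strict//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
    ),
    "XHTML 1.0 Transitional": (
        "-//W3C//DTD XHTML 1.0 Transitional//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd",
    ),
    "XHTML 1.0 Frameset": (
        "-//W3C//DTD XHTML 1.0 Frameset//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd",
    ),
    "XHTML Basic 1.0": (
        "-//W3C//DTD XHTML Basic 1.0//EN",
        "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd",
    ),
    "XHTML 1.1": (
        "-//W3C//DTD XHTML 1.1//EN",
        "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd",
    ),
}


def doctype_declaration(version: str) -> str:
    """Build the doctype declaration for one of the versions above."""
    public_id, system_id = DOCTYPES[version]
    return f'<!DOCTYPE html PUBLIC "{public_id}"\n"{system_id}">'


print(doctype_declaration("XHTML Basic 1.0"))
```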


8. Use the appropriate character encoding declaration
If you only use ASCII characters (those with code points less than or
equal to 127) you are not required to declare anything.
If you use some legacy encoding, you have to declare it for XML and, if it
is not ISO-8859-1, for HTML as well, like this:
<?xml version="1.0" encoding="iso-8859-2"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
"http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
<html xml:lang="pl" xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<title>Polski Dokument</title>
		<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" />
	</head>
	<body>
		<p>Polski Dokument using Polish characters encoded not with character
		entities but in iso-8859-2 ("Latin-2").</p>
	</body>
</html>

If you use UTF-8, XML itself needs no encoding declaration, but declare it
in a meta element for old browsers:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
"http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
<html xml:lang="de" xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<title>Deutsches Dokument</title>
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	</head>
	<body>
		<p>Deutsches Dokument using German characters encoded not in character
		entities but in UTF-8.</p>
	</body>
</html>
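
The effect of the XML encoding declaration can be tried out with any XML
parser. A rough sketch, assuming Python's standard xml.etree.ElementTree
module; the one-element test document is made up for this example:

```python
import xml.etree.ElementTree as ET

# U+0104 (A with ogonek) is the single byte 0xA1 in ISO-8859-2
# and a two-byte sequence in UTF-8.
latin2_bytes = "<p>\u0104</p>".encode("iso-8859-2")
utf8_bytes = "<p>\u0104</p>".encode("utf-8")

declared = b'<?xml version="1.0" encoding="iso-8859-2"?>' + latin2_bytes

# Legacy encoding with a declaration: the parser decodes it correctly.
print(ET.fromstring(declared).text == "\u0104")

# UTF-8 is the XML default, so no declaration is needed.
print(ET.fromstring(utf8_bytes).text == "\u0104")

# Legacy bytes without a declaration are read as UTF-8 and rejected.
try:
    ET.fromstring(latin2_bytes)
except ET.ParseError as error:
    print("rejected:", error)
```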

I recommend using ASCII only and writing every Unicode character with a
code point greater than 159 (128 to 159 are of no interest; they are
control characters and should not appear in HTML or XHTML documents anyway)
as a character entity or numeric character reference, e.g. &uuml; for the
German u umlaut or &#260; for the Polish A with "ogonek".
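
This conversion is easy to automate. A minimal sketch in Python, using
the standard "xmlcharrefreplace" error handler; the function name
to_ascii_xhtml is made up for this example. It always produces numeric
references such as &#252; rather than named entities such as &uuml;, and
it does not escape & or <.

```python
def to_ascii_xhtml(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_xhtml("\u00fc"))  # German u umlaut, prints &#252;
print(to_ascii_xhtml("\u0104"))  # Polish A with ogonek, prints &#260;
```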


9. Do not use CDATA sections

Wrapping the content in a CDATA section, as in
<style type="text/css"><![CDATA[
	body {background:white;color:black;}
]]></style>
is the new way of embedding scripts and style sheets, but do not use it,
since most browsers have big problems with CDATA sections.
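
For a style sheet like the one above, leaving the CDATA section out costs
nothing, because the CSS contains neither < nor &. A small sketch,
assuming Python's standard xml.etree.ElementTree module, that compares
what an XML parser sees in both cases:

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    '<style type="text/css"><![CDATA[\n'
    '\tbody {background:white;color:black;}\n'
    ']]></style>'
)
without_cdata = ET.fromstring(
    '<style type="text/css">\n'
    '\tbody {background:white;color:black;}\n'
    '</style>'
)

# The parser hands over the same character data either way.
print(with_cdata.text == without_cdata.text)
```

A script or style sheet that does contain < or & would need those
characters escaped once the CDATA section is removed.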


I hope that was helpful.


I also recommend the technical reports / recommendations on XML, XHTML 1.0,
Modularization of XHTML, XHTML 1.1, XHTML Basic 1.0 and the Ruby Module:
XML: http://www.w3.org/TR/REC-xml
XHTML 1.0: http://www.w3.org/TR/xhtml1
XHTML Mod: http://www.w3.org/TR/xhtml-modularization/
XHTML Basic 1.0: http://www.w3.org/TR/xhtml-basic
XHTML Ruby: http://www.w3.org/TR/ruby/
XHTML 1.1: http://www.w3.org/TR/xhtml11/

Explanation:
XML is what XHTML is based on.
XHTML 1.0 is the first XML-based version of HTML. The recommendation also
describes how to migrate from HTML.
XHTML Mod is a framework for building new versions of HTML. XHTML has been
split up into several modules which can easily be plugged together to create
individual versions of HTML.
XHTML Basic 1.0 is the first module-based version of HTML. It consists of
all core modules and some small modules like basic tables and basic forms.
It is ideal for creating device-independent HTML.
XHTML Ruby is the first extension module. It is for ruby annotations,
short runs of text set alongside the base text to annotate it, used
especially in, but not restricted to, Asian scripts.
XHTML 1.1 is the successor of XHTML 1.0 Strict. It is based on XHTML Mod and
includes Ruby.


If you have further questions, you know whom to ask: just write to the list
or to me.


Greetings

Christian Hujer

P.S.:
If you get parse errors on the XHTML DTD external subsets when using XHTML
Modularization, the fault lies with your parser, not the documents. These
are known bugs in some XML parsers. As far as I know, Xerces should work
fine.

P.P.S.: Disclaimer
No warranty for anything.

> -----Original Message-----
> From: www-html-request@w3.org [mailto:www-html-request@w3.org]On Behalf
> Of Ken Klose
> Sent: Thursday, December 06, 2001 9:09 PM
> To: www-html@w3.org
> Subject: Are the public HTML DTDs valid XML?
>
>
> I'm trying to use Xerces (java) to parse the simple HTML document below.
> I've tried both versions 1.4.4 and 2.0.0b3.
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
>    "http://www.w3.org/TR/html4/strict.dtd">
> <HTML>
>    <HEAD>
>       <TITLE>My first HTML document</TITLE>
>    </HEAD>
>    <BODY>
>       Hello world!
>    </BODY>
> </HTML>
>
> Both offer a similar error: "[Fatal Error] strict.dtd:81:5: The
> declaration for the entity "ContentType" must end with '>'".  Looking at
> the referenced DTDs http://www.w3.org/TR/html4/strict.dtd and
> http://www.w3.org/TR/html4/HTMLlat1.ent I see numerous ENTITY declarations
> with comments intermingled such as:
>
> <!ENTITY % ContentType "CDATA"
>     -- media type, as per [RFC2045]
>     -->
>
> Is this intermingling valid?  If so why would Xerces barf on it?  The XML
> 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006) mentions in section
> 2.5 Comments that "[comments] may appear within the document type
> declaration at places allowed by the grammar" but the grammar for entity
> declarations defined in 4.2 does not include comments between the
> opening <! and closing >.
>
> Any thoughts?
>
> Thanks,
> Ken Klose
>
Received on Thursday, 6 December 2001 19:22:37 UTC