- From: Christian Wolfgang Hujer <Christian.Hujer@itcqis.com>
- Date: Fri, 7 Dec 2001 01:20:40 +0100
- To: "Ken Klose" <ken.klose@imedium.com>, <www-html@w3.org>
Hello Ken,

As Arnold already said, HTML is well-formed SGML, but not XML.
If you want to validate your pages using XML parsers, you need to use
XHTML instead of HTML. HTML itself won't be developed any further anyway,
except for possible bug fixes in the latest non-XML HTML version, which is
HTML 4.01, the one you used.

To migrate from HTML to XHTML, the future of HTML, follow these simple rules:

0. Terms
   element: something like <body>...</body> or <p>...</p>
   attribute: something like src="..." in <img src="..." />
   tag: start tag, end tag or empty element tag
   start tag: <body> or <table border="border">
   end tag: </body>
   empty element tag: <hr /> or <br clear="all" />

1. Never omit tags.
   The following is valid HTML, but not valid XHTML, because the <html>,
   <head> and <body> tags are omitted:
      <title>My first HTML document</title>
      Hello world!

2. All names of elements and attributes are lowercase.
   Write <html> instead of <HTML>.
   This does not apply to <!DOCTYPE, since that is not an HTML element but
   an SGML / XML instruction.

3. Always use quotes for attributes.
   Do not write <body bgcolor=white>, write <body bgcolor="white">.

4. Always close elements.
   Do not write
      <p>First paragraph<p>Second paragraph
   write
      <p>First paragraph</p><p>Second paragraph</p>

5. Close even empty elements.
   Do not write <br>, write empty element tags like this: <br />.
   Formally you could also write <br></br>, but most browsers then render
   two line breaks. You could also write <br/>, but Netscape Navigator then
   won't render a line break at all.
   So write <br />, <hr />, <img src="..." /> and so on.

6. Attributes always have values.
   Even "boolean" attributes. Write <hr noshade="noshade" /> or
   <td nowrap="nowrap" /> if you use them at all.

7.
Use the appropriate doctype declaration
   Use one of the following:

   For XHTML 1.0 Strict, the XML version of HTML 4.01 Strict:
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

   For XHTML 1.0 Transitional, the XML version of HTML 4.01 Transitional:
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

   For XHTML 1.0 Frameset, the XML version of HTML 4.01 Frameset:
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

   For XHTML Basic 1.0, which is a fairly device-independent version of
   HTML, use:
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
        "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">

   For XHTML 1.1, which is the successor of XHTML 1.0 Strict (Frameset and
   Transitional are not supported anymore):
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
        "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

   I personally prefer XHTML Basic 1.0 for most sites; sometimes I use
   XHTML 1.1, rarely XHTML 1.0.

8. Use the appropriate character encoding declaration
   If you only use ASCII characters (those with code numbers less than or
   equal to 127), you are not required to declare anything.
   If you use some legacy encoding, you have to declare it for XML and, if
   it is not ISO-8859-1, for HTML, like this:
      <?xml version="1.0" encoding="iso-8859-2"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
        "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
      <html xml:lang="pl" xmlns="http://www.w3.org/1999/xhtml">
        <head>
          <title>Polski Dokument</title>
          <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" />
        </head>
        <body>
          <p>Polski Dokument using Polish characters encoded not with
          character entities but in iso-8859-2 ("Eastern Latin 1").</p>
        </body>
      </html>

   If you use UTF-8, declare it for old browsers:
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
        "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
      <html xml:lang="de" xmlns="http://www.w3.org/1999/xhtml">
        <head>
          <title>Deutsches Dokument</title>
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        </head>
        <body>
          <p>Deutsches Dokument using German characters encoded not with
          character entities but in UTF-8.</p>
        </body>
      </html>

   I recommend using ASCII only and encoding all Unicode characters with a
   character number greater than 159 (128 to 159 are of no interest; they
   are control characters and should be avoided in documents anyway) using
   their corresponding character entities or numeric character references,
   e.g. &uuml; for the German u umlaut or &#260; for the Polish A with
   "ogonek".

9. Do not use CDATA sections
   They are the new way of encoding scripts and style sheets:
      <style type="text/css"><![CDATA[
        body {background:white;color:black;}
      ]]></style>
   But do not use them, since most browsers have big problems with CDATA
   sections.

I hope that was helpful.

I also recommend the technical reports / recommendations on XML, XHTML 1.0,
Modularization of XHTML, XHTML 1.1, XHTML Basic 1.0 and the Ruby Module:

XML:              http://www.w3.org/TR/REC-xml
XHTML 1.0:        http://www.w3.org/TR/xhtml1
XHTML Mod:        http://www.w3.org/TR/xhtml-modularization/
XHTML Basic 1.0:  http://www.w3.org/TR/xhtml-basic
XHTML Ruby:       http://www.w3.org/TR/ruby/
XHTML 1.1:        http://www.w3.org/TR/xhtml11/

Explanation:
XML is what XHTML is based on.
XHTML 1.0 is the first XML-based version of HTML. The recommendation also
describes ways to migrate.
XHTML Mod is a framework for building new versions of HTML. XHTML has been
split up into several modules which can easily be plugged together to
create individual versions of HTML.
XHTML Basic 1.0 is the first module-based version of HTML. It consists of
all core modules and some small modules like basic tables and basic forms.
It is ideal for creating device-independent HTML.
XHTML Ruby is the first extension module. It is for ruby annotations,
something slightly similar to tables for annotating text, especially, but
not restricted to, Asian writing systems.
XHTML 1.1 is the successor of XHTML 1.0 Strict. It is based on XHTML Mod
and includes Ruby.

If you have further questions, you know whom to ask: just write to the
list or to me.

Greetings
Christian Hujer

P.S.: If you get parse errors on XHTML DTD external subsets when using
XHTML Modularization, it's your parser, not the documents. These are known
bugs in some XML parsers. But as far as I know, Xerces should be okay and
work fine.

P.P.S.: Disclaimer: no warranty for anything.

> -----Original Message-----
> From: www-html-request@w3.org [mailto:www-html-request@w3.org] On Behalf
> Of Ken Klose
> Sent: Thursday, December 06, 2001 9:09 PM
> To: www-html@w3.org
> Subject: Are the public HTML DTDs valid XML?
>
>
> I'm trying to use Xerces (java) to parse the simple HTML document below.
> I've tried both versions 1.4.4 and 2.0.0b3.
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
>     "http://www.w3.org/TR/html4/strict.dtd">
> <HTML>
>    <HEAD>
>       <TITLE>My first HTML document</TITLE>
>    </HEAD>
>    <BODY>
>       Hello world!
>    </BODY>
> </HTML>
>
> Both offer a similar error: "[Fatal Error] strict.dtd:81:5: The declaration
> for the entity "ContentType" must end with '>'". Looking at the referenced
> DTDs http://www.w3.org/TR/html4/strict.dtd and
> http://www.w3.org/TR/html4/HTMLlat1.ent I see numerous ENTITY declarations
> with comments intermingled such as:
>
> <!ENTITY % ContentType "CDATA"
>     -- media type, as per [RFC2045]
>     -->
>
> Is this intermingling valid? If so why would Xerces barf on it?
> The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006) mentions in
> section 2.5 Comments that "[comments] may appear within the document type
> declaration at places allowed by the grammar" but the grammar for entity
> declarations defined in 4.2 does not include comments between the
> opening <! and closing >.
>
> Any thoughts?
>
> Thanks,
> Ken Klose
>
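
A quick way to see why rules 3 to 6 of the reply matter to an XML parser is to feed it both spellings. The following is only a minimal sketch and not part of the original exchange: it assumes Python's standard-library xml.etree.ElementTree parser as a stand-in for Xerces, and the helper name is_well_formed and the sample fragments are made up for illustration. It checks XML well-formedness only, so it says nothing about lowercase names (rule 2) or validity against an XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the fragment as well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Hypothetical fragments: (HTML-style spelling, XHTML-style spelling).
SAMPLES = {
    "rule 3, quoted attributes": (
        '<body bgcolor=white></body>',
        '<body bgcolor="white"></body>',
    ),
    "rule 4, closed elements": (
        '<div><p>First paragraph<p>Second paragraph</div>',
        '<div><p>First paragraph</p><p>Second paragraph</p></div>',
    ),
    "rule 5, empty element tags": (
        '<p>line<br></p>',
        '<p>line<br /></p>',
    ),
    "rule 6, attribute values": (
        '<hr noshade />',
        '<hr noshade="noshade" />',
    ),
}

for rule, (html_style, xhtml_style) in SAMPLES.items():
    print(rule)
    print("  HTML style accepted: ", is_well_formed(html_style))
    print("  XHTML style accepted:", is_well_formed(xhtml_style))
```

A non-validating check like this is enough to catch the syntax-level differences; checking a document against one of the XHTML DTDs would need a validating parser such as Xerces.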
Received on Thursday, 6 December 2001 19:22:37 UTC