- From: Shinichi Matsui <matsui@isl.mei.co.jp>
- Date: Fri, 7 Dec 2001 11:13:10 +0900
- To: "Christian Wolfgang Hujer" <Christian.Hujer@itcqis.com>, "Ken Klose" <ken.klose@imedium.com>, <www-html@w3.org>
Hi Christian,

Thank you very much for your kind and detailed tutorial.
Although there are some points I cannot fully agree with, I think it is
very useful as a whole.

One thing I want to point out is that the W3C Recommendation "XHTML Basic"
does not have a version number. Therefore, "XHTML Basic 1.0" is not
correct, though the DOCTYPE declaration you wrote (which has "1.0" and
"10") is not wrong.
<http://www.w3.org/TR/xhtml-basic/>

I (as one of the XHTML Basic editors) appreciate that you prefer XHTML
Basic. :-)

Regards,

Shinichi Matsui
matsui@isl.mei.co.jp

"Christian Wolfgang Hujer" <Christian.Hujer@itcqis.com> wrote:
> Hello Ken,
>
>
> as Arnold already said, HTML is well-formed SGML, but not XML.
>
>
> If you want to validate your pages using XML parsers, you need to use XHTML
> instead of HTML. HTML itself won't be developed any further, anyway, except
> for possible bug-fixes in the latest non-XML HTML version, which is HTML
> 4.01, the one you used.
>
>
> To migrate from HTML to XHTML, the future of HTML, follow these simple
> rules:
>
> 0. Terms
> element: something like <body>...</body> or <p>...</p>
> attribute: something like src="..." in <img src="..." />
> tag: start tag, end tag or empty element tag
> start tag: <body> or <table border="border">
> end tag: </body>
> empty element tag: <hr /> or <br clear="all" />
>
> 1. Never omit tags.
> The following is valid HTML, but not valid XHTML:
> <title>My first HTML document</title>
> Hello world!
>
> 2. All names of elements and attributes are lowercase. Write <html> instead
> of <HTML>. This does not apply to <!DOCTYPE since that is not an HTML
> element but an SGML / XML declaration.
>
> 3. Always use quotes for attributes.
> Do not write <body bgcolor=white>, write <body bgcolor="white">
>
> 4. Always close elements
> Do not write <p>First paragraph<p>Second paragraph, write <p>First
> paragraph</p><p>Second paragraph</p>
>
> 5. Even close empty elements
> Do not write <br>, write empty element tags like this: <br />. Formally you
> could also write <br></br>, but most browsers will then render two
> newlines, and you could also write <br/>, but Netscape Navigator will then
> render no newline at all. So write <br />, <hr />, <img src="..." /> and
> so on.
>
> 6. Attributes always have values
> Even "boolean" attributes. Write <hr noshade="noshade" /> or
> <td nowrap="nowrap" /> if you use them at all.
>
> 7. Use the appropriate doctype declaration
> Use one of the following:
>
> For XHTML 1.0 Strict, the XML version of HTML 4.01 Strict:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>
> For XHTML 1.0 Transitional, the XML version of HTML 4.01 Transitional:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>
> For XHTML 1.0 Frameset, the XML version of HTML 4.01 Frameset:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
>
> For XHTML Basic 1.0, which is a fairly device-independent version of HTML,
> use:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
> "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
>
> For XHTML 1.1, which is the successor of XHTML 1.0 Strict (Frameset and
> Transitional are not supported anymore):
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
> "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
>
> I personally prefer XHTML Basic 1.0 for most sites, sometimes I use XHTML
> 1.1, rarely XHTML 1.0.
>
>
> 8. Use the appropriate character encoding declaration
> If you only use ASCII characters (those with character numbers less than
> or equal to 127) you are not required to declare anything.
> If you use a legacy encoding, you have to declare it for XML and, if it
> is not ISO-8859-1, for HTML, like this:
> <?xml version="1.0" encoding="iso-8859-2"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
> "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
> <html xml:lang="pl" xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title>Polski Dokument</title>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" />
> </head>
> <body>
> <p>Polski Dokument using Polish characters encoded not with character
> entities but in iso-8859-2 ("Eastern Latin 1").</p>
> </body>
> </html>
>
> If you use UTF-8, declare it for old browsers:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
> "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
> <html xml:lang="de" xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title>Deutsches Dokument</title>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> </head>
> <body>
> <p>Deutsches Dokument using German characters encoded not with character
> entities but in UTF-8.</p>
> </body>
> </html>
>
> I recommend the use of ASCII only and encoding all Unicode characters with a
> character number greater than 159 (128 to 159 are of no interest, they are
> control characters and should not be used in documents anyway) using their
> corresponding character references, e.g. &#252; for the German u umlaut or
> &#260; for the Polish A with "ogonek".
>
>
> 9. Do not use CDATA sections
>
> CDATA sections are the new way of encoding scripts and style sheets:
> <style type="text/css"><![CDATA[
> body {background:white;color:black;}
> ]]></style>
> Do not use them, though, since most browsers have big problems with
> CDATA sections.
>
>
> I hope that was helpful.
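
[Editorial illustration, not part of the original message.] Rules 2 to 6 above can be checked mechanically, because breaking any of them makes a page ill-formed as XML. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample snippets are assumptions made up for this illustration. It tests well-formedness only, so it says nothing about validity against the XHTML DTDs (it will not notice a missing DOCTYPE or an unknown element, for example).

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a (non-validating) XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical snippets: (HTML-style spelling, XHTML-style spelling) per rule.
samples = {
    "2. consistent case":    ('<P>text</p>', '<p>text</p>'),
    "3. quoted attributes":  ('<body bgcolor=white></body>',
                              '<body bgcolor="white"></body>'),
    "4. closed elements":    ('<div><p>First<p>Second</div>',
                              '<div><p>First</p><p>Second</p></div>'),
    "5. empty element tags": ('<p>a<br>b</p>', '<p>a<br />b</p>'),
    "6. attribute values":   ('<hr noshade>', '<hr noshade="noshade" />'),
}

for rule, (html_style, xhtml_style) in samples.items():
    # The HTML-style spelling should be rejected, the XHTML-style one accepted.
    print(rule, is_well_formed(html_style), is_well_formed(xhtml_style))
```

Each HTML-style snippet is rejected by the parser and each XHTML-style snippet is accepted, which is the practical difference the rules describe.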
>
>
> I also recommend the technical reports / recommendations on XML, XHTML 1.0,
> Modularization of XHTML, XHTML 1.1, XHTML Basic 1.0 and the Ruby Module:
> XML: http://www.w3.org/TR/REC-xml
> XHTML 1.0: http://www.w3.org/TR/xhtml1
> XHTML Mod: http://www.w3.org/TR/xhtml-modularization/
> XHTML Basic 1.0: http://www.w3.org/TR/xhtml-basic
> XHTML Ruby: http://www.w3.org/TR/ruby/
> XHTML 1.1: http://www.w3.org/TR/xhtml11/
>
> Explanation:
> XML is what XHTML is based on.
> XHTML 1.0 is the first XML-based version of HTML. The recommendation also
> describes a way to migrate.
> XHTML Mod is a framework for building new versions of HTML. XHTML has been
> split up into several modules which can easily be plugged together to
> create individual versions of HTML.
> XHTML Basic 1.0 is the first module-based version of HTML; it consists of
> all core modules and some small modules like basic tables and basic forms.
> It is ideal for creating device-independent HTML.
> XHTML Ruby is the first extension module. It is for ruby annotations,
> something slightly similar to tables for annotating text, especially, but
> not restricted to, Asian writing systems.
> XHTML 1.1 is the successor of XHTML 1.0 Strict; it is based on XHTML Mod and
> includes Ruby.
>
>
> If you have further questions, you know whom to ask, just write to the list
> or me.
>
>
> Greetings
>
> Christian Hujer
>
> P.S.:
> If you get parse errors on XHTML DTD external subsets when using XHTML
> Modularization, it's your parser, not the documents. These are known bugs in
> some XML parsers. But as far as I know, Xerces should be okay and work fine.
>
> P.P.S.: Disclaimer
> No warranty for anything.
>
> > -----Original Message-----
> > From: www-html-request@w3.org [mailto:www-html-request@w3.org]On Behalf
> > Of Ken Klose
> > Sent: Thursday, December 06, 2001 9:09 PM
> > To: www-html@w3.org
> > Subject: Are the public HTML DTDs valid XML?
> >
> >
> > I'm trying to use Xerces (java) to parse the simple HTML document below.
> > I've tried both versions 1.4.4 and 2.0.0b3.
> >
> > <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
> > "http://www.w3.org/TR/html4/strict.dtd">
> > <HTML>
> > <HEAD>
> > <TITLE>My first HTML document</TITLE>
> > </HEAD>
> > <BODY>
> > Hello world!
> > </BODY>
> > </HTML>
> >
> > Both offer a similar error: "[Fatal Error] strict.dtd:81:5: The
> > declaration for the entity "ContentType" must end with '>'". Looking at
> > the referenced DTDs http://www.w3.org/TR/html4/strict.dtd and
> > http://www.w3.org/TR/html4/HTMLlat1.ent I see numerous ENTITY declarations
> > with comments intermingled such as:
> >
> > <!ENTITY % ContentType "CDATA"
> >     -- media type, as per [RFC2045]
> >     -->
> >
> > Is this intermingling valid? If so why would Xerces barf on it? The XML
> > 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006) mentions in section
> > 2.5 Comments that "[comments] may appear within the document type
> > declaration at places allowed by the grammar" but the grammar for entity
> > declarations defined in 4.2 does not include comments between the
> > opening <! and closing >.
> >
> > Any thoughts?
> >
> > Thanks,
> > Ken Klose
> >
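
[Editorial illustration, not part of the original thread.] The error Ken describes can be reproduced without Xerces. The sketch below swaps in Python's standard expat bindings (an assumption made for this illustration, not the parser used in the thread) and feeds them two tiny documents with an internal DTD subset: one with the SGML-style "-- ... --" comment inside the ENTITY declaration, as in strict.dtd, and one with the comment moved out into an XML comment. The function name parses and both sample documents are hypothetical.

```python
import xml.parsers.expat


def parses(doc: str) -> bool:
    """Return True if expat accepts the document, False on a parse error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(doc, True)  # True: this is the final chunk of input
        return True
    except xml.parsers.expat.ExpatError:
        return False


# SGML-style: the comment sits inside the declaration, as in the HTML 4.01 DTD.
sgml_style = """<!DOCTYPE html [
<!ENTITY % ContentType "CDATA"
    -- media type, as per [RFC2045]
    -->
]>
<html/>"""

# XML-style: the comment is a separate <!-- ... --> construct.
xml_style = """<!DOCTYPE html [
<!-- media type, as per [RFC2045] -->
<!ENTITY % ContentType "CDATA">
]>
<html/>"""

print("SGML-style declaration accepted:", parses(sgml_style))
print("XML-style declaration accepted: ", parses(xml_style))
```

The first document is rejected and the second is accepted, which matches Ken's reading of section 4.2 of the XML spec: an XML entity declaration leaves no room for a comment before the closing ">", while SGML DTDs such as the HTML 4.01 ones allow it.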
Received on Thursday, 6 December 2001 21:13:31 UTC