Comments on WD-xhtml1-20011004 from Bjoern Hoehrmann on 2001-10-07 (www-html-editor@w3.org from October to December 2001)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sun, 07 Oct 2001 21:21:09 +0200
To: www-html@w3.org
Cc: www-html-editor@w3.org
Message-ID: <1fn0st04c0c88g6b14ht9smeddfbo6m5pe@4ax.com>

Hi,

   Some comments on the latest XHTML 1.0 Second Edition Working Draft
(http://www.w3.org/TR/2001/WD-xhtml1-20011004/)

| W3C Working Draft 4 October 2001
| 
|    This version:
|           http://www.w3.org/TR/2001/WD-xhtml1-20011004
|           (Postscript version, PDF version, ZIP archive, or
|           Gzip'd TAR archive)
|           
|    Latest version:
|           http://www.w3.org/TR/xhtml1

This is currently XHTML 1.0 First Edition; calling it the latest version
of the draft is misleading.

|    Authors:
|           See acknowledgments.

No Editor?

| Abstract
| 
|    This specification defines the Second Edition of XHTML 1.0, a
|    reformulation of HTML 4 as an XML 1.0 application, and three
|    DTDs corresponding to the ones defined by HTML 4. The
|    semantics of the elements and their attributes are defined in
|    the W3C Recommendation for HTML 4. These semantics provide the
|    foundation for future extensibility of XHTML. Compatibility
|    with existing HTML user agents is possible by following a
|    small set of guidelines.

Yes, note "*existing* HTML user agents" (emphasis added). The document
states several times that HTML4 conforming user agents have no problems
rendering XHTML documents. This is certainly not true. Consider e.g.
<br />: an HTML4 _conforming_ user agent treats the trailing slash as
character data. Existing HTML user agents are not conforming HTML4 user
agents.

| Status of this document

|    This document is the second edition of the XHTML 1.0
|    specification incorporating the errata changes as of 4 October
|    2001.

I would still like to have an errata document for XHTML 1.0...

|                            1. What is XHTML?

|    The XHTML family is the next step in the evolution of the
|    Internet. By migrating to XHTML today, content developers can
|    enter the XML world with all of its attendant benefits, while
|    still remaining confident in their content's backward and
|    future compatibility.

Not really. If content providers want backward compatibility, they have
to deliver XHTML as text/html, so the XHTML document is parsed as HTML
tag soup and not by an XML processor...

| 1.3 Why the need for XHTML?
| 
|    The benefits of migrating to XHTML 1.0 are described above.
|    Some of the benefits of migrating to XHTML in general are:
|      * Document developers and user agent designers are
|        constantly discovering new ways to express their ideas
|        through new markup. In XML, it is relatively easy to
|        introduce new elements or additional element attributes.
|        The XHTML family is designed to accommodate these
|        extensions through XHTML modules and techniques for
|        developing new XHTML-conforming modules (described in the
|        forthcoming XHTML Modularization specification). These
|        modules will permit the combination of existing and new
|        feature sets when developing content and when designing
|        new user agents.

I don't think this anachronism does readers of this specification any
favors; XHTML m12n is no longer "forthcoming"...

|   3.1.1 Strictly Conforming Documents
|   
|    A Strictly Conforming XHTML Document is a document that
|    requires only the facilities described as mandatory in this
|    specification. Such a document must meet all of the following
|    criteria:
|     1. It must conform to the constraints expressed in one of the
|        three DTDs found in Appendix A.
|     2. The root element of the document must be <html>.

Item 1 includes this constraint.

|     3. The root element of the document must designate the XHTML
|        namespace using the xmlns attribute [XMLNAMES]. The
|        namespace for XHTML is defined to be
|        http://www.w3.org/1999/xhtml.

It is not clear whether the attribute specification must be added
explicitly or whether it is okay to rely on the #FIXED definition of
this attribute in one of the three DTDs. If it is okay, this item is
implied by item 4.

|     5. The DTD subset must not be used to override any parameter
|        entities in the DTD.

So I may use the internal subset to define my own entities in strictly
conforming XHTML documents? E.g.

  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [
    
      <!ENTITY myURI "http://www.websitedev.de/xhtml/xhtml1/"> 
    
  ]>
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <title></title>
  </head>
  <body>
  <p><a href="&myURI;">&myURI;</a></p>
  </body>
  </html>
  
If so, an additional compatibility guideline is necessary, stating that
XHTML documents intended to be compatible with existing HTML user agents
must not define new entities. In general, a guideline is necessary to
disallow internal subsets in compatible XHTML documents, since existing
user agents render the closing "]>" and probably more, depending on what
the subset contains.
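
A minimal sketch of the XML side of this, assuming Python and its
standard-library xml.etree.ElementTree as a stand-in for a
non-validating XML processor (neither is named by the draft). The
external identifier is left out of the DOCTYPE here so the sketch does
not depend on fetching the DTD; the entity declared in the internal
subset is expanded in both the attribute value and the element content:

```python
import xml.etree.ElementTree as ET

# Same document shape as the example above, minus the external
# identifier, so nothing has to be fetched from the network.
DOC = """<!DOCTYPE html [
  <!ENTITY myURI "http://www.websitedev.de/xhtml/xhtml1/">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<p><a href="&myURI;">&myURI;</a></p>
</body>
</html>"""

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

root = ET.fromstring(DOC)
link = root.find(f".//{XHTML_NS}a")

# The XML processor has replaced both entity references.
print(link.get("href"))
print(link.text)
```

A tag soup parser has no such declaration to consult, which is why the
same document behaves differently when served as text/html.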

|    Here is an example of a minimal XHTML document:
|    
| <?xml version="1.0" encoding="UTF-8"?>
| <!DOCTYPE html
|      PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
| <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
|   <head>
|     <title>Virtual Library</title>
|   </head>
|   <body>
|     <p>Moved to <a href="http://vlib.org/">vlib.org</a>.</p>
|   </body>
| </html>

The word "minimal" is ill-chosen; the minimal XHTML document would be
something like

  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html>
    <head>
      <title/>
    </head>
    <body/>
  </html>
  
Note that the body element in XHTML has a different content model than
the body element in HTML4: HTML4 requires it to have some child. XHTML
1.0 does not list this change in section 4.

| 3.2 User Agent Conformance
| 
|    A conforming user agent must meet all of the following
|    criteria:

|     3. When a user agent processes an XHTML document as generic
|        XML, it shall only recognize attributes of type ID (e.g.
|        the id attribute on most XHTML elements) as fragment
|        identifiers.

This item should be removed; processing an XHTML document without XHTML
rules conflicts with the user agent conformance requirements themselves.

|     4. If a user agent encounters an element it does not
|        recognize, it must process the element's content.

This item conflicts with validating user agents that stop normal
processing of the document on such errors. It makes no sense in general
if it is meant to cover only strictly conforming XHTML documents, or if
it does not consider XHTML documents mixing multiple namespaces. It
should read "... encounters an element in the XHTML namespace ..." or be
removed. It also fails to say what "processing" means in this context
(what to do with the element regarding style sheets, the DOM, etc.).

|     5. If a user agent encounters an attribute it does not
|        recognize, it must ignore the entire attribute
|        specification (i.e., the attribute and its value).
|     6. If a user agent encounters an attribute value it does not
|        recognize, it must use the default attribute value.

... if there is a specified default value. I think it would be more
appropriate to combine items 5 and 6 so that the attribute specification
is always ignored (using the default value amounts to the same thing but
does not take possibly inherited properties into account).

|     7. If it encounters an entity reference (other than one of
|        the predefined entities) for which the User Agent has
|        processed no declaration (which could happen if the
|        declaration is in the external subset which the User Agent
|        hasn't read), the entity reference should be processed as
|        the characters (starting with the ampersand and ending
|        with the semi-colon) that make up the entity reference.

This could also happen if the declaration is included in some other
external entity.
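
The situation item 7 describes can be reproduced with a small sketch,
assuming Python's xml.etree.ElementTree as one example of a parser that
does not read the external subset (an illustrative choice, not
something the draft prescribes). Note that this particular parser
reports an error instead of passing the characters of the reference
through:

```python
import xml.etree.ElementTree as ET

# &nbsp; is declared only in the external subset, which this
# non-validating parser does not read.
DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body><p>a&nbsp;b</p></body>
</html>"""


def try_parse(text):
    """Return the parser's error message, or None if parsing succeeded."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


# The named entity reference has no processed declaration.
error = try_parse(DOC)
print(error)

# A numeric character reference needs no declaration at all.
print(try_parse(DOC.replace("&nbsp;", "&#160;")))
```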

|     8. When processing content, User Agents that encounter
|        characters or character entity references that are
|        recognized but not renderable should display the document
|        in such a way that it is obvious to the user that normal
|        rendering has not taken place.

Thus rendering the euro currency sign using the string "EUR" is a
violation of this conformance rule, since it is not obvious to the
user that normal rendering has not taken place?

|        The user agent must process white space characters in the
|        data received from the XML processor as follows:
|           + After XML end of line normalization, white space
|             characters must not be removed from the XHTML Infoset
|             as they may be processed subsequently by style
|             language processors. Terms like 'removed',
|             'preserved', 'converted' and 'reduced' used in the
|             next clauses only apply to rendering consideration by
|             the user agent in the absence of additional
|             processing by a style language processor.
|           + All white space surrounding block elements should be
|             removed.

This is not possible to implement; HTML4 and XHTML 1.0 do not define
which elements are block elements.

|        In determining how to convert a LINE FEED character a user
|        agent must meet the following rules, whereby the script of
|        characters on either side of the LINE FEED determines the
|        choice of the replacement. The assignment of script names
|        to all characters is done in accordance to the Unicode
|        [UNICODE] technical report TR#24 (Script Names).

There should be a reference to UTR 24.

There is one white-space rule missing: in HTML4, line feed characters
immediately following a start-tag and immediately preceding an end-tag
must be removed.

|    Note that in order to produce a Canonical XHTML document, the
|    rules above must be applied and the rules in [XMLC14N] must
|    also be applied to the document.

Make this: "Note that in order to produce a Canonical XHTML document,
the above rules and the rules in [XMLC14N] must be applied to the
document."

| 4.7 White Space handling in attribute values
| 
|    In attribute values, user agents will strip leading and
|    trailing white space from attribute values and map sequences
|    of one or more white space characters (including line breaks)
|    to a single inter-word space (an ASCII space character for
|    western scripts). See Section 3.3.3 of [XML].

I recommend changing this to "...to a single space character." for
clarity. At the least, characters should be expressed in Unicode terms
rather than US-ASCII terms.
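
For reference, a minimal sketch of what the attribute-value
normalization in Section 3.3.3 of [XML] does at the parser level,
assuming Python's xml.etree.ElementTree and an attribute for which no
declaration has been read (so it is treated as CDATA). Each literal
white space character is mapped to U+0020 SPACE; stripping and
collapsing are not part of this step:

```python
import xml.etree.ElementTree as ET

# Literal newline and tab inside an attribute value; no DTD is read, so
# the attribute is treated as CDATA.
DOC = '<p title="  Sam\n\tMax  "/>'

value = ET.fromstring(DOC).get("title")

# Each white space character became U+0020; nothing was stripped or
# collapsed by the XML processor itself.
print(repr(value))
```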

| C.1 Processing Instructions and the XML Declaration
| 
|    Be aware that processing instructions are rendered on some
|    user agents. However, also note that when the XML declaration
|    is not included in a document, the document can only use the
|    default character encodings UTF-8 or UTF-16.

... unless the encoding is determined by higher level protocol
information. Make this:

  "Be aware that processing instructions and the XML declaration are
   incorrectly rendered on some user agents. However, also note that in
   absence of higher-level protocol information, the document can only
   use the default encodings UTF-8 or UTF-16."
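
A minimal sketch of the difference, assuming Python's
xml.etree.ElementTree; the transport_charset parameter is a
hypothetical stand-in for whatever the higher-level protocol supplies
(for instance a charset parameter in an HTTP header):

```python
import xml.etree.ElementTree as ET

# An ISO-8859-1 byte stream with no XML declaration.
RAW = "<p>caf\u00e9</p>".encode("iso-8859-1")


def parse_bytes(data, transport_charset=None):
    """Parse data; transport_charset models externally supplied encoding."""
    parser = ET.XMLParser(encoding=transport_charset)
    parser.feed(data)
    return parser.close()


# No declaration and no external information: the parser assumes UTF-8
# and rejects the byte 0xE9.
try:
    parse_bytes(RAW)
    failed_without_hint = False
except ET.ParseError:
    failed_without_hint = True

# The same bytes parse once the encoding comes from outside the document.
text_with_hint = parse_bytes(RAW, "iso-8859-1").text

print(failed_without_hint, text_with_hint)
```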

| C.4 Embedded Style Sheets and Scripts
| 
|    Use external style sheets if your style sheet uses < or & or
|    ]]> or --. Use external scripts if your script uses < or & or
|    ]]> or --. Note that XML parsers are permitted to silently
|    remove the contents of comments. Therefore, the historical
|    practice of "hiding" scripts and style sheets within
|    "comments" to make the documents backward compatible is likely
|    to not work as expected in XML-based implementations.

The sequence '--' would only matter if the script content is a comment.
XHTML user agents must not interpret scripts inside comments (in fact,
they must be removed on parsing, thus XHTML user agents are not only
permitted to remove them), this would fail horribly in cases like

  <script type='text/ecmascript'>
  <!-- script introduction -->
  <![CDATA[
    /* ... */
  ]]>
  <!-- Copyright (c) by someone Inc. -->
  </script>

So mentioning "--" is misleading.
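
A minimal sketch of the three cases, assuming Python's
xml.etree.ElementTree with its default tree builder (which discards
comments); the script bodies are made up for illustration:

```python
import xml.etree.ElementTree as ET


def script_text(markup):
    """Return the character data an XML parser reports for the element."""
    return "".join(ET.fromstring(markup).itertext())


# "--" inside a CDATA section is ordinary character data.
IN_CDATA = "<script><![CDATA[ i--; ]]></script>"

# Script "hidden" in a comment: the default tree builder drops it.
IN_COMMENT = "<script><!-- i = 1; //--></script>"

# "--" inside a comment is not well-formed XML in the first place.
DASHES_IN_COMMENT = "<script><!-- i--; //--></script>"

cdata_text = script_text(IN_CDATA)
comment_text = script_text(IN_COMMENT)

try:
    script_text(DASHES_IN_COMMENT)
    dashes_ok = True
except ET.ParseError:
    dashes_ok = False

print(repr(cdata_text), repr(comment_text), dashes_ok)
```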

The specification lacks a discussion of CDATA sections for compatible
XHTML documents. CDATA sections are not compatible with existing user
agents; authors should avoid using them, especially in this context.

| C.8 Fragment Identifiers
| 
|    In XML, URI-references [RFC2396] that end with fragment
|    identifiers of the form "#foo" do not refer to elements with
|    an attribute name="foo"; rather, they refer to elements with
|    an attribute defined to be of type ID,

Hm, I wonder whether compatibility issues arise for XPointers with this
point of view, one cannot define

  <h1 id='xpointer(//*)' />

or something like that...

| C.9 Character Encoding
| 
|    To specify a character encoding in the document, use both the
|    encoding attribute specification on the xml declaration (e.g.
|    <?xml version="1.0" encoding="EUC-JP"?>) and a meta http-equiv
|    statement (e.g. <meta http-equiv="Content-type"
|    content='text/html; charset=EUC-JP' />). The value of the
|    encoding attribute of the xml declaration takes precedence.

I would like to see a note here encouraging authors to specify the
encoding in the HTTP header.

| C.11 Document Object Model and XHTML

I've asked what the WG intends to do with the DOM for XHTML
documents. I didn't get any answer...

| C.12 Using Ampersands in Attribute Values
| 
|    When an attribute value contains an ampersand, it must be
|    expressed as a character entity reference (e.g. "&amp;"). For
|    example, when the href attribute of the a element refers to a
|    CGI script that takes parameters, it must be expressed as
|    http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=us
|    er rather than as
|    http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.

This is no compatibility issue; HTML4 has the same rules. A difference
between XHTML and HTML is the fact that ampersands in general have to
be expressed using character references, while HTML had looser rules,
e.g. <abbr title="Sam & Max">... is legal in HTML but not in XHTML.
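
A minimal sketch of that difference, assuming Python's standard
library. Note that html.parser is a lenient tag soup parser rather than
a conforming HTML4 (SGML) parser, so it only approximates the HTML
side; AttrCollector is a helper made up for this illustration:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

MARKUP = '<abbr title="Sam & Max">S&amp;M</abbr>'


class AttrCollector(HTMLParser):
    """Record the attributes of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


# Lenient HTML parsing keeps the bare ampersand as data.
collector = AttrCollector()
collector.feed(MARKUP)
collector.close()

# An XML parser rejects the same markup as not well-formed.
try:
    ET.fromstring(MARKUP)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

print(collector.attrs, xml_ok)
```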

| C.13 Cascading Style Sheets (CSS) and XHTML

|     1. CSS style sheets for XHTML should use lower case element
|        and attribute names.

In order to match, this is a must.

| C.14 Referencing Style Elements when serving as XML
| 
|    In HTML 4 and XHTML, the style element can be used to define
|    document-internal style rules. In XML, an XML stylesheet
|    declaration is used to define style rules. In order to be
|    compatible with this convention, style elements should have
|    their fragment identifier set using the id attribute, and an
|    XML stylesheet declaration should reference this fragment. For
|    example:

There is no reference to the xml-stylesheet processing instruction
recommendation. Since this is supposed to be a compatibility guideline,
where is the guideline on using the xml-stylesheet processing
instruction?

| <?xml-stylesheet href="http://www.w3.org/StyleSheets/TR/W3C-REC.css" ty
| pe="text/css"?>
| <?xml-stylesheet href="#internalStyle" type="text/css"?>
| <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
|   <head>
|     <title>An internal stylesheet example</title>
|     <style id="internalStyle">

"Missing required attribute 'type' for element 'style'"...

|       code {
|         color: green;
|         font-family: monospace;
|         font-weight: bold;
|       }

Always specify both foreground *and* background colors at the same
level of specificity.

Since it seems somehow possible for conforming XHTML user agents to
parse XHTML documents as generic XML (this conflicts, however, with the
user agent conformance rules, but I didn't invent section 3.2.3...), I
miss a discussion of this here. If an XHTML document is being processed
as generic XML, style sheets need a lot of 'display' definitions for the
elements. To avoid absurdity, I recommend removing item 3 in section
3.2.

| C.15 White Space Characters in HTML vs. XML
| 
|    Some characters that are legal in HTML documents, are illegal
|    in XML document. For example, in HTML, the Formfeed character
|    (U+000C) is treated as white space, in XHTML, due to XML's
|    definition of characters, it is illegal.

The character is named "FORM FEED"; it's the only character allowed in
HTML and forbidden in XML. This is a difference to HTML 4, not a
compatibility guideline. There are a lot of characters forbidden in
HTML4 and allowed in XML 1.0 (e.g. C1 control characters); what is their
state in XHTML 1.0? This information should be added.
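
A minimal sketch of both directions, assuming Python's
xml.etree.ElementTree as the XML 1.0 parser; U+0085 is picked here as
one example of a control character in the U+0080 to U+009F range:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the XML parser accepts text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# FORM FEED (U+000C) is outside XML 1.0's Char production.
form_feed_ok = is_well_formed("<p>a\u000cb</p>")

# NEXT LINE (U+0085) is a legal XML 1.0 character.
c1_ok = is_well_formed("<p>a\u0085b</p>")

print(form_feed_ok, c1_ok)
```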

regards,
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Received on Sunday, 7 October 2001 15:22:18 UTC