XHTML from Rick Jelliffe on 1999-05-21 (www-html@w3.org from May 1999)

From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
Date: Fri, 21 May 1999 00:08:48 -0400 (EDT)
To: <www-html@w3.org>
Cc: "w3c i18n ig" <w3c-i18n-ig@w3.org>
Message-ID: <002d01bea33e$1bb06f00$dd066d8c@sinica.edu.tw>

Here are some comments on XHTML.

1)  XHTML 1.0 Last Call Disposition of Comments,
http://www.w3.org/MarkUp/Group/1999/xhtml1-lc1-doc-19990506.html,
3.5 LanguageCode Parameter Entity, leaves the language code as CDATA. I
would be very interested if you (or Mischa or Martin, if they know) would
post a little note to w3c-i18n-ig@w3.org to explain this.  CDATA indicates
that any characters are allowed, rather than only values that follow RFC 1766.

Suggestion: Declare lang as NMTOKENS. This is more forgiving than NMTOKEN
and allows some variant and incorrect use, but will not mislead programmers
and educators into thinking that any old text is allowed.
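
To illustrate the difference between the three levels of strictness, here is
a minimal Python sketch. It is not part of the XHTML draft: the function and
pattern names are made up for this example, and the NMTOKEN pattern is an
ASCII-only approximation of the XML name-token production (the real one also
admits many non-ASCII characters).

```python
import re

# RFC 1766: a primary tag plus optional subtags, each 1 to 8 ASCII letters.
RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")

# ASSUMPTION: ASCII-only approximation of an XML name token.
NMTOKEN = re.compile(r"[A-Za-z0-9_.:-]+")


def is_rfc1766(value: str) -> bool:
    """True if the whole value is a syntactically valid RFC 1766 tag."""
    return RFC1766_TAG.fullmatch(value) is not None


def is_nmtokens(value: str) -> bool:
    """True if the value is one or more whitespace-separated name tokens."""
    tokens = value.split()
    return bool(tokens) and all(NMTOKEN.fullmatch(t) for t in tokens)


def classify_lang(value: str) -> str:
    """Report the strictest of the three declarations the value satisfies."""
    if is_rfc1766(value):
        return "rfc1766"
    if is_nmtokens(value):
        return "nmtokens-only"
    return "cdata-only"
```

Under these assumptions "zh-TW" is a proper RFC 1766 tag, "en_US" and
"en us" are variant forms that NMTOKENS would still accept, and
"any old text!" is accepted only by CDATA.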


2) XHTML 1.0 Last Call Disposition of Comments,
http://www.w3.org/MarkUp/Group/1999/xhtml1-lc1-doc-19990506.html,
s2.1.6 SGML Newline Handling Requirements, says

  Resolution: The document relies upon XML for its definition of whitespace
handling. This includes handling of line boundaries. No change to the
document is required

However, XML 1.0 only allows xml:space values of "preserve" or "default",
where the "default" behaviour is left undefined, though it may be the SGML
behaviour. Because of this, the XHTML draft cannot rely on XML for its
definition of whitespace handling for "default".

Suggestion: XHTML should follow SGML.


3) Netscape 4.6 still does not support hexadecimal numeric character
references in HTML.

Suggestion: Add a caution that Hexadecimal Numeric Character References
may not be backward compatible with HTML browsers.
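
As an illustration of one possible workaround (not something the draft
proposes), a document could be pre-processed so that hexadecimal references
become decimal ones, which older HTML browsers do understand. The function
name is hypothetical, and this sketch rewrites every match blindly, including
any that happen to sit inside comments or CDATA marked sections.

```python
import re

# XML spells hexadecimal references with a lowercase "x": &#x4E2D;
HEX_NCR = re.compile(r"&#x([0-9A-Fa-f]+);")


def hex_to_decimal_ncrs(markup: str) -> str:
    """Rewrite hexadecimal numeric character references as decimal ones."""
    return HEX_NCR.sub(lambda m: "&#%d;" % int(m.group(1), 16), markup)
```

For example, hex_to_decimal_ncrs("&#x4E2D;") gives "&#20013;", while decimal
references and named entities are left untouched.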


4) Deployed HTML browsers do not gracefully handle the following parts of XML:
  * Internal subset, and hence entities, notations, additional attribute
declarations;
  * Hexadecimal character references (see 3 above);
  * CDATA marked sections;
  * PIs;
  * hence, the XML header (which, as the inventor of it, let me say IMHO
*is* a PI, since it provides information to software on how to process
something, starting at a point: in the same way, all markup declarations
are PIs);
  * hence, selecting encoding with the XML header.

By trying to find a subset of XML which is fairly acceptable to HTML
browsers,  there is a grave danger of setting a course which will disrupt
XML.  Of course XML was developed to overcome many of the perceived problems
of HTML, not to perpetuate them.

It worries me a little that the pressure to find this subset may be so strong
that HTML-browser compatibility becomes a criterion for judging particular
XML features. This has already happened to some extent with PIs (notably the
use of attributes for Namespace declarations rather than a PI in the header:
in that case I think it was a fair call, in that namespaces hang off names,
which hang off attributes; they do not hang off documents, entities or random
points, which is where PIs are appropriate).

Recommendation:  The XHTML effort should split into three parts:

 * XHTML, a version of HTML 4.0 which allows all XML features and any new
W3C technology. Application vendors should be encouraged to support this. It
should have the MIME media type text/xhtml-xml (the "-xml" suffix is one of
the current suggestions finding favour in the MIME XML group). XHTML should
add one requirement beyond XML: the XML header should be mandatory.

 * WFHTML, an interim version of XHTML which is compatible with generation 4
and 5 browsers. Users should be encouraged to use this HTML syntax. It
should have the MIME media type text/html.  WFHTML differs from XML in the
following ways:
  i) it only uses elements, data, comments, NCRs, the DOCTYPE declaration
and the encoding PI;
  ii) WF errors do not halt parsing;
  iii) the XML header is not mandatory;
  iv) encoding should be determined by the MIME charset; if that is not
available the encoding attribute in the XML header may be used; if that is
not available, the META tag may be used; if that is not available, guessing
may be used.

 * An XHTML-to-WFHTML transformation recommendation. Webservers should
support content-negotiation of XHTML or (WF)HTML. If a document is available
as XHTML but not as (WF)HTML, then some on-the-fly, server-side
transformation may be provided: a simple application or an XSL stylesheet
for example.  In other words, transformation from XHTML to WFHTML should be
transparent to users and to creators of XHTML data. In particular, the
transformation should place PIs in comments: <?xml version="1.0"?>
should become <!--<?xml version="1.0"?>-->.  This would also discourage the
deployment of processors which only accept XML subsets: a disastrous
development.
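
To make two of the rules above concrete, here is a rough Python sketch of the
PI-to-comment step and of the encoding fallback order proposed for WFHTML.
The function and parameter names are invented for this example, and the PI
pattern is deliberately naive: it does not skip PIs that already sit inside
comments or CDATA marked sections, and a PI whose text contains "--" would
produce an ill-formed comment.

```python
import re

# Naive pattern for a processing instruction, XML header included.
PI = re.compile(r"<\?.*?\?>", re.DOTALL)


def comment_out_pis(markup: str) -> str:
    """Wrap each PI in a comment so that HTML browsers ignore it."""
    return PI.sub(lambda m: "<!--" + m.group(0) + "-->", markup)


def choose_encoding(mime_charset, header_encoding, meta_charset, guessed):
    """Pick an encoding in the proposed order of precedence.

    Each argument is a string, or None when that source is unavailable:
    MIME charset, then the XML header, then the META tag, then a guess.
    """
    for candidate in (mime_charset, header_encoding, meta_charset):
        if candidate:
            return candidate
    return guessed
```

For instance, comment_out_pis('<?xml version="1.0"?><html/>') yields
'<!--<?xml version="1.0"?>--><html/>', and a document served without a MIME
charset but with encoding="UTF-8" in its XML header resolves to UTF-8 even if
a META tag says otherwise.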

It seems to me that, even though this *seems* complicated, it is the only
way to reconcile all the different requirements. Furthermore, I think it is
technologically sound and practical from a deployment point of view: at most
it requires registration of an XHTML handler with webservers, not with
browsers (and that only when XHTML documents are not pre-transformed into
WFHTML and content negotiation is used for delivery).

This allows people to move to "real" XML, rather than a watered-down
version.


Rick Jelliffe
Academia Sinica Computing Centre
Taipei, Taiwan
Received on Friday, 21 May 1999 05:13:34 UTC