XHTML1: Implementability of Appendix C from Bjoern Hoehrmann on 2004-07-12 (www-html-editor@w3.org from July to September 2004)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Mon, 12 Jul 2004 06:14:18 +0200
To: www-html-editor@w3.org
Message-ID: <418f1019.1817856930@smtp.bjoern.hoehrmann.de>
Dear HTML Working Group,

  RFC 2854 states in section 2,

[...]
  The text/html media type is now defined by W3C Recommendations; the
  latest published version is [HTML401]. In addition, [XHTML1] defines
  a profile of use of XHTML which is compatible with HTML 4.01 and which
  may also be labeled as text/html.
[...]

Section 5.1 of the XHTML 1.0 Second Edition Recommendation states:

[...]
  XHTML Documents which follow the guidelines set forth in Appendix C,
  "HTML Compatibility Guidelines" may be labeled with the Internet Media
  Type "text/html" [RFC2854], as they are compatible with most HTML
  browsers.
[...]

Section 3.1 of the XHTML Media Types Note states:

[...]
  [XHTML1], Appendix C "HTML Compatibility Guidelines" summarizes design
  guidelines for authors who wish their XHTML documents to render on
  existing HTML user agents. The use of 'text/html' for XHTML SHOULD be
  limited for the purpose of rendering on existing HTML user agents, and
  SHOULD be limited to [XHTML1] documents which follow the HTML
  Compatibility Guidelines.
[...]

So it seems crystal clear to me that this Appendix C of the XHTML 1.0
Second Edition Recommendation defines clear conformance criteria for
data objects which I would expect to be reliably machine-testable. It
turns out, however, that a number of sections of this appendix do not
deal with such conformance criteria, starting with Appendix C.1:

[...]
  Be aware that processing instructions are rendered on some user
  agents. Also, some user agents interpret the XML declaration to mean
  that the document is unrecognized XML rather than HTML, and therefore
  may not render the document as expected. For compatibility with these
  types of legacy browsers, you may want to avoid using processing
  instructions and XML declarations. Remember, however, that when the
  XML declaration is not included in a document, the document can only
  use the default character encodings UTF-8 or UTF-16.
[...]

These appear to be at best criteria for authors, i.e., only authors
aware of this problem may deliver XHTML documents to legacy user agents.
So it seems I might misunderstand the purpose of the Appendix and all
the documents that refer to it, which seems a bit odd. Ignoring the
sections that seem misplaced, the remaining sections are often not clear
about what the actual requirements are, or what the exact requirement
level is. Some sections use RFC 2119 keywords such as SHOULD and MUST,
some use loose imperative statements such as "avoid". It is not clear
to me how to map these statements into a precise error report, i.e.,
what maps into clear errors, warnings or something looser such as an
informational hint. It also seems inconsistent that you reference the
appendix as defining a profile, and yet state that the appendix is
informative.
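
To illustrate the mapping problem, here is a minimal Python sketch of a
check for the C.1 concerns. The function name check_c1 and the "hint"
severity are assumptions made for this example only; Appendix C defines
neither, which is exactly the difficulty described above. The regular
expressions are deliberately naive and would also match inside comments
or CDATA sections.

```python
import re

# Hypothetical severity label; Appendix C does not define severities.
HINT = "hint"

# XML declaration at the very start of the document (optional BOM first).
XML_DECL = re.compile(r"^\ufeff?<\?xml\s[^>]*\?>")
# Start of any processing instruction; group 1 is the PI target.
PI = re.compile(r"<\?([A-Za-z_:][\w:.-]*)")


def check_c1(source: str) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for the C.1 concerns."""
    findings = []
    if XML_DECL.match(source):
        findings.append((HINT, "XML declaration present"))
    for match in PI.finditer(source):
        target = match.group(1)
        # The XML declaration is not a processing instruction.
        if target.lower() != "xml":
            findings.append(
                (HINT, f"processing instruction '{target}' present")
            )
    return findings
```

Whether such findings should be errors, warnings, or hints is the open
question; the sketch picks "hint" only because C.1 says "you may want to
avoid".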

It also seems that many requirements are missing from this "profile",
for example XHTML documents that use an internal subset will most likely
break in a legacy user agent as it would show the end delimiter ]> as
textual content, rather than hide it as I think would be required for
both compliant HTML and XHTML user agents. So I am not even sure myself
what the actual scope of the Appendix is, or whether there is actually
anything wrong with it omitting such issues.
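
The internal subset case could be flagged with a check along the
following lines. This is a sketch only: has_internal_subset is a
hypothetical name, Appendix C does not list this as a requirement, and
the pattern assumes that no ">" or "[" appears inside the quoted
identifiers of the document type declaration.

```python
import re

# A "[" inside the DOCTYPE declaration, before its closing ">",
# opens an internal subset.
INTERNAL_SUBSET = re.compile(r"<!DOCTYPE\s[^>\[]*\[")


def has_internal_subset(source: str) -> bool:
    """Report whether the document type declaration has an internal subset."""
    return INTERNAL_SUBSET.search(source) is not None
```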

So it seems close to impossible to write a good software tool that
checks whether a data object meets the constraints "defined" in that
"profile". Such a tool is, however, an often requested feature for the
W3C MarkUp Validator, as it, at least apparently, concerns the
compliance of documents. There is even special interest from authors
who wish to make their content accessible. One problem here that gets
more common every day is that documents are created using XML tools
that create things like

  <a name="x" id="x" />

for which visual inspection does not necessarily reveal any difference,
but A11y tools that rely on the internal document object model
representation of the document will likely notice it, as it would likely
break the document to some extent; see e.g. the Usenet discussion around

  http://groups.google.com/groups?selm=7n13605024figsokutl2qdsncpdfbk2g3a@4ax.com

where in

  http://groups.google.com/groups?selm=40649cb1.16491112@news.individual.net

we were able to identify the actual issue, after quite some effort that
was necessary due to the lack of a tool that properly checks for such
problems. I do not want to rely on wild guesses to write such a tool
properly, only to waste time later fixing problems I introduced for this
reason and to take the blame for it once you clarify the unclear parts.
Hence I chose for the moment to wait until you clarify the issues I have
raised in the XHTML 1.0 Errata and a later XHTML 1.0 Third Edition. I
hope this will happen soon.
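
A rough Python sketch of the kind of check discussed above follows. It
reports elements written in minimized form, such as <a ... />, whose
content model is not EMPTY. The names are hypothetical, the set of EMPTY
elements is taken from the three XHTML 1.0 DTDs, and the regular
expression assumes that attribute values contain no "<" or ">".

```python
import re

# Elements declared EMPTY in the XHTML 1.0 Strict, Transitional,
# and Frameset DTDs.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}

# A start tag that ends in "/>"; group 1 is the element name.
MINIMIZED_TAG = re.compile(r"<([A-Za-z][\w:.-]*)(?:\s[^<>]*?)?/>")


def minimized_non_empty(source: str) -> list[str]:
    """List names of minimized elements whose content model is not EMPTY."""
    return [
        match.group(1)
        for match in MINIMIZED_TAG.finditer(source)
        if match.group(1).lower() not in EMPTY_ELEMENTS
    ]
```

For the fragment shown earlier, the function reports the "a" element
while leaving a minimized "br" alone. What severity such a report should
carry is left open here, for the reasons given above.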

regards.
Received on Monday, 12 July 2004 00:15:02 UTC