Re: Publication of specifications as HTML5

On Sun, Aug 21, 2011 at 11:15 PM, Liam R E Quin <liam@w3.org> wrote:
> Seems to me a requirement should be that the format is suitable for
> archiving.

That's very reasonable.

> This means that the document indicates to exactly which
> version of which specification(s) it conforms, and actually does
> conform.
>
> Without a formal public identifier in the doctype declaration, or a
> version attribute on the HTML element, I don't personally consider it
> acceptable to use HTML 5 in a situation in which long term archiving is
> expected.
>
> Even with such version indication, HTML 5 must obviously not be used in
> archived contexts until it is a stable specification - in W3C terms that
> means a Recommendation.

That doesn't follow.  Archivability of the format is how reliably we
expect it to be readable into the distant future.  Just because
something ostensibly conforms to a Recommendation doesn't mean it's
readable at all.  It would be possible to construct a document that's
valid HTML 4.01 and completely readable in every major browser, but
which is not decipherable using something based solely on the HTML
4.01 standard, because it relies on browser behavior that the standard
either doesn't specify or specifies differently from how browsers
behave.  In fact, it would be easy to do this, because HTML 4.01 is
extremely vague, makes extremely few testable assertions, and has no
test suite at all as far as I know.  A document that conforms to HTML
4.01 is no more archivable than one that conforms to only de facto
standards, because HTML 4.01 doesn't define anything with more
precision than a de facto standard anyway.

Moreover, just because something is not yet a Recommendation doesn't
mean it's not stable.  The relevant parts of HTML and CSS are fixed in
stone because of browser compatibility constraints.  If the browsers
of ten or twenty years from now can't reliably display pages pretty
much the same way that browsers display them now, a huge part of the
web will no longer work.  Browsers today can all display typical ten-
or twenty-year-old pages with no big problems, because if they start
displaying existing pages differently, they immediately get angry
complaints from users.  All evidence suggests this will be a reliable
invariant going forward.

In fact, the HTMLWG explicitly considered the question of whether to
add version identifiers to HTML5:
<http://lists.w3.org/Archives/Public/public-html/2010Dec/0135.html>.
It concluded that a version indicator is not necessary, because
(roughly) all future versions of HTML are expected to be
backward-compatible, and in the unlikely event that they're not, a
version indicator can be added at that point.  Do you think that the
W3C shouldn't use HTML5 for its publications ever if its staff
disagrees with its Working Group on the necessity of version
indicators?  Isn't the Working Group responsible for technical
decisions relevant to the standards it works on, not the W3C
administration?
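
To make the difference concrete, here is a rough Python sketch of what
a tool can learn from the doctype alone.  The function name and the
regular expressions are my own illustration, not anything taken from a
specification: an HTML 4.01 doctype carries a formal public identifier,
while the HTML5 doctype is just "<!DOCTYPE html>" and names no version.

```python
import re

# A doctype with a formal public identifier, as used by HTML 4.01.
PUBLIC_ID = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)
# The bare doctype used by HTML5, which names no version.
BARE = re.compile(r'<!DOCTYPE\s+html\s*>', re.IGNORECASE)


def declared_version(source):
    """Return the public identifier in the doctype, or None if it is versionless.

    Hypothetical helper for illustration only; raises ValueError when
    no doctype is found at all.
    """
    match = PUBLIC_ID.search(source)
    if match:
        return match.group(1)
    if BARE.search(source):
        return None
    raise ValueError("no doctype found")


html401 = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
           '"http://www.w3.org/TR/html4/strict.dtd">')
html5 = "<!DOCTYPE html>"

print(declared_version(html401))
print(declared_version(html5))
```

Note that the identifier only records what the document claims, not
whether it actually conforms or how a browser will render it.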

Still further, HTML is one of the world's most widely-used data
formats.  In the extremely unlikely event that it ever goes away or
changes such that old documents aren't readable anymore, we can be
sure there will be a very long transition period where legacy HTML
processors are available, and will be usable to convert old documents
to new formats.  This would not be an ideal situation, but it's
exceedingly unlikely and still quite manageable in the hypothetical
situation where it does occur.

There's also the fact that the current rules place no restrictions on
content that's included from other files.  Specifications can use
CSS/JS features or image formats or whatnot that are unstably specified
or not specified at all; only new HTML features are off limits.  W3C
specifications commonly rely
on CSS to distinguish normative from non-normative text, but there are
no restrictions on what standards that CSS must conform to.  Even *if*
archivability were really an issue here, and even *if* requiring
standardization were really a solution, the status quo doesn't require
standardization at all for key parts of the document.

Also, HTML is by its nature a text-based format, and the normative
portions of the specifications we're discussing are all text.  Even if
you knew absolutely nothing about HTML as a format, reading the source
code would more than suffice to correctly decipher the standard, given
a little work.
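
As a minimal sketch of that "little work", assuming nothing beyond the
observation that markup sits between angle brackets, a few lines of
Python recover the prose.  The function name and the sample string are
made up for illustration, and this is deliberately crude rather than a
real HTML parser:

```python
import re


def naive_text(source):
    """Recover readable text from HTML source without any HTML parser."""
    # Treat anything between '<' and '>' as markup and drop it.
    without_tags = re.sub(r"<[^>]*>", " ", source)
    # Collapse the leftover runs of whitespace into single spaces.
    return " ".join(without_tags.split())


sample = "<p>User agents <em>must</em> ignore unknown attributes.</p>"
print(naive_text(sample))
```

It does nothing about character references, scripts or style sheets,
but the normative text comes through legibly even so.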

I could raise more objections here -- like pointing out the
unlikelihood of any W3C specifications being useful in the event that
no one knows how to read HTML anymore -- but I think I've made my
point.  Standardization is neither necessary nor sufficient for
archivability, and the document formats under discussion are so widely
used that they'd be suitable for indefinite archival regardless of
whether they were meaningfully standardized (which indeed they largely
were not prior to HTML5).  HTML5 is every bit as archivable as HTML
4.01 in practice, regardless of nominal maturity.  On the other hand,
HTML5 will not reach Recommendation for probably another decade, and
in the meantime other specifications are stuck using an obsolete set
of features.

The request I made was completely practical: there are useful features
in HTML5 and W3C specs should be able to take advantage of them.  Do
you have any objections that are comparably practical?  Do you foresee
any concrete, short- to medium-term harm from permitting the use of
HTML5 for W3C specifications?  Or are the issues you have with
publication as HTML5 solely a matter of principle?

Received on Monday, 22 August 2011 20:34:30 UTC