Spellchecker for the DOM from Sean B. Palmer on 2011-03-12 (www-archive@w3.org from March 2011)

From: Sean B. Palmer <sean@miscoranda.com>
Date: Sat, 12 Mar 2011 12:42:08 +0000
To: www-archive@w3.org
Message-ID: <AANLkTinLGdDzxn=KatYoJSkum_NFz5ddJQsHG8jJCdJb@mail.gmail.com>

This message argues that syntax validation should be forgotten as an
archaic and useless practice. Heuristic structural validation,
especially concerning the transformation to and resulting structure of
the DOM or something like it, should instead be championed. This may
be thought of as a kind of spellchecker-on-the-DOM approach.

In elder days long ago, HTML validation was done in one of two ways.
You could check your page by seeing if it looked good in the browsers,
or you could put it into a syntax validator to see if there were any
errors. The first way was repudiated because HTML pages are
multimodal, so they don't look like anything conceptually. The second
way was repudiated because it bore little resemblance to reality, so
that for example many of the errors actually had no disadvantageous
practical effect.

Things are a little different now, but not much. The first way of
validating still reigns, but there is an extra dimension. As well as
checking that the layout is okay, you often have to check the
interaction. Do your ajax callbacks trigger when you mouseover the
form element? Does your jQuery style apply properly, or did you get
the class name on the applicable div element wrong? The syntax method
of validation seems even further removed from this reality because
structure, not behaviour, is the most salient feature to emerge from
the syntax as validated.

Syntax validation is boolean and cardinal. The boolean is whether you
are conformant. You are either conformant, so your page syntax is
good, or there are errors. The cardinal is in how many errors you
have. A page with one error is just as non-conformant as a page with
500, but the page with one error is easier to fix. TimBL admitted that
this model gave people the wrong impression of validation, and
proposed an inverse system, where you receive a score based on how few
errors there are. The fewer the errors, the closer to a perfect "100"
you are.

Page authors do not, however, want to receive a score and achievements
and display proud badges; not ultimately. They want most of all for
their pages to work in the way that they intend them to work. "Your
page is XHTML 1.0 Strict valid!" and "Your page scores 100 on the
HTML5 awesome scale!" mean nothing. "Your page works as you intend!"
means something.

Validation therefore needs to be practical. Syntax validation is
misleading because all syntaxes are valid, behaviourally speaking. If
a browser crashes on any byte string sent as HTML, then no matter how
malformed and hideous the putative HTML, that is a browser I don't
want to use. The browser may not show me a page if there is a security
error or similar problem, but it should not crash.

If there is no invalid syntax, what does it mean to validate a syntax?
Historically this has meant that some document issued by some
organisation constructed a compound conformance criterion. Why did
they do that?

Say that you have this in your page:

<meta encoding="utf-8">

This will not do what you intend, because no browser understands the
encoding attribute. The new charset hack works because it squats on
the liberal parsing that browsers already applied to the charset
parameter of the text/html content type in the older, longer meta
declaration. This would work:

<meta charset="utf-8">

So in the syntax, we can say that charset="utf-8" is valid, and
encoding="utf-8" is invalid. What we really mean is that charset is
implemented in some code somewhere and produces an effect when used,
whereas encoding produces no effect. The concept of "produces no
effect" has been twisted into "invalid". Consider the following:

<meta custom="utf-8">

What is this? This is not so obviously an intention to set an encoding
for the present document. It looks more like some attribute to be
slurped up by some unknown code. This is dangerous practice, because
in future a more popular piece of code may use the same syntax to
produce a different effect. The supposed value of syntax validation
here is that by flagging custom as invalid, the author of the document
is made to see how pernicious their attribute is.

But what does the author of the page need to know? The author of the
page doesn't need to know that custom is a custom attribute. They
already know that. You can bark at them that this is invalid, but they
added it deliberately in the first place, so they're not actually
going to care. They would want to know if they wrote this
accidentally:

<meta custmo="utf-8">

Where they intended to write "custom" but wrote the typo "custmo" with
the accidental metathesis of the two characters "o" and "m". But to
them, the custom attribute is valid-but-proprietary and the custmo
attribute is invalid, which is to say it has no effect in their
software. They might find this by using their software, but their
software may only produce misleading results, not obviously broken
results. To anyone else, both attributes are equally meaningless: they
produce no effect in any software they know about.
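
The difference between custom and custmo is the sort of thing a
spellchecker catches. Here is a minimal sketch of that idea in Python,
not a real validator. The set of known attributes, the class name and
the check function are all assumptions made up for illustration, and
"custom" stands in for the author's own proprietary attribute.

```python
from difflib import get_close_matches
from html.parser import HTMLParser

# Assumption: attributes that have an effect in some software the author
# uses. "custom" is the author's own attribute from the example above.
KNOWN_META_ATTRS = {"charset", "content", "http-equiv", "name", "custom"}


class AttributeSpellchecker(HTMLParser):
    """Notes meta attributes with no known effect, with a guess at the intent."""

    def __init__(self):
        super().__init__()
        self.notes = []  # (tag, attribute, suggestion or None)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        for name, _value in attrs:
            if name in KNOWN_META_ATTRS:
                continue
            # Closest known name, if any is similar enough to be a likely typo.
            guess = get_close_matches(name, sorted(KNOWN_META_ATTRS), n=1)
            self.notes.append((tag, name, guess[0] if guess else None))


def check(markup):
    checker = AttributeSpellchecker()
    checker.feed(markup)
    checker.close()
    return checker.notes
```

Calling check('<meta custmo="utf-8">') notes the attribute and suggests
"custom", while check('<meta custom="utf-8">') notes nothing. The
report says "this has no effect that I know of, did you mean that?"
rather than "this is invalid".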

People don't need validators to moralise at them; they need validators
to provide them with information. That an attribute is not implemented
in any major browser or other HTML software is information. That an
attribute is invalid is moralisation, of a sort. If in fact a custom
attribute suddenly appears in a major browser, at that point you can
be sure the author of the document will care about it!

This example is expressed in terms of syntax, but the point is that
the effect is what matters. The DOM is important because the DOM is
the first obvious effect of syntax. The first substantive thing a
browser does is convert some input into a DOM, then it works with the
DOM. In fact, the charset attribute is used as an example here because
it is one of the few things which doesn't really concern the DOM as
such, so it's an easier example to give.
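
To make "produces an effect" concrete, here is a small Python sketch of
a consumer that only implements charset. It is a stand-in, not how any
real browser sniffs encodings, and the class and function names are
invented for the illustration.

```python
from html.parser import HTMLParser


class DeclaredEncoding(HTMLParser):
    """Stand-in for software that only understands <meta charset>."""

    def __init__(self):
        super().__init__()
        self.encoding = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.encoding is None:
            # Only the attribute this code implements has any effect;
            # every other attribute is silently ignored.
            self.encoding = dict(attrs).get("charset")


def declared_encoding(markup):
    parser = DeclaredEncoding()
    parser.feed(markup)
    parser.close()
    return parser.encoding
```

Here declared_encoding('<meta charset="utf-8">') returns "utf-8" and
declared_encoding('<meta encoding="utf-8">') returns None. Nothing
rejects the second form. It simply has no effect on this code.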

Consider, then, something more obviously DOM oriented:

<p>This is a <em>very <strong>good</em> example</strong>.

This is not valid HTML, but it works in any decent browser. You can
predict what sort of structure you're going to get out of it quite
easily. In visual terms, "very" will be italic, "good" will be bold
and italic, and "example" will be bold. What kind of DOM structure do
you get though? Either of these looks sensible, for example:

1. <em>very <strong>good</strong></em><strong> example</strong>
2. <em>very </em><strong><em>good</em> example</strong>

Quite possibly you get different DOMs from different browsers. You
might want to know this if you're interacting with the emphasis in
some way, such as if you have a script to make a poem interactive. If
you don't have such interaction, you won't care, and it will be valid
no matter what. If you do care, you care less about which of the DOMs
will be the result than about whether browsers will be consistent and
your interactive code will work across all browsers.
This is information.
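
The two candidate structures can be compared mechanically. The Python
sketch below is only an illustration, with function names of my own
choosing. It parses both candidates as well-formed fragments and shows
that they style every run of text identically while nesting the
elements differently, which is the difference a script would trip over.

```python
import xml.etree.ElementTree as ET

# The two candidate structures from the numbered list above.
DOM_1 = "<em>very <strong>good</strong></em><strong> example</strong>"
DOM_2 = "<em>very </em><strong><em>good</em> example</strong>"


def styled_runs(fragment):
    """Return (text, enclosing inline tags) for each text run, in order."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    runs = []

    def walk(node, active):
        if node.text:
            runs.append((node.text, active))
        for child in node:
            walk(child, active | {child.tag})
            if child.tail:
                # Tail text follows the child but belongs to the parent.
                runs.append((child.tail, active))

    walk(root, frozenset())
    return runs


def nesting(fragment):
    """Return sorted (parent tag, child tag) pairs for the fragment."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    return sorted((parent.tag, child.tag) for parent in root.iter() for child in parent)
```

styled_runs(DOM_1) and styled_runs(DOM_2) are equal, so a reader sees
no difference. nesting(DOM_1) contains ("em", "strong") where
nesting(DOM_2) contains ("strong", "em"), so code that walks from an
em to its children behaves differently on each.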

That validation should be DOM oriented, therefore, does not mean that
the syntax can be wholly disregarded. What we're interested in is
always the effects of the concrete syntax. The DOM comes from a
transformation of the syntax to a structure. We may, but don't always,
want to know for example whether this transformation is standardised
across all browsers. We may, but don't always, want to know the actual
result of the transformation on the syntax that we feed it. We never
care whether the syntax matches some syntax specification for its own
sake, but only about the reasons behind the syntax specification being
as it is.
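
One way a checker could report on the transformation rather than on
the syntax is to say where the structure had to be recovered. The
Python sketch below does that with a stack of open elements. The
recovery it applies is the sketch's own simplification, not the
parsing algorithm of any browser or specification, and the names are
hypothetical.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are not tracked as open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "wbr"}


class NestingReporter(HTMLParser):
    """Notes end tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.notes = []  # (kind, tag, elements still open at that point)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.open_elements:
            self.notes.append(("stray", tag, []))
            return
        # Simplistic recovery: close everything opened inside this element.
        still_open = []
        while self.open_elements[-1] != tag:
            still_open.append(self.open_elements.pop())
        self.open_elements.pop()
        if still_open:
            self.notes.append(("misnested", tag, still_open))


def report(markup):
    reporter = NestingReporter()
    reporter.feed(markup)
    reporter.close()
    return reporter.notes
```

Fed the emphasis example from earlier, report() notes that the em end
tag arrives while strong is still open, and that the later strong end
tag then closes nothing. Those are the places where an author who
scripts against the structure might want to look. An author who does
not can ignore them.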

The roots of this mistake lie in basing HTML on SGML, which was a
tremendous error which has now been recognised. This was recognised in
baby steps. First, the problem was thought to lie in the complexity of
SGML, so XML was made. Then the problem was thought to lie in the
complexity of XML, so Bray devised an XML without processing
instructions and other superfluous features. Meanwhile, people who
actually have to use HTML were making things like Markdown and Textile
to make authorship easier, pointing to what HTML should have been
like in the first place. But there are benefits to a consistent
structure; it's just that even the subset XML syntax didn't go far
enough.

The biggest mistake that the subset XML made was that, again, the
input was too fragile. There could be byte strings which were not
valid in this language, and that concept of invalidity was tied to a
draconian user agent error recovery process: if there is an error,
abort the processing and do not render the content. This was supposed
to make it easier on beleaguered implementors, but the implementors
are few and the authors many.

The WHAT WG's living, breathing, walking, talking, organic HTML
specification (what do we call it now that "HTML5" is so passé?) does
have a processing model which is a first step along this route. The
concept of a conformance checker in that specification is very
outdated, for the reasons outlined above. But the error recovery
process is well defined. The main problem with that is here:

"The error handling for parse errors is well-defined: user agents must
either act as described below when encountering such problems, or must
abort processing at the first error that they encounter for which they
do not wish to apply the rules described below."

What does it mean to "must" abort processing? User agent authors can
and probably shall do as they please. They do not have to follow the
error recovery process that the specification defines, but may in
practice innovate if they need to. Such innovation should be passed
back to the specification editor for possible inclusion. They may also
not abort, but for example give warnings. This isn't covered by the
specification as it stands. The very outdated notion of a conformance
checker is the worse problem, but this is the sort of way in which its
presence is still felt in practical terms.

When you change the conformance checker, you change the tone of the
language. If we had conformance checkers that worked along the
principles outlined herein, not only would they be more realistic and
effective, but that effectiveness may seep into future evolutions of
HTML itself. The DOM may be seen for the ogre in the room that it is,
and gradually updated to be sleeker and more in line with tools that
people actually use, such as jQuery or Prototype. The syntax to
structure transformation, currently considered just a processing
model, may become an SGML- or XML-like language in its own right,
without their manifold obstreperous burrs.

Just because the W3C have had their responsibility as keepers of HTML
de facto abrogated due to so many factors, and HTML released into a
new freedom as a result, there is no reason why Sturgeon's Law should
not continue to operate in the language design domain. What I outline
here attempts to confront that fact head on, instead of wrestling with
it in a way that only produces baby fractal sturgeon roe inside
the mother sturgeon.

I was going to send this to www-html@w3 and whatwg@whatwg, but I do
not in fact intend to discuss this message further in email. Anyway,
Spiderman found that www-archive is the most productive mailing list,
where people may cause a constitutional crisis from mere
unprofessionalism. I may be available as sbp in #swhack on freenode
should anyone find a strong desire to berate me.

-- 
Sean B. Palmer, http://inamidst.com/sbp/
Received on Saturday, 12 March 2011 12:42:41 UTC