- From: Sean B. Palmer <sean@miscoranda.com>
- Date: Sat, 12 Mar 2011 12:42:08 +0000
- To: www-archive@w3.org
This message argues that syntax validation should be forgotten as an archaic and useless practice. Heuristic structural validation, especially concerning the transformation to and resulting structure of the DOM or something like it, should instead be championed. This may be thought of as a kind of spellchecker-on-the-DOM approach.

In elder days long ago, HTML validation was done in one of two ways. You could check your page by seeing if it looked good in the browsers, or you could put it into a syntax validator to see if there were any errors. The first way was repudiated because HTML pages are multimodal, so they don't look like anything conceptually. The second way was repudiated because it bore little resemblance to reality: many of the errors, for example, had no disadvantageous practical effect.

Things are a little different now, but not much. The first way of validating still reigns, but there is an extra dimension. As well as checking that the layout is okay, you often have to check the interaction. Do your ajax callbacks trigger when you mouseover the form element? Does your jQuery style apply properly, or did you get the class name on the applicable div element wrong? The syntax method of validation seems even further removed from this reality because structure, not behaviour, is the most salient feature to emerge from the syntax as validated.

Syntax validation is boolean and cardinal. The boolean is whether you are conformant: either your page syntax is good, or there are errors. The cardinal is how many errors you have. A page with one error is just as unconformant as a page with 500, but the page with one error is easier to fix. TimBL admitted that this model gave people the wrong impression of validation, and proposed an inverse system, where you receive a score based on how few errors there are. The fewer the errors, the closer to a perfect "100" you are.

Page authors do not, however, want to receive a score and achievements and display proud badges; not ultimately. They want most of all for their pages to work in the way that they intend them to work. "Your page is XHTML 1.0 Strict valid!" and "Your page scores 100 on the HTML5 awesome scale!" mean nothing. "Your page works as you intend!" means something. Validation therefore needs to be practical.

Syntax validation is misleading because all syntaxes are valid, behaviourally speaking. If a browser crashes on any byte string sent as HTML, then no matter how malformed and hideous the putative HTML, that is a browser I don't want to use. The browser may decline to show me a page if there is a security error or similar problem, but it should not crash.

If there is no invalid syntax, what does it mean to validate a syntax? Historically it has meant that some document issued by some organisation constructed a compound conformance criterion. Why did they do that? Say that you have this in your page:

  <meta encoding="utf-8">

This will not do what you intend, because no browser understands the encoding attribute. The new way of declaring the encoding works only because it squats on browsers' liberal parsing of the charset parameter in the older <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> form. This would work:

  <meta charset="utf-8">

So in the syntax, we can say that charset="utf-8" is valid and encoding="utf-8" is invalid. What we really mean is that charset is implemented in some code somewhere and produces an effect when used, whereas encoding produces no effect. The concept of "produces no effect" has been twisted into "invalid".
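The practical test here is not whether a validator accepts the attribute, but whether anything reads it. You can see the difference in any browser console on the page in question; what follows is only a sketch of the idea:

  // The parser keeps whatever attribute it is given, valid or not;
  // it lands in the DOM either way.
  var meta = document.querySelector('meta[encoding]');
  console.log(meta && meta.getAttribute('encoding'));

  // But this reports the encoding the browser actually applied. With
  // <meta charset="utf-8"> present it says "UTF-8"; with only the
  // made-up encoding attribute, the browser falls back to its usual
  // default instead.
  console.log(document.characterSet);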
Consider the following:

  <meta custom="utf-8">

What is this? This is not so obviously an intention to set an encoding for the present document. It looks more like some attribute to be slurped up by some unknown code. This is dangerous practice, because in future a more popular piece of code may use the same syntax to produce a different effect. The supposed value of syntax validation here is that by flagging custom as invalid, the author of the document is made to see how pernicious their attribute is.

But what does the author of the page need to know? The author of the page doesn't need to know that custom is a custom attribute. They already know that. You can bark at them that this is invalid, but they added it deliberately in the first place, so they're not actually going to care. They would want to know if they wrote this accidentally:

  <meta custmo="utf-8">

where they intended to write "custom" but typed "custmo", with the accidental metathesis of the two characters "o" and "m". But to them, the custom attribute is valid-but-proprietary and the custmo attribute is invalid, which is to say it has no effect in their software. They might find this by using their software, but their software may only produce misleading results, not obviously broken results. To anyone else, both attributes are equally meaningless: they produce no effect in any software they know about.

People don't need validators to moralise at them; they need validators to provide them with information. That an attribute is not implemented in any major browser or other HTML software is information. That an attribute is invalid is moralisation, of a sort. If in fact a custom attribute suddenly appears in a major browser, at that point you can be sure the author of the document will care about it!

This example is expressed in terms of syntax, but the point is that the effect is what matters. The DOM is important because the DOM is the first obvious effect of syntax. The first substantive thing a browser does is convert some input into a DOM; then it works with the DOM. In fact, the charset attribute is used as an example here precisely because it is one of the few things which doesn't really concern the DOM as such, so it's an easier example to give.

Consider, then, something more obviously DOM oriented:

  <p>This is a <em>very <strong>good</em> example</strong>.

This is not valid HTML, but it works in any decent browser. You can predict quite easily what sort of structure you're going to get out of it. In visual terms, "very" will be italic, "good" will be bold and italic, and "example" will be bold. What kind of DOM structure do you get, though? Either of these looks sensible, for example:

  1. <em>very <strong>good</strong></em><strong> example</strong>
  2. <em>very </em><strong><em>good</em> example</strong>

Quite possibly you get different DOMs from different browsers. You might want to know this if you're interacting with the emphasis in some way, such as if you have a script to make a poem interactive. If you don't have such interaction, you won't care, and it will be valid no matter what. If you do care, you don't so much care which of the DOMs will be the result as whether browsers will be consistent and your interactive code will work across all of them. This is information.

That validation should be DOM oriented, therefore, does not mean that the syntax can be wholly disregarded. What we're interested in is always the effects of the concrete syntax. The DOM comes from a transformation of the syntax to a structure.
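You can watch that transformation happen. Here is a rough sketch, leaning on the fact that reading innerHTML back out serialises whatever DOM the browser actually built:

  // Hand the mis-nested markup to the browser's own parser by way of
  // a scratch element, then read back the serialisation of the DOM it
  // produced. Run in each browser you care about, the output shows
  // which of the candidate structures above (or some third thing)
  // that browser really gives you.
  var scratch = document.createElement('div');
  scratch.innerHTML =
      '<p>This is a <em>very <strong>good</em> example</strong>.';
  console.log(scratch.innerHTML);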
We may, but don't always, want to know, for example, whether this transformation is standardised across all browsers. We may, but don't always, want to know the actual result of the transformation on the syntax that we feed it. We don't ever care whether the syntax matches some syntax specification, except in regard to the reason behind the syntax specification being as it is.

The roots of this mistake lie in basing HTML on SGML, which was a tremendous error and has now been recognised as one. It was recognised in baby steps. First the problem was thought to lie in the complexity of SGML, so XML was made. Then the problem was thought to lie in the complexity of XML, so Bray devised an XML without processing instructions and other superfluous features. Meanwhile, people who actually have to use HTML were making things like Markdown and Textile to make authorship easier, pointing to what HTML should have been like in the first place.

But there are benefits to a consistent structure; it's just that even the subset XML syntax didn't go far enough. The biggest mistake the subset XML made was that, again, the input was too fragile. There could be byte strings which were not valid in this language, and that concept of invalidity was tied to a draconian user agent error recovery process: if there is an error, abort the processing and do not render the content. This was supposed to make things easier on beleaguered implementors, but the implementors are few and the authors many.

The WHATWG's living, breathing, walking, talking, organic HTML specification (what do we call it now that "HTML5" is so passé?) does have a processing model which is a first step along this route. The concept of a conformance checker in that specification is very outdated, for the reasons outlined above. But the error recovery process is well defined. The main problem with that is here:

  "The error handling for parse errors is well-defined: user agents
  must either act as described below when encountering such problems,
  or must abort processing at the first error that they encounter for
  which they do not wish to apply the rules described below."

What does it mean to "must" abort processing? User agent authors can and probably shall do as they please. They do not have to follow the error recovery process that the specification defines, but may in practice innovate if they need to. Such innovation should be passed back to the specification editor for possible inclusion. They may also not abort, but for example give warnings. This isn't covered by the specification as it stands.

The very outdated notion of a conformance checker is the worse problem, but this is the sort of way in which its presence is still felt in practical terms. When you change the conformance checker, you change the tone of the language. If we had conformance checkers that work along the principles outlined herein, not only would they be more realistic and effective, but that effectiveness may seep into future evolutions of HTML itself. The DOM may be seen for the ogre in the room that it is, and gradually updated to be sleeker and more in line with tools that people actually use, such as jQuery or Prototype. The syntax-to-structure transformation, currently considered just a processing model, may become an SGML- or XML-like language in its own right, without their manifold obstreperous burrs.
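To make the kind of conformance checker I have in mind a little less abstract, here is a toy sketch that can be pasted into a browser console. The table of "implemented" attributes is invented for the example and absurdly small, so on a real page it will complain about nearly everything; a real checker would be fed by implementation data rather than a hard-coded list.

  // A toy spellchecker on the DOM: walk the document the browser
  // actually built and, for each attribute that nothing in our table
  // implements, say so -- and, where an implemented attribute is only
  // a small typo away, point at it.
  var KNOWN = {
    meta: ['charset', 'content', 'http-equiv', 'name'],
    p: ['class', 'id', 'style', 'title']
  };

  // Plain Levenshtein edit distance; good enough for short names.
  function distance(a, b) {
    var d = [], i, j;
    for (i = 0; i <= a.length; i++) { d[i] = [i]; }
    for (j = 0; j <= b.length; j++) { d[0][j] = j; }
    for (i = 1; i <= a.length; i++) {
      for (j = 1; j <= b.length; j++) {
        d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1,
          d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1));
      }
    }
    return d[a.length][b.length];
  }

  function check(doc) {
    var elements = doc.getElementsByTagName('*');
    for (var i = 0; i < elements.length; i++) {
      var name = elements[i].tagName.toLowerCase();
      var known = KNOWN[name] || [];
      var attributes = elements[i].attributes;
      for (var j = 0; j < attributes.length; j++) {
        var attr = attributes[j].name;
        if (known.indexOf(attr) !== -1) { continue; }
        var note = 'not implemented in any software I know about';
        for (var k = 0; k < known.length; k++) {
          if (distance(attr, known[k]) <= 2) {
            note += '; did you mean "' + known[k] + '"?';
          }
        }
        console.log('<' + name + ' ' + attr + '>: ' + note);
      }
    }
  }

  check(document);

Run over the examples above, it reports custom and custmo alike as producing no effect in anything it knows about, and a slip such as writing charste for charset gets caught because the nearest implemented attribute is only an edit or two away. That is information, not moralisation.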
Just because the W3C have had their responsibility as keepers of HTML de facto abrogated due to so many factors, and HTML has been released into a new freedom as a result, there is no reason why Sturgeon's Law should not continue to operate in the language design domain. What I outline here attempts to confront that fact head on, instead of wrestling with it in such a way that only produces baby fractal sturgeon roe inside the mother sturgeon.

I was going to send this to www-html@w3 and whatwg@whatwg, but I do not in fact intend to discuss this message further in email. Anyway, Spiderman found that www-archive is the most productive mailing list, where people may cause a constitutional crisis from mere unprofessionalism. I may be available as sbp in #swhack on freenode should anyone find a strong desire to berate me.

-- 
Sean B. Palmer, http://inamidst.com/sbp/