Re: HTML and XML from Elliotte Harold on 2009-02-16 (www-tag@w3.org from February 2009)

From: Elliotte Harold <elharo@metalab.unc.edu>
Date: Mon, 16 Feb 2009 07:30:29 -0800
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Bijan Parsia <bparsia@cs.man.ac.uk>, Bijan Parsia <bparsia@cs.manchester.ac.uk>, www-tag@w3.org
Message-ID: <49998695.4010106@metalab.unc.edu>

Julian Reschke wrote:

> In all cases though, *testing* the document using conforming software 
> will highlight errors early on.

People hand-editing XML, even experts, will make well-formedness 
mistakes. Take that as a given.

The same is true of people hand-editing Java, C++, Perl, Haskell or SQL.

The difference is that these languages are routinely passed to compilers 
or interpreters that rapidly reveal all syntax errors. Nowadays we even 
use editors that reveal syntax errors as we type. Consequently syntax 
errors rarely make it into production (except among college students of 
questionable honesty).

Is it annoying that the compilers can't autocorrect syntax errors? Yes, 
it is; but we have learned from experience that when compilers try to 
autocorrect syntax errors, more often than not they get it wrong. Fixing 
syntax errors at the compiler level leads to far more serious, far more 
costly, and far harder to debug semantic errors down the line. Draconian 
error handling leads to fewer mistakes where the person sitting at the 
keyboard meant one thing but typed another.

Syntax errors are one of the prices developers have to pay in order to 
produce reliable, maintainable software. Languages have been developed 
that attempt, to greater or lesser degrees, to avoid the possibility of 
syntax error. They have uniformly failed.

Although HTML and XML are less complex than Turing-complete programming
languages, I do not think they are sufficiently less complex to make the
lessons learned in Java, C, Perl, etc. inapplicable. Attempts to
autocorrect syntax errors will only cause bigger, costlier,
harder-to-debug problems further down the road. We have already seen
this with HTML. Today it is far easier to develop and debug complex
JavaScript and CSS on web pages by starting with well-formed, valid
XHTML. There's simply less to infer about what the browser is doing with
the page.
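
To make the contrast concrete, here is a minimal sketch in Python (my choice
for illustration; the names TagLogger, lenient_events, and strict_accepts are
hypothetical). It feeds the same mis-nested markup to the standard library's
lenient html.parser tokenizer and to its strict XML parser. Note that
html.parser only tokenizes; it is not a browser's error-recovery algorithm,
so this shows only the "silently accept" half of the problem.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagLogger(HTMLParser):
    """Hypothetical helper: records start/end tags without complaining."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def lenient_events(markup):
    """Tokenize markup leniently; malformed input raises no error."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


def strict_accepts(markup):
    """Return True only if markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

With "<p><b>bold</p>" the lenient path reports a start for p, a start for b,
and an end for p, and never mentions that b was left open. The strict path
rejects the document outright, so the author learns about the mistake before
anything downstream has to guess what was meant.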

Even if HTML 5 brings us to a realm where there are no cross-browser 
differences in object model--a state I doubt we'll see though I'd be 
happy to be proved wrong--we'll still be faced with the reality that 
the code in front of the developer's face is not the code the browser is 
rendering. Debugging problems with web applications and web pages will 
require deep knowledge of HTML error correction arcana. Tools will be 
developed to expose the actual object model, but these tools will not be 
universally available or used.

The simplest, least costly approach is to pay a small cost upfront to 
maintain well-formedness and reject malformed documents. Hand authors 
would quickly learn that you have to "compile" your document before 
uploading it and fix any syntax errors that appear. The cost savings for 
hand authors in future debugging and development would be phenomenal.
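
A "compile" step of this kind can be very small. The following is only a
sketch, assuming Python and its standard xml.etree.ElementTree parser; the
function name check_well_formed is made up for the example, and a real
workflow would likely read the document from a file and run this before
every upload.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text is well-formed XML, otherwise the parser's
    error message (which includes the line and column of the problem)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


if __name__ == "__main__":
    # Hypothetical sample document with an unclosed <b> element.
    problem = check_well_formed("<p><b>bold</p>")
    if problem is not None:
        print("not well-formed:", problem)
```

The check rejects the document rather than repairing it, which is the point:
the author fixes the source, and what gets uploaded is what was written.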

Sadly, for various non-contingent reasons this hasn't happened with HTML 
on the Web and seems unlikely to.  However, I see no reason to back away 
from well-formedness in all the other domains where it achieves such 
colossal benefits. Error-correcting parsers would be a step backwards. 
Until computers become sufficiently smart to understand natural language 
(if they ever do), well-formedness and draconian error handling are the 
best tools we have for interfacing our language with theirs and avoiding 
costly misunderstandings at the translation boundary.

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
Refactoring HTML Just Published!
http://www.amazon.com/exec/obidos/ISBN=0321503635/ref=nosim/cafeaulaitA

Received on Monday, 16 February 2009 15:31:06 UTC