Re: HTML and XML from Bijan Parsia on 2009-02-16 (www-tag@w3.org from February 2009)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Mon, 16 Feb 2009 20:09:00 +0000
To: elharo@metalab.unc.edu
Cc: www-tag@w3.org
Message-Id: <9018A383-24C8-4F90-9A11-A02972A1B527@cs.man.ac.uk>

On 16 Feb 2009, at 15:30, Elliotte Harold wrote:

> Julian Reschke wrote:
>
>> In all cases though, *testing* the document using conforming  
>> software will highlight errors early on.
>
> People hand-editing XML, even experts, will make well-formedness  
> mistakes. Take that as a given.
>
> The same is true of people hand editing Java, C++, Perl, Haskell or  
> SQL.

This is, of course, the foundation of my burden of proof point.

> The difference is that these languages are routinely passed to  
> compilers or interpreters that rapidly reveal all syntax errors.  
> Nowadays we even use editors that reveal syntax errors as we type.

I'll also point out that programming in these languages is a  
specialist activity with high rewards. Even then, it would be  
interesting to see how broken failed projects are and how much time  
goes into syntax management:

	http://www.ppig.org/papers/12th-mciver.pdf
"""One approach to language comparison is to record the way students  
interact with the language and environment – what kinds of mistakes
they make, etc. - rather than attempting to measure language
independent knowledge at the end of the course.

Trivial syntax errors are frequently ignored in the literature
(Spohrer and Soloway 1986, Spohrer, Soloway, and Pope, 1985), perhaps
because they are easily detected by the compiler/interpreter.  Van
Someren (1990) goes so far as to suggest that "The syntax of most
programming languages (including Prolog) is very limited and, in the
initial stages only, a source of errors.", although many programmers
will testify that the syntax of a language can continue to be a
source of errors even for experts, as in the case of "==" in C, which
even expert C programmers occasionally write as "=".

Although they may be easily corrected, these errors can still be
disruptive and frustrating.  Miller’s (1956) discussion of the
capacity of short term memory (7 plus or minus 2 items) suggests that
experts have an advantage over novices, in that they make use of
"chunking", where a single item in short term memory may be, in
effect, an entire algorithm. In contrast, novices are forced to deal
with details of syntax and individual elements of their algorithms,
sometimes to the exclusion of the broader picture of the problem at
hand.  The more details a language requires the novice to recall, the
less space is available for algorithms and the problem itself.
"""

The cognitive overhead of well formedness can be negligible or severe  
when creating XML documents. Well formedness, of course, is rarely  
the *point* or the *interesting* set of constraints. It seems quite  
possible that it's more difficult than it needs to be.

I don't know. I've not found any data specifically on all this. I  
would like to take a stab or two at investigating it.

> Consequently syntax errors rarely make it into production (except  
> among college students of questionable honesty).

Blah. I'm not sure what the point of this comment was. However, in  
the context, it's not very nice.

> Is it annoying that the compilers can't autocorrect syntax errors?  
> Yes, it is; but we have learned from experience that when compilers  
> try to autocorrect syntax errors more often than not they get it  
> wrong.

From experience? I would love to see the data. I know Interlisp's
DWIM facility didn't "take off", but there could be many reasons. All  
I could easily find on this was:
	http://catless.ncl.ac.uk/Risks/7.13.html#subj3

which is not highly negative, either in the experience it reports or
in its overall picture. (Bad DWIM will clearly suck.)

> Fixing syntax errors at the compiler level leads to far more  
> serious, far more costly, and far harder to debug semantic errors  
> down the line.

Really? I just don't know. Some interpreted language environments  
miss lots of syntax errors until you hit that line of code during a run.

> Draconian error handling leads to fewer mistakes where the person  
> sitting at the keyboard meant one thing but typed another.

I've no idea, really.

> Syntax errors are one of the prices developers have to pay in order  
> to produce reliable, maintainable software. Languages have been  
> developed that attempt, to greater or lesser degrees, to avoid the
> possibility of syntax error. They have uniformly failed.

Of course, we're not talking about avoiding the possibility of syntax  
error, but of how to cope with error.

One key difference between programs and data is that I often need to  
manipulate the data even if it has syntax errors. I usually end up  
doing that with text tools. How is that better than dealing with a  
structure that might be extracted? That's what ends up happening  
*anyway* a good deal of the time as I patch the errors so I can just  
*see* and *query* the thing.
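
A minimal sketch of that contrast, in Python, may make it concrete.
It is not from the original thread: the record, the element names and
the helper functions are all made up for illustration, and the
standard library's lenient HTML tokenizer merely stands in for "a
structure that might be extracted". The strict parser returns nothing
for a record with a bare "&", a text tool recovers a field while
ignoring structure, and a lenient pass recovers tagged fields.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical record; the bare "&" makes it non-well-formed XML.
BROKEN = ("<records><article><author>A. Smith & B. Jones</author>"
          "<year>2008</year></article></records>")


def strict_fields(text, wanted=("author", "year")):
    """Draconian route: one well-formedness error means no data at all."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None
    return [(el.tag, el.text) for el in root.iter() if el.tag in wanted]


def text_tool_years(text):
    """Text-tool route: a regular expression that knows nothing of structure."""
    return re.findall(r"<year>(.*?)</year>", text, flags=re.S)


class FieldCollector(HTMLParser):
    """Lenient route: collect (tag, text) pairs without requiring well-formedness."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.current = None   # tag currently being collected, if any
        self.chunks = []      # text chunks seen inside that tag
        self.fields = []      # recovered (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current, self.chunks = tag, []

    def handle_data(self, data):
        if self.current is not None:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.fields.append((tag, "".join(self.chunks)))
            self.current = None


def lenient_fields(text, wanted=("author", "year")):
    parser = FieldCollector(wanted)
    parser.feed(text)
    parser.close()
    return parser.fields


print(strict_fields(BROKEN))    # the strict parser rejects the record
print(text_tool_years(BROKEN))  # the regex recovers the year only
print(lenient_fields(BROKEN))   # the lenient pass recovers tagged fields
```

Whether the lenient pass is *better* than patching by hand is exactly
the open question; the sketch only shows that the recovered structure
is not obviously worse than what the text tools yield.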

> Although HTML and XML are less complex than Turing-complete
> programming languages, I do not think they are sufficiently less
> complex to make the lessons learned in Java, C, Perl, etc.  
> inapplicable. Attempts to auto-correct syntax errors will only  
> cause bigger, costlier, harder to debug problems further down the  
> road. We have already seen this with HTML. Today it is far easier  
> to develop and debug complex JavaScript and CSS on web pages by  
> starting with well-formed, valid XHTML. There's simply less to  
> infer about what the browser is doing with the page.

The question, surely, is not which is easier to program against. I
totally prefer well formed XML etc. etc. I thought the issue was how
best to
cope with problem data and the prevalence of that problem data. The  
claim has been advanced that people (some people) can always, more or  
less, with relative ease, produce well formed XML and transport it in  
various ways to consumers over the Web.

This just doesn't seem to be true.

What we do about it is a different story.

> Even if HTML 5 brings us to a realm where there are no cross- 
> browser differences in object model--a state I doubt we'll see  
> though I'd be happy to be proved wrong--we'll still be faced with
> the reality that the code in front of the developer's face is not  
> the code the browser is rendering. Debugging problems with web  
> applications and web pages will require deep knowledge of HTML  
> error correction arcana. Tools will be developed to expose the  
> actual object model, but these tools will not be universally  
> available or used.

I don't dispute the relative ease, again. No one, to my knowledge,  
does. HTML5 has conformance/validity criteria and there are already  
validators. You will do well by producing valid HTML5, and better  
than having to cope with invalid.

But given the reality of invalid HTML5 and non-well-formed XML...how  
do we minimize the cost of the errors? How do we distribute the costs  
where they can be effectively borne?

> The simplest, least costly approach

Really. I don't see how you can have such confidence in that.

> is to pay a small cost upfront to maintain well-formedness and  
> reject malformed documents.

Often this is not (pragmatically speaking) possible or desirable.  
Often, it's not a small cost at all.

> Hand authors would quickly learn that you have to "compile" your  
> document before uploading it and fix any syntax errors that appear.

I've not seen that quick learning, even within computer science, as my
first message showed.

Also, if we expect XML to be used by broader populations in wider  
contexts, then this seems unrealistic.

> The cost savings for hand authors in future debugging and  
> development would be phenomenal.

I agree that *if* that were the case, hurrah. But between there and  
here there seem to be other places we can get. It makes sense to  
evaluate them carefully.

> Sadly, for various non-contingent reasons this hasn't happened with  
> HTML on the Web and seems unlikely to.  However I see no reason to  
> back away from well-formedness in all the other domains where it  
> achieves such colossal benefits.

It has pretty high costs. In a lot of circumstances.

I don't know why y'all ignore my DBLP example. It was real. I never  
ended up using the data, alas. I don't recall if I reported it, but,  
frankly, it was clearly a significant challenge to fix. Perhaps
that's just one price I must pay for people to have the colossal  
benefits.

> Error correcting parsers would be a step backwards. Until computers  
> become sufficiently smart to understand natural language (if they  
> ever do), well-formedness and draconian error handling are the best  
> tools we have for interfacing our language with theirs and avoiding  
> costly misunderstandings at the translation boundary.

Really? The *best* tools? I guess I'm nowhere near as pessimistic as  
you. Of course, perhaps my aim is lower: I just want to work with  
some data, y'know?

I'm not clear why one category of errors (well formedness ones) is
so much worse than other categories (e.g., validity ones). They are all
errors. One nice thing about XML is separating these classes of  
errors so that even if the document is not valid wrt the relevant  
schema, you can still work with it (transform it, etc.). What's so  
much worse about well formedness errors?
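
To illustrate the two classes of error, here is a small Python sketch.
It is an assumption-laden toy, not anything from the thread: the
documents are invented, and the "every article needs a year" rule is a
hypothetical stand-in for a real schema. The invalid document can
still be read and queried; the non-well-formed one yields nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents. The first breaks the made-up rule below but is
# well-formed; the second never closes <year>, so it is not well-formed.
WELL_FORMED_BUT_INVALID = (
    "<records><article><title>On Errors</title></article></records>"
)
NOT_WELL_FORMED = (
    "<records><article><title>On Errors</title><year>2008</article></records>"
)


def articles_missing_year(root):
    """Hypothetical validity rule standing in for a schema check."""
    return [a for a in root.iter("article") if a.find("year") is None]


def inspect(text):
    """Return (titles, validity_error_count), or None if parsing is refused."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None
    titles = [t.text for t in root.iter("title")]
    return titles, len(articles_missing_year(root))


print(inspect(WELL_FORMED_BUT_INVALID))  # titles are usable despite the validity error
print(inspect(NOT_WELL_FORMED))          # nothing is usable
```

In both cases the title is sitting right there in the text; only the
class of the error decides whether the tooling lets you at it.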

Sometimes it seems just kind of mean, e.g., "Well, ok, you don't have  
to be *valid*, but god damn it we DRAW THE LINE at these well  
formedness constraints!!!! Get it right!!"

This seems to get things the wrong way round. What makes us think that
well-formedness is the *right* class of constraints to be firm on?  
There could be more or there could be fewer, such that the cost/benefit
trade-off is better (instead of small/COLOSSAL perhaps we could hit
tiny/MEGACOLOSSAL or medium/MEGASUPERDUPERCOLOSSALOMGTHESINGULARITY).

In a standards situation there are lots of different possible costs  
including opportunity costs. Perhaps we'll have to live with XML as  
it is. Perhaps we can do better. But surely it's better to  
investigate carefully, rather than make rather unsupported claims  
with colossal confidence :)

My original email was intended to provide some data. I pointed to two  
surveys and recounted some experiences I've had. I've also pointed
out some methodological considerations and made some tentative  
conclusions.

To engage requires, at the very least, either acknowledging some  
common standards of evidence, or proposing some alternative ones, or  
critiquing the ones I've provided. That is, if we are interested in
finding stuff out.

Cheers,
Bijan.
Received on Monday, 16 February 2009 20:05:28 UTC