Re: HTML and XML from Bijan Parsia on 2009-02-15 (www-tag@w3.org from February 2009)

From: Bijan Parsia <bparsia@cs.manchester.ac.uk>
Date: Sun, 15 Feb 2009 21:50:06 +0000
To: www-tag@w3.org
Message-Id: <1A742A9C-C0BE-4789-9BD7-77838DDB4613@cs.manchester.ac.uk>
(I'm replying to Jeff Sonstein's email:
	<http://lists.w3.org/Archives/Public/www-tag/2009Feb/0043.html>*
I'm not on the TAG list, so replies that one would like me to respond
to in a timely manner should be cc'ed to me.)

I also teach XML and related technologies (at both the graduate and  
undergraduate level) at the University of Manchester, and before that  
at the University of Maryland, College Park -- both institutions with  
strong CS departments and student bodies.

In one graduate course (with the majority of students being PhD
students), the students were supposed to produce a simple RDF/XML
document. The majority of the submissions (IIRC) were not well-formed
XML.

I've just had the experience of students handing in Java files that  
were not compilable (due to multiple syntax errors) although they  
managed to submit .class files as well (we're still trying to figure  
that out).

We had students hand in DTDs (which have, I'd guess, a simpler syntax  
overall) written using oXygen which were syntactically incorrect.

In my personal experience, XML datasets found in the wild can be quite
broken, at least when I download them. I don't know how it is now,
but I had tremendous difficulties with DBLP's XML dump. Strange  
characters all over the place. Needed quite a lot of massaging to hit  
well-formed.
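
For concreteness, here is a minimal sketch of the kind of well-formedness check involved. It is not what was actually run on the DBLP dump; Python, the helper name is_well_formed, and the sample strings are all assumptions made up for illustration. It shows how a single unescaped ampersand or stray control character is enough to make a stock XML parser reject a document:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


# Hypothetical records, loosely shaped like a bibliography dump.
good = "<dblp><article><title>A &amp; B</title></article></dblp>"
# An unescaped ampersand in character data is not well-formed.
bad_amp = "<dblp><article><title>A & B</title></article></dblp>"
# A vertical tab (0x0B) is not a legal XML 1.0 character.
bad_ctrl = "<dblp><article><title>A\x0bB</title></article></dblp>"

for name, doc in [("good", good), ("bad_amp", bad_amp), ("bad_ctrl", bad_ctrl)]:
    ok, message = is_well_formed(doc)
    print(name, "well-formed" if ok else "NOT well-formed: " + message)
```

Note that this only tests well-formedness; it says nothing about validity against a DTD or schema.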

To pull some older surveys I happen to have on hand (which I didn't  
seek out for data on document correctness):

	http://webdb2004.cs.columbia.edu/papers/6-1.pdf

"""It was a bit disappointing to notice that a relatively large
fraction of the XSDs we retrieved did not pass a conformance
test by SQC. As mentioned in Section 2, only 30 out
of a total of 93 XSDs were found to be adhering to the current
specifications of the W3C [17].

Often, lack of conformance can be attributed to growing
pains of an emerging technology: the SQC validates according
to the 2001 specification and 19 out of the 93 XSDs have
been designed according to a previous specification. Some
simple types have been omitted or added from one version
of the specs to another causing the SQC to report errors.
Some errors concern violation of the Datatypes part of the
specification [1]: regular expressions restricting xsd:string
are malformed.

Some XSDs violate the XML Schema specification by e.g.
specifying a type attribute for a complexType element or leaving
out the name attribute for a top-level complexType element."""

http://db.cis.upenn.edu/DL/dtds.pdf

"""As mentioned previously, the first striking observation is
that most published DTDs are not correct,
with missing elements, wrong syntax or incompatible
attribute declarations. This might prove that such
DTDs are being used for documentation purposes only
and are not meant to be used for validation at all.
The reason might be that because of the ad-hoc syntax
of DTDs (inherited from SGML), there are no
standard tools to validate them. This issue will be
addressed by proposals that use XML as the syntax
to describe DTDs themselves (see Section 5)."""

(I've some concerns with the methodology and analysis, but these are  
some sort of evidence.)

In the OWL working group, we discovered recently that a number (around
190, I believe) of test cases from the old spec were incorrectly formed
relative to what the test described them to be (they each were missing
*one triple* in the header to be OWL DL; both species validators
passed them as OWL DL; these were collected from members of the old
group).

Users range wildly in capability, even from moment to moment as  
circumstances shift. It's a bit odd to claim that something,  
especially involving precise syntax with a fair number of dark  
corners, is easy for... users... without some more precise
specification of the users and some reasonable evidence.

I also don't think pejoratives or scorn or disbelief are helpful for  
understanding the situation. (Some of the messages here seem to be  
continuous with:
	http://www.tbray.org/ongoing/When/200x/2004/01/11/PostelPilgrim
I understand this attitude, I truly do. That's how I felt after the  
3rd RDF/XML submission wasn't well formed...)

One needs to be quite careful when extrapolating from an expert,
out-of-context assessment of the difficulty of a technology to the
general usability of that technology.

Furthermore, one must take care to assess the additional cognitive load
and, indeed, the general efficiency drain vs. gain. With the *best*  
circumstance, well-formedness or validity might be a boon (I like the  
autocomplete oXygen gives me!) But how difficult is it to get into the  
best circumstances and to stay there? There are opportunity costs  
involved.

I, personally, am interested in XML5-like efforts, FWIW.

Cheers,
Bijan.

* PS, Jeff, you write, "and it is not hard to learn to create valid  
*and* well-formed XML consistently". But validity is, to a first  
approximation, relative to a schema. That schema can describe fairly  
complex constraints and formats. So I don't think it's so very easy to  
aim for validity. Some of that aim will very strongly depend on the  
toolchain.
Received on Sunday, 15 February 2009 21:50:43 UTC