- From: Bijan Parsia <bparsia@cs.manchester.ac.uk>
- Date: Sun, 15 Feb 2009 21:50:06 +0000
- To: www-tag@w3.org
(I'm replying to Jeff Sonstein's email: <http://lists.w3.org/Archives/Public/www-tag/2009Feb/0043.html>* I'm not on the TAG list, so replies that one would like me to respond to in a timely manner should be cced to me.)

I also teach XML and related technologies (at both the graduate and undergraduate level) at the University of Manchester, and before that at the University of Maryland, College Park -- both institutions with strong CS departments and student bodies. In one graduate course (with the majority of students being PhD students), the students were supposed to produce a simple RDF/XML document. The majority of the submissions (IIRC) were not well-formed XML. I've just had the experience of students handing in Java files that were not compilable (due to multiple syntax errors), although they managed to submit .class files as well (we're still trying to figure that out). We had students hand in DTDs (which have, I'd guess, a simpler syntax overall) written using oXygen which were syntactically incorrect.

In my personal experience, XML datasets found in the wild can be quite broken, at least when I download them. I don't know how it is now, but I had tremendous difficulties with DBLP's XML dump: strange characters all over the place. It needed quite a lot of massaging to become well formed (a sketch of that kind of mechanical triage appears below).

To pull some older surveys I happen to have on hand (which I didn't seek out for data on document correctness):

http://webdb2004.cs.columbia.edu/papers/6-1.pdf

"""It was a bit disappointing to notice that a relatively large fraction of the XSDs we retrieved did not pass a conformance test by SQC. As mentioned in Section 2, only 30 out of a total of 93 XSDs were found to be adhering to the current specifications of the W3C [17]. Often, lack of conformance can be attributed to growing pains of an emerging technology: the SQC validates according to the 2001 specification and 19 out of the 93 XSDs have been designed according to a previous specification. Some simple types have been omitted or added from one version of the specs to another, causing the SQC to report errors. Some errors concern violation of the Datatypes part of the specification [1]: regular expressions restricting xsd:string are malformed. Some XSDs violate the XML Schema specification by e.g. specifying a type attribute for a complexType element or leaving out the name attribute for a top-level complexType element."""

http://db.cis.upenn.edu/DL/dtds.pdf

"""As mentioned previously, the first striking observation is that most published DTDs are not correct, with missing elements, wrong syntax or incompatible attribute declarations. This might prove that such DTDs are being used for documentation purposes only and are not meant to be used for validation at all. The reason might be that, because of the ad-hoc syntax of DTDs (inherited from SGML), there are no standard tools to validate them. This issue will be addressed by proposals that use XML as the syntax to describe DTDs themselves (see Section 5)."""

(I've some concerns with the methodology and analysis, but these are some sort of evidence.)

In the OWL working group, we recently discovered that a number (around 190, I believe) of test cases from the old spec were incorrectly formed relative to what the tests described them to be (each was missing *one triple* in the header needed to be OWL DL; both species validators passed them as OWL DL; and these were collected from members of the old group).

Users range wildly in capability, even from moment to moment as circumstances shift.
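Since several of the anecdotes above come down to purely mechanical well-formedness failures, here is a minimal sketch of the first-pass triage I have in mind (my illustration, using only Python's standard library; the directory layout is invented):

```python
# Minimal well-formedness triage over a directory of XML files.
# (Illustrative only; xml.etree will also choke on entities defined in
# an external DTD, which is part of what made the DBLP dump painful.)
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

def is_well_formed(path):
    """Return True if the file parses as well-formed XML."""
    try:
        ET.parse(str(path))  # raises ParseError on any well-formedness violation
        return True
    except ET.ParseError as err:
        print(f"{path}: {err}", file=sys.stderr)
        return False

if __name__ == "__main__":
    files = sorted(Path(sys.argv[1]).glob("*.xml"))
    bad = [p for p in files if not is_well_formed(p)]
    print(f"{len(bad)} of {len(files)} files failed well-formedness")
```

The check itself is trivial to automate; getting real documents (and real users) to pass it is the hard part.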
It's a bit odd to claim that something, especially something involving precise syntax with a fair number of dark corners, is easy for... users... without some more precise specification of the users and some reasonable evidence. I also don't think pejoratives or scorn or disbelief are helpful for understanding the situation. (Some of the messages here seem to be continuous with: http://www.tbray.org/ongoing/When/200x/2004/01/11/PostelPilgrim -- I understand this attitude, I truly do. That's how I felt after the 3rd RDF/XML submission wasn't well formed...)

One needs to be quite careful about extrapolating from an expert's out-of-context assessment of the difficulty of a technology to the general usability of that technology. Furthermore, one must take care to assess the additional cognitive load and, indeed, the general efficiency drain vs. gain. In the *best* circumstances, well-formedness or validity might be a boon (I like the autocomplete oXygen gives me!). But how difficult is it to get into the best circumstances, and to stay there? There are opportunity costs involved.

I, personally, am interested in XML5-like efforts, FWIW.

Cheers,
Bijan

* PS, Jeff, you write, "and it is not hard to learn to create valid *and* well-formed XML consistently". But validity is, to a first approximation, relative to a schema. That schema can describe fairly complex constraints and formats. So I don't think it's so very easy to aim for validity. Some of that aim will depend very strongly on the toolchain.
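To make the well-formed-vs.-valid distinction in the PS concrete, here is a minimal sketch (my illustration, assuming the third-party lxml library; the document and schema are invented for the example) of a document that any XML parser accepts as well formed, yet which is invalid relative to one particular schema:

```python
# Well-formedness vs. validity: the same well-formed document can be
# valid against one schema and invalid against another.
from lxml import etree

doc = etree.fromstring(b"<greeting>hello</greeting>")  # well formed

# A hypothetical schema requiring <greeting> to start with a capital letter.
schema = etree.XMLSchema(etree.fromstring(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="greeting">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[A-Z].*"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>"""))

print(schema.validate(doc))  # False: well formed, but invalid w.r.t. this schema
```

Validity is a moving target in a way well-formedness is not: change the schema (here, the pattern facet) and the very same document flips between valid and invalid.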
Received on Sunday, 15 February 2009 21:50:43 UTC