Re: What would count as an unbiased survey? from Robin Berjon on 2009-05-29 (www-tag@w3.org from May 2009)

From: Robin Berjon <robin@berjon.com>
Date: Fri, 29 May 2009 16:59:47 +0200
To: noah_mendelsohn@us.ibm.com
Cc: "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-tag@w3.org
Message-Id: <4D87FBE6-4454-4C7A-96CB-D9658E9FE1B3@berjon.com>
Hi Noah,

On May 29, 2009, at 16:14 , noah_mendelsohn@us.ibm.com wrote:
> Robin Berjon wrote:
>> Or more precisely, that a lot of languages are defined using
>> XML Schema — the distinction being that in a fair number of
>> cases that I've seen the schema may be in the specification
>> but then no one ever uses it as part of the production chain.
>
> Do you mean XSD is not used in production for validation or also that
> it's not used by tooling.

Sorry, I was indeed unclear and failed to state the bias in my  
sampling. I actually mean not used *at all*. A lot of the usage I see  
is rooted in the mobile or TV industry (the latter having even more  
constrained devices), or around rather simple data formats (of the kind  
that is used in the Widgets Packaging and Configuration specification,  
or similar rather straightforward formats).

Things may be improving (slowly), but in a lot of the cases here there  
is no data binding, and no validation at any step. People will  
generally proceed with more general testing that will catch validity  
errors in the XML at the processor level, but without relying on  
schema-based validation. I don't have numbers to back this up, but I'm  
not pulling it out of thin air either: I've seen a shockingly large  
number of schemata extracted from specifications (by SDOs or  
customers) that either weren't accepted by schema validators, or  
didn't properly validate real content* (or in extreme cases weren't  
even well-formed XML) — and no one had noticed months into deployment.  
This is the community that I think we should address: people who use  
schemata mostly for documentation, and could really use usage guidance.
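
To illustrate how little it would take to catch the extreme cases above, here is a minimal sketch using only Python's standard library. The function name and the choice of checks are assumptions made for this example; it only tests that an extracted schema is well-formed XML with an xs:schema root, which is far short of what a real schema validator would check.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

def smoke_test_schema(text):
    """Return a list of problems found in a schema extracted from a spec.

    Hypothetical helper: it checks well-formedness and the root element
    only. It does NOT check that the schema is a valid XML Schema.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    problems = []
    # ElementTree reports names in {namespace}local form.
    if root.tag != f"{{{XSD_NS}}}schema":
        problems.append(f"root element is {root.tag}, not xs:schema")
    return problems
```

An empty list means the file passed this very low bar; anything else would have been worth noticing before deployment.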

I think that there are several factors at play here. One is that we  
are talking about documents that are at least an order of magnitude  
simpler (by whichever measure) than those used for instance in the  
financial industry or in B2B, ERP, etc. This makes data binding less  
valuable, and the far lesser degree of language composability means  
that user agent validation tends to be less complex (yet more  
complete) than schema-based approaches.

Another is that existing schema languages haven't really been designed  
to operate on constrained devices. XML Schema is amenable to streaming  
but its complexity gets in the way; RelaxNG tends to take up too much  
memory (I haven't looked at other options). And in any case they slow  
things down. I've heard the argument that if you can do mobile video  
then surely you can do a bit of validation, but video has an impact on  
the user experience that validation lacks — thereby justifying the  
cost of research in faster software, special chips, etc. And not  
validating at the receiving endpoint tends to reduce the value of  
validating at all.

Then you have to take into account the fact that a lot of that data is  
produced on systems that don't have very potent or reliable schema  
implementations (at least not for XML Schema; RNG and Schematron tend  
to be more readily available). The vast tracts of information produced  
using PHP, Perl, Ruby, etc. are unlikely to use much validation or  
data binding. And that is a huge part of the web, including the less  
visible mobile bits.

I can't seem to dig it up now but I recall an article (I believe from  
Tim Bray) from about a decade ago in which he explained that people  
didn't use DTDs with XML: they mostly drew up a few example  
documents and emailed them over. My experience is that there's still a  
large community that keeps working in pretty much the same way —  
except that they'll toss in a schema because it seems to be what's  
done, because it makes it look real and professional.

Please note that I'm dissing neither XML Schema nor people who use  
that approach. I just happen to think that we would probably provide  
better value by looking at what the people who couldn't tell a  
deterministic content model from the distinguished property value  
denoting it and don't want to are doing, where they're shooting  
themselves in the feet, what could be simplified, etc. than by  
debating miscellaneous technicalities differentiating schema languages  
(no matter how much fun that is).

* elementFormDefault probably accounting for a solid majority of such  
cases.
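
For readers who have not run into this: elementFormDefault defaults to "unqualified" in XML Schema, so locally declared child elements are expected to be in no namespace, whereas real content usually puts everything in a default namespace. The sketch below only shows the difference between the two instance styles as a namespace-aware parser sees them; the namespace URI and element names are made up for the example, and no schema validation is performed.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:widgets"  # hypothetical target namespace

# Style A: default namespace declaration, so the child is qualified too.
# This is what authors typically write.
qualified = ET.fromstring(
    f'<widget xmlns="{NS}"><name>Clock</name></widget>'
)

# Style B: only the root is prefixed; the child is in no namespace.
# This is what a schema with elementFormDefault="unqualified" describes.
unqualified = ET.fromstring(
    f'<w:widget xmlns:w="{NS}"><name>Clock</name></w:widget>'
)

def child_names(root):
    """Return the expanded names of the direct children of root."""
    return [child.tag for child in root]

print(child_names(qualified))    # child carries the namespace
print(child_names(unqualified))  # child has no namespace
```

A schema that omits elementFormDefault="qualified" describes style B, so content written in style A will not validate against it.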

-- 
Robin Berjon - http://berjon.com/
     Feel like hiring me? Go to http://robineko.com/
Received on Friday, 29 May 2009 15:00:20 UTC