Re: What would count as an unbiased survey? from noah_mendelsohn@us.ibm.com on 2009-05-29 (www-tag@w3.org from May 2009)

From: <noah_mendelsohn@us.ibm.com>
Date: Fri, 29 May 2009 12:43:47 -0400
To: Robin Berjon <robin@berjon.com>
Cc: "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-tag@w3.org
Message-ID: <OF803F26BB.33CCD14E-ON852575C5.005900AF-852575C5.005BB6DC@lotus.com>

Ah, so you were looking primarily at mobile users, with simple documents
and typically very small devices. That's a useful datapoint, but we do
seem to be spending a lot of energy reasoning from the negative. Nobody
doubts that there are lots of communities in which XSD is unused or
misused. As you pointed out, this same community often doesn't bother to
check XML well-formedness, but that doesn't do much to make the case that
XML itself isn't widely used, or that the well-formedness requirement
doesn't have value and isn't enforced in many other communities.

This all feels a bit like trying to say (and I know you're not saying this,
Robin, but the thread has this feel): "gee, do we really know whether C++ is
widely used? I don't see a lot of it on LISP machines, and by the way,
can you please define 'widely used' and come up with objective measures?"

You can always find important communities that don't use some technology
or other. I would posit that by most useful >subjective< measures C++ is
widely used, and not occasionally misused. Full stop. Is it really worth
debating the troubles in pinning down objective measures? For a less
widely used language like, say, Euler [1], I agree; you have to do some real
digging to see where the use is. C++ has, IMO, crossed the chasm. You no
longer need to debate whether it has a large user community. We all know
it does. Any evidence to the contrary would be presumed suspect. By
quite similar measures I would argue that XML itself is widely used, full
stop.

In a similar subjective spirit, I just don't see why we're having this
debate about XSD. It's not as widely used as C++ or XML itself, but I
believe that in spirit it too has crossed the chasm. It's used in every
WSDL; I strongly believe that it's used as input to lots of widely used
databinding tools, which in turn generate code that's widely deployed in
production. There's lots of anecdotal evidence of widespread adoption,
such as what Henry's provided. It's obviously used as a documentation
standard. It's the type system for XSLT 2.0 and XQuery, use of which is I
think starting to rise. Furthermore, to avoid the appearance of hyping my
own company's products, I'll point to my competitor's as evidence of
significant continued investment [2] in XSD. Does anyone really think
that a company like Microsoft would make what appears to be a major
2008/2009 investment in graphical tools for building and manipulating XSD
schemas if they didn't know that there was, by some subjective measure, a
very large user community for that technology?

As Henry Thompson said:

> The _only_ reason for pursuing this question is to rebut the
> proposition, often advanced but not, to my knowledge, ever
> substantiated, that W3C XML Schema is not used very much, so e.g.
> delaying the next version is not a big deal.

Are other schema languages also widely used? No doubt (subjective
conclusion). Much more widely than XSD? For purposes of this discussion,
I don't care (though I doubt it.) Is there anyone in this thread really
seriously saying that XSD, a W3C Recommendation, is so rarely used
that >such lack of use is the reason for not going ahead with an improved
version of the Recommendation<?

I understand that Rick has raised >other< reasons for not updating XSD,
which I'll oversimplify as "it's a distraction from the important business
of admitting that XSD is a bad language and that we should be working on
either cleanup or replacement". I happen to disagree, but it's
appropriate to consider those points on the merits. To say, however, that
there is no significant XSD user community who might benefit from these
enhancements flies in the face of all the evidence I've seen. From what
I've seen on the schema-dev mailing list, there is also good evidence that
many of the XSD 1.1 enhancements will in fact be useful to current XSD
users.

Let's please just move ahead with the review of the XSD 1.1 CR, and
evaluation of implementation experience. Rick's concern has, I believe,
been registered as a comment on XSD 1.1 [3], and I expect that will cause
it to get due consideration as part of the W3C process.

Noah

[1] http://en.wikipedia.org/wiki/Euler_(programming_language)
[2] http://blogs.msdn.com/xmlteam/archive/2007/08/27/announcing-ctp1-of-the-xml-schema-designer.aspx
[3] http://www.w3.org/Bugs/Public/show_bug.cgi?id=6940

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Robin Berjon <robin@berjon.com>
05/29/2009 10:59 AM

To: noah_mendelsohn@us.ibm.com
cc: "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-tag@w3.org
Subject: Re: What would count as an unbiased survey?

Hi Noah,

On May 29, 2009, at 16:14 , noah_mendelsohn@us.ibm.com wrote:
> Robin Berjon wrote:
>> Or more precisely, that a lot of languages are defined using
>> XML Schema — the distinction being that in a fair number of
>> cases that I've seen the schema may be in the specification
>> but then no one ever uses it as part of the production chain.
>
> Do you mean XSD is not used in production for validation or also that
> it's not used by tooling.

Sorry, I was indeed unclear and failed to state the bias in my
sampling. I actually mean not used *at all*. A lot of the usage I see
is rooted in the mobile or TV industry (the latter having even more
constrained devices), or around rather simple data formats (of the kind
that is used in the Widgets Packaging and Configuration specification,
or similar rather straightforward formats).

Things may be improving (slowly), but in a lot of the cases here there
is no data binding, and no validation at any step. People will
generally proceed with more general testing that will catch validity
errors in the XML at the processor-level, but without relying on
schema-based validation. I don't have numbers to back this up, but I'm
not pulling it out of thin air either: I've seen a shockingly large
number of schemata extracted from specifications (by SDOs or
customers) that either weren't accepted by schema validators, or
didn't properly validate real content* (or in extreme cases weren't
even well-formed XML) — and no one had noticed months into deployment.
This is the community that I think we should address: people who use
schemata mostly for documentation, and could really use usage guidance.
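
To make the "not even well-formed" case concrete, here is a minimal sketch
of the kind of cheap sanity check that would have caught the worst of these
before deployment. It uses only the Python standard library, so it does
*not* perform schema validation; it only checks that the extracted schema
document parses as XML and that its root is xs:schema. The function name
check_schema_document is hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of W3C XML Schema documents.
XSD_NS = "http://www.w3.org/2001/XMLSchema"


def check_schema_document(text):
    """Return a list of problems found in a schema document (a string).

    This is only a well-formedness and root-element check, not a check
    that the schema itself is valid or that it validates real content.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        # The document is not even well-formed XML.
        return ["not well-formed XML: %s" % err]

    problems = []
    if root.tag != "{%s}schema" % XSD_NS:
        problems.append(
            "root element is not xs:schema in the XML Schema namespace"
        )
    return problems
```

An empty list means only that the document passed these two checks; whether
a real schema processor accepts it is a separate question.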

I think that there are several factors at play here. One is that we
are talking about documents that are at least an order of magnitude
simpler (by whichever measure) than those used for instance in the
financial industry or in B2B, ERP, etc. This makes data binding less
valuable, and the far lesser degree of language composability means
that user agent validation tends to be less complex (yet more
complete) than schema-based approaches.

Another is that existing schema languages haven't really been designed
to operate on constrained devices. XML Schema is amenable to streaming
but its complexity gets in the way; RelaxNG tends to take up too much
memory (I haven't looked at other options). And in any case they slow
things down. I've heard the argument that if you can do mobile video
then surely you can do a bit of validation, but video has an impact on
the user experience that validation lacks — thereby justifying the
cost of research in faster software, special chips, etc. And not
validating at the receiving endpoint tends to reduce the value of
validating at all.

Then you have to take into account the fact that a lot of that data is
produced on systems that don't have very potent or reliable schema
implementations (at least not for XML Schema; RNG and Schematron tend
to be more readily available). The vast tracts of information produced
using PHP, Perl, Ruby, etc. are unlikely to use much validation or
data binding. And that is a huge part of the web, including less visible
mobile bits.

I can't seem to dig it up now but I recall an article (I believe from
Tim Bray) from about a decade ago in which he explained that people
didn't use DTDs with XML: they mostly drew up a few example
documents and emailed them over. My experience is that there's still a
large community that keeps working in pretty much the same way —
except that they'll toss in a schema because it seems to be what's
done, because it makes it look real and professional.

Please note that I'm dissing neither XML Schema nor people who use
that approach. I just happen to think that we would probably provide
better value by looking at what the people who couldn't tell a
deterministic content model from the distinguished property value
denoting it and don't want to are doing, where they're shooting
themselves in the feet, what could be simplified, etc. than by
debating miscellaneous technicalities differentiating schema languages
(no matter how much fun that is).

* elementFormDefault probably accounting for a solid majority of such
cases.
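
As a rough illustration of the elementFormDefault pitfall: the attribute
defaults to "unqualified", so a schema that declares a targetNamespace but
leaves elementFormDefault unset expects locally declared elements to be in
no namespace, which is rarely how the instance documents are written. The
sketch below (standard-library Python; the function name
element_form_warning is hypothetical) only flags that combination, it does
not validate anything.

```python
import xml.etree.ElementTree as ET


def element_form_warning(schema_text):
    """Return a warning string for the common elementFormDefault trap,
    or None if the schema does not show that combination."""
    root = ET.fromstring(schema_text)
    target_ns = root.get("targetNamespace")
    # XML Schema defines "unqualified" as the default when the
    # attribute is absent.
    form = root.get("elementFormDefault", "unqualified")
    if target_ns and form == "unqualified":
        return (
            "targetNamespace %r is set but elementFormDefault is "
            "unqualified: local elements are expected in no namespace"
            % target_ns
        )
    return None
```

A warning here is only a hint to look at real content; some schemas use
unqualified local elements on purpose.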

--
Robin Berjon - http://berjon.com/

Feel like hiring me? Go to http://robineko.com/

Received on Friday, 29 May 2009 16:42:25 UTC