
Re: SV: SV: SV: SV: Schema help

From: <noah_mendelsohn@us.ibm.com>
Date: Mon, 12 Dec 2005 20:55:31 -0500
To: "Jack Lindsey" <tuquenukem@hotmail.com>
Cc: brs@itst.dk, mike@saxonica.com, petexmldev@tech-know-ware.com, xmlschema-dev@w3.org
Message-ID: <OFDB515A6D.4B86068B-ON852570D6.000858BE-852570D6.000A93E4@lotus.com>

I don't want to pursue this discussion too far, because I don't feel 
sufficiently expert in Schematron at the moment that I should be strongly 
implying that it's good or bad in any particular context.  I had a general 
awareness that it tends to be implemented using XSLT, which is definitely 
overhead I don't want to have in the middle of a high performance 
validator, but I want to research other implementation alternatives before 
drawing any firm conclusions about it.

Anyway, since you've asked a few specific questions, I'll try to answer 
them.

Jack Lindsey writes:

> Do you not endorse reuse and the orthogonal alignment of W3C offerings?

Of course.  Whatever the other pros and cons of Schematron, its use of 
XPath and other existing Recs. is great. 

> Tim's RDF/Java example does not seem to be about the degree to 
> which you should use the features of a language but how the 
> objectives of a language can fundamentally impact your 
> flexibility when you try to repurpose data instances expressed 
> in them.  I am not seeing the parallel with validation.  Can 
> you please enlighten me?

Sure.  Tim's point is that having something expressed in a more powerful 
language has drawbacks.  Let's take the case of so-called Turing Complete 
[1] languages including Java, C, JavaScript and XSLT.  From that Wikipedia 
article [1]:

"One important result from computability theory is that it is impossible 
in general to find if a program written in a Turing-complete language will 
continue executing forever or will stop within a finite period of time 
(see halting problem) [2]."
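To make the quoted result a bit more concrete, here is a toy Python sketch 
(illustration only, nothing to do with schema tooling): the Collatz 
function below has terminated for every input anyone has tried, yet no one 
has ever proved it halts for all inputs -- a small taste of why "will this 
program stop?" has no general answer for Turing-complete languages.

```python
# Illustration only: a tiny "transform" whose termination is an open
# problem (the Collatz conjecture).  Every tested input halts, but no
# proof covers all inputs -- so even this four-line loop resists the
# question "does it always stop?".

def collatz_steps(n):
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

# collatz_steps(6) -> 8   (6, 3, 10, 5, 16, 8, 4, 2, 1)
```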

So, as one example of a concern, we cannot tell whether the transform 
performed by a given XSLT stylesheet will ever complete.  Contrast that 
with XML Schema as it exists today:  while you can build slower or faster 
implementations, I believe I'm correct in saying that there are known 
algorithms for doing the validation required by >any< possible schema in a 
time that is bounded, and usually quite tractable.  There is still some 
work being done on nested occurrence counts, but even those now seem to 
have practical implementations.
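For intuition only, here is a rough Python sketch (the names are made up, 
and real validators are far more sophisticated) of why schema content 
models can be validated in bounded time: a model like (a, b*, c) is in 
effect a regular expression over element names, so it compiles to a finite 
automaton and validation is a single pass over the children -- it cannot 
fail to terminate.

```python
# Hypothetical sketch, not any real validator's code: the content
# model (a, b*, c) as a hand-compiled DFA over element names.
# state 0 --a--> 1, state 1 --b--> 1, state 1 --c--> 2 (accepting)
TRANSITIONS = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
ACCEPTING = {2}

def validate(children):
    state = 0
    for name in children:              # exactly len(children) steps
        state = TRANSITIONS.get((state, name))
        if state is None:
            return False               # no transition: invalid
    return state in ACCEPTING

# validate(["a", "b", "b", "c"]) -> True
# validate(["a", "c", "b"])      -> False
```

The point is not the code itself but the shape of the computation: one 
bounded loop, no recursion, no possibility of running forever.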

Furthermore, Tim's point is that with a less powerful language, you are 
more likely to be able to reason about a constraint.  If I have:

        <xsd:choice>
                <xsd:element ref="a"/>
                <xsd:element ref="b"/>
        </xsd:choice>

it's very easy to build tools that understand that this choice will accept 
exactly one element, named "a" or "b".  With the above constraint, it's 
easy to build a UI widget that will automatically prompt for an "a" or a "b" 
based on the schema.  If, by contrast, I wrote a Java or XSLT program to 
test for an "a" or a "b", it might be a simple program but it might in 
principle be a very complex one.  It's much harder to read through the 
code for that Java program or XSLT stylesheet and see what it does.  In 
fact, we can't in general tell mechanically whether an arbitrary Java or 
XSLT program will even reach the point of making a decision about what's 
valid.  Building the UI widget by reading the program code is similarly 
difficult.
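As a hedged illustration (hypothetical data structures, not any real 
tool's API), the contrast might look like this in Python: the declarative 
constraint is data a tool can read, while the equivalent check written as 
code can only be executed, not inspected.

```python
# Hypothetical sketch: the same "exactly one of a or b" constraint,
# once as inspectable data and once as opaque program logic.

choice = {"kind": "choice", "elements": ["a", "b"]}   # declarative model

def prompt_from_model(model):
    # A tool can read the model and build a UI prompt mechanically.
    return "Pick one of: " + ", ".join(model["elements"])

def is_valid(children):
    # The same constraint as code: correct, but a tool can only run
    # it -- recovering "exactly one of a or b" from the source would
    # require program analysis, which is undecidable in general.
    return len(children) == 1 and children[0] in ("a", "b")

# prompt_from_model(choice) -> "Pick one of: a, b"
```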

I'm making at least three points here, two of which I'm fairly confident 
of and one of which I'm not sure of at all:

1. Performance matters in validation.  That's my opinion, but I hold it 
strongly.  I want to use validation in high performance web services 
scenarios, and I believe I know how to do it by optimizing today's schema 
language.  I want to retain that capability when and if we add 
co-constraints.

2. I think the principle of least power is important, for some of the 
reasons explained above.  It will be better to find a simple declarative 
approach that hits an 80/20 point than to implement something that's more 
general but hard to reason about.

3. (potentially unfounded suspicion) Maybe Schematron suffers from being 
too powerful and too slow, or maybe it doesn't.  This is the one I'm not 
sure of at all, and I want to learn more before casting aspersions. 
Certainly a generalized XSLT implementation is way slower than I want 
while processing hundreds or thousands of SOAP messages per second.  I 
have skimmed the Schematron spec, but the problem with least power 
arguments is that even one small feature off in the corner can make you 
Turing complete or otherwise difficult to reason about.  I don't know 
Schematron well enough to be sure. 

Thanks.

Noah

[1] http://en.wikipedia.org/wiki/Turing_complete
[2] http://en.wikipedia.org/wiki/Halting_problem


--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Tuesday, 13 December 2005 01:55:43 GMT
