Re: [xml-dev] The use of XML syntax in XML Query from Joe English on 2002-01-04 (www-xml-query-comments@w3.org from January 2002)

From: Joe English <jenglish@flightlab.com>
Date: Thu, 03 Jan 2002 17:28:59 -0800
To: "Champion, Mike" <Mike.Champion@SoftwareAG-USA.com>, www-xml-query-comments@w3.org, xml-dev@lists.xml.org
Message-Id: <200201040128.RAA30467@dragon.flightlab.com>

Dare Obasanjo wrote:

> I'm still waiting for someone with a theoretical background to debunk the
> statements in the paper "On Database Theory and XML"[0] that it is is an
> unsolvable problem to guarantee that one can create a query that for any
> given XML input will generate output that conforms to a specified XML schema.
>
> [0] http://www.cs.washington.edu/homes/suciu/files/_F2066943700.ps

That paper has an interesting example.  Paraphrasing some,
given the input DTD:

	<!ELEMENT root (elm*)>

the query:

	element result {
	    for $x in //elm RETURN element a { $x/text() },
	    for $x in //elm RETURN element b { $x/text() },
	    for $x in //elm RETURN element c { $x/text() }
	}

and the output DTD:

	<!ELEMENT result (
		( (a,a)*, (b,b)*, (c,c)* )
	    |   ( a,(a,a)*, b,(b,b)*, c,(c,c)* ) ) >

(which says that the number of a's, b's, and c's all have
the same parity, i.e., either an even number of each or an
odd number of each), then the query will always produce a
valid output given a valid input.  However, no type system
based on regular grammars will be able to deduce this fact.
The best a type inferencing algorithm will be able to come
up with given the input DTD and query is:

	<!ELEMENT result (a*, b*, c*)>

which is a proper superset of the output DTD.  The *real*
language generated by the query is

	element result { a*n, b*n, c*n : n >= 0 }

(where "a*n" means, roughly, <element name="a" minOccurs="n"
maxOccurs="n"/>), which is a proper subset of the output DTD
(so the query is type-safe) but this language is not expressible
in a regular grammar.

Basically this means that any type inferencing algorithm
for XQuery (or an XQuery-like language) based on DTDs (or a
DTD-like type system) cannot be complete: it will always
reject some type-safe queries.

That doesn't mean that a *sound* type inferencing algorithm
isn't possible, i.e., one that produces no false positives
but may answer "no", "maybe", or "I don't know" on some
correct inputs.  Personally I don't think this limitation
is much of a problem in practice: ML and Haskell have the
same property and their type systems are still tremendously
useful.

The astute reader will note that the example output DTD is not
deterministic.  It's not clear whether requiring deterministic
content models helps the problem any; my guess is that it doesn't.

As for Suciu's claim that "When one adds joins, typechecking
becomes undecidable" [p. 5]:  I'm not quite up to debunking it,
but it sounds bogus to me; regular languages are closed under
cartesian product and homomorphisms, which is basically what
a join gives you.

--Joe English

  jenglish@flightlab.com

Received on Thursday, 3 January 2002 20:29:44 UTC