- From: Joe English <jenglish@flightlab.com>
- Date: Thu, 03 Jan 2002 17:28:59 -0800
- To: "Champion, Mike" <Mike.Champion@SoftwareAG-USA.com>, www-xml-query-comments@w3.org, xml-dev@lists.xml.org
Dare Obasanjo wrote: > I'm still waiting for someone with a theoretical background to debunk the > statements in the paper "On Database Theory and XML"[0] that it is is an > unsolvable problem to guarantee that one can create a query that for any > given XML input will generate output that conforms to a specified XML schema. > > [0] http://www.cs.washington.edu/homes/suciu/files/_F2066943700.ps That paper has an interesting example. Paraphrasing some, given the input DTD: <!ELEMENT root (elm*)> the query: element result { for $x in //elm RETURN element a { $x/text() }, for $x in //elm RETURN element b { $x/text() }, for $x in //elm RETURN element c { $x/text() } } and the output DTD: <!ELEMENT result ( ( (a,a)*, (b,b)*, (c,c)* ) | ( a,(a,a)*, b,(b,b)*, c,(c,c)* ) ) > (which says that the number of a's, b's, and c's all have the same parity, i.e., either an even number of each or an odd number of each), then the query will always produce a valid output given a valid input. However, no type system based on regular grammars will be able to deduce this fact. The best a type inferencing algorithm will be able to come up with given the input DTD and query is: <!ELEMENT result (a*, b*, c*)> which is a proper superset of the output DTD. The *real* language generated by the query is element result { a*n, b*n, c*n : n >= 0 } (where "a*n" means, roughly, <element name="a" minOccurs="n" maxOccurs="n"/>), which is a proper subset of the output DTD (so the query is type-safe) but this language is not expressible in a regular grammar. Basically this means that any type inferencing algorithm for XQuery (or an XQuery-like language) based on DTDs (or a DTD-like type system) cannot be complete: it will always reject some type-safe queries. That doesn't mean that a *sound* type inferencing algorithm isn't possible, i.e., one that produces no false positives but may answer "no", "maybe", or "I don't know" on some correct inputs. Personally I don't think this limitation is much of a problem in practice: ML and Haskell have the same property and their type systems are still tremendously useful. The astute reader will note that the example output DTD is not deterministic. It's not clear whether requiring deterministic content models helps the problem any; my guess is that it doesn't. As for Suciu's claim that "When one adds joins, typechecking becomes undecidable" [p. 5]: I'm not quite up to debunking it, but it sounds bogus to me; regular languages are closed under cartesian product and homomorphisms, which is basically what a join gives you. --Joe English jenglish@flightlab.com
Received on Thursday, 3 January 2002 20:29:44 UTC