XML Query algebra

Paul Cotton, Microsoft Canada 
17 Eleanor Drive, Nepean, Ontario K2E 6A3 
Tel: (613) 225-5445 Fax: (425) 936-7329 
<mailto:pcotton@microsoft.com> 



-----Original Message-----
From: Murali Mani [mailto:mani@CS.UCLA.EDU]
Sent: Tuesday, December 05, 2000 2:36 PM
To: Paul Cotton; chamberl@almaden.ibm.com; mff@research.att.com;
danaF@crossgain.com; petsa@us.ibm.com; mmani@almaden.ibm.com
Cc: Murali Mani
Subject: XML Query algebra



I was studying a few aspects of XML Query Algebra. I would like to list
the four items for the perusal of the Query WG members.
I try not to use terms like "useful", "natural" etc, which is left for
more experienced people.
I would also like to mention that all these are my personal opinions, and
does not reflect anyone else's opinions in any way.

1. Order definition in the query data model - Order among child nodes,
document order, as well as order among nodes referenced using IDREFs are
all "useful". Order among IDREFs nodes cannot be captured by document
order. I believe there are numerous solutions possible, one possible
solution is define "queryable properties" for each element in the
document, and these properties will have the necessary order definitions.

2. Algebra: Should result of a path expression contain duplicate nodes? -
The present XML Query Algebra decomposes a path expression into multiple
"steps" and executes each step separately. There is no duplicate
elimination at the end. Therefore from where I stand, I beleve there will
be duplicates in the result we obtain. I am against this, and feel that
there should not be duplicates. There could be several reasons, I outline
a few I could think of:
a) If duplicates are desired, one can easily write multiple path
expressions.
b) If we write a path expression to determine whether a data definition is
cyclic (say part-subpart, parent-ancestor models), then the recursion will
not terminate in case of a cycle.
c) It conflicts with "simple" mathematical techniques like visit every
node, and determine whether that node satisfies the path expression.

3. No regular expression operators in the path expressions in algebra and
QL.
I believe presence of * will definitely help in recursion in case of a
recursive data definition such as part-subpart or parent-ancestor.

The fourth point is clouded with some uncertainty, and this could be why I
need so much space to explain one thing.

4. Absence of tree pattern match in the algebra, and QL.
After some deliberation, I have come to the following tentative definition
for tree pattern match:
TreePatternMatch (Pred1, Pred2, ... Predn) is one where Predi can be
matched by 0 or more nodes, say (Nodei1, Nodei2, ..., Nodein_i). Now tree
pattern match requires that (Nodelm \neq Nodepq) for all l, m, p, q, where
... (eliminate trivial case).

Why do we suddenly need this feature unheard of in the DB community
before?
1. Suppose we have a type definition such as book, and an instance of this
b1 --
  a) RDBs allow type definitions for book as book -> (title, author,
publisher)
  b) XML allows type definitions as book -> (title, author*, publisher)

Now in RDBs, b1.author (or anything) refers to one value, but in XML
b1.author refers to multiple values.

Now there can be two scenarios in XML in a query like
b1 [author = "" and author = ""] --
  The user might or might not insist that the two authors be distinct. The
scenario where the two authors are distinct is what is assisted by tree
pattern match as I define.

I have heard oppositions that it cannot be expressed etc, I think these
are slightly lame, because we can always use something like what I used
above.

I believe that a decision on the "usefulness" or "importance" of tree
pattern match should be taken after some more careful study. From where I
stand, it appears that it has not been studied at all by the members of
the Query WG I know. (I also say with sincerity that I am also not
familiar with the literature in tree pattern match, which I believe has
quite a bit of strong math behind it.)

sample references are:
a) caterpillar automata by Anne, in the Jl - Markup Technologies.
b) a Ph.D thesis referred to in the above paper.

also, I believe that 1, 2 and 4 are not evolutionary features, though 3 I
consider it to be evolutionary.

regards - murali.

Received on Tuesday, 5 December 2000 15:39:27 UTC