RE: Joins with groupby (No example in W3 Draft) from Houman Khorasani on 2004-06-17 (www-ql@w3.org from April to June 2004)

From: Houman Khorasani <khorasani@web.de>
Date: Thu, 17 Jun 2004 00:38:44 -0500
To: 'Michael Kay' <mhk@mhk.me.uk>, www-ql@w3.org
Message-id: <01LBDUYS7DFOAFYHCA@uwplatt.edu>
Hello everyone,

Michael and Donald thank you for your response.

Well I tried very hard now in the third day in the row to realize the SQL
like 'Group by' in XQuery.  And I think it is not possible in my case.

Grouping in XQuery with a for and let clause is meant to be for functions
like avg(), sum(), round() etc.  These functions must have one value as a
return but in my case this won't work.


Let me please explain this with this example:

http://www.houmie.com/research/pic1.JPG

On this picture you can see two relational tables:  
wisc_berkeley_tracelist with two columns ip and rID

The other table is wisc_berkeley_tracetime with three columns dat, min and
rID.

As you can see rID is the join field between the two tables.


The SQL query is:

SELECT
	wisc_berkeley_tracetime.rID,
	ip,
	min
FROM
	wisc_berkeley_tracelist INNER JOIN wisc_berkeley_tracetime

	ON wisc_berkeley_tracelist.rID = wisc_berkeley_tracetime.rID
WHERE
	(wisc_berkeley_tracetime.dat = '20000801000000')



The same query in XQuery is:

for $wbtl in doc("wisc_berkeley_tracelist.xml")//row
for $wbtt in doc("wisc_berkeley_tracetime.xml")//row[dat = '20000801000000'
and rID = $wbtl//rID]
return
	<ROW> 
	{	
		($wbtl/ip, $wbtt/rID, $wbtt/min)
	}
	</ROW>


The result of the XQuery statement above is exactly the same as the SQL
statement:  http://www.houmie.com/research/saxonResult.xml

These are 275 records.

You can see the following behavior in the result table of the same SQL
statement: http://www.houmie.com/research/pic3.JPG

At the rID 662 and minute 79 you can see several different IPs.  At minute
168 the same IP's in the row show up again.  It is possible to group by the
IP's in SQL (GROUP by ip).  

What does it mean?  http://www.houmie.com/research/pic2.JPG

Every IP is now unique and show up just once in the result table.  

How does it work?  It shows the first unique IP and at what minute it shows
up.  E.g. the IP 128.32.0.109 shows up for the first time at minute 79.  The
ip 128.32.0.66 does also show up for the first time at minute 79 and so on.
The records are reduced from 275 to 30.  

I tried to solve it in XQuery in this way:
for $wbtl_ip in distinct-values(doc("wisc_berkeley_tracelist.xml")//ip)
for $wbtt in doc("wisc_berkeley_tracetime.xml")//row[dat = '20000801000000']
for $wbtl in doc("wisc_berkeley_tracelist.xml")//row[rID = $wbtt/rID and ip
= $wbtl_ip]

return
	<ROW> 
	{	
		($wbtl/ip, $wbtl/rID)
	}
	</ROW>

But the result is the same just slower to proceed.  I also tried also an
outer-join and many other different combinations.

The main problem lies in distinct-values().  This function can only work on
atomic-values but not on the nodes it self.  If I could make an
distinct-value() on the NODE where the ip is unique e.g. ->  (the following
statement is not possible but a fiction)

for $wbtl_ip in distinct-values(doc("wisc_berkeley_tracelist.xml")//row, ip)

I would have the right tuples with a unique ip and the appropriate rID which
can be used for an inner-join on the other table.  This would be more like a
real SQL join and SQL group by.  Maybe there is a better way to translate
SQL's 'Group by'.

Otherwise I argue the above query and similar SQL query statements are not
possible in XQuery.

I appreciate any comments.

Houman Khorasani
University of Wisconsin Platteville
Received on Thursday, 17 June 2004 01:38:52 UTC