Summary of DISTINCT issue

This is pursuant to LeeF's request to summarize DISTINCT/LOOSE/ALL.
There should be no contentious opinions in this message.


The commenter who started this thread cited modifer-limit [ML]. I
am using a variant here:

Data:
  @prefix : <http://example.org/ns#> .
  :x :num 1 .
  :x :num 2 .
  :y :num 1 .
  :z :num 1 .

Query:
  PREFIX : <http://example.org/ns#>
  SELECT ?num
  WHERE { [] :num ?num }
  ORDER BY ASC(?num) <some DISTINCTion> ... LIMIT 3

Results vary by DISTINCTness semantics:
 ALL   DISTINCT          LOOSE            CHOOSE
 num     num       num    num    num    num    num
 "1"     "1"       "1" or "1" or "1"    "1" or "1"
 "1"     "2"       "1"    "1"    "2"    "1"    "2"
 "1"               "1"    "2"           "1"       

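To make the table above easy to re-derive, here is a rough Python sketch
that enumerates the permitted answers for the example data. It assumes my
reading of the table: LOOSE may return each value anywhere between once
and as often as ALL would, and CHOOSE returns either the ALL answer or the
DISTINCT answer. The function names are made up for illustration.

```python
from itertools import product

# Values bound to ?num by the example data, one per matching triple.
solutions = [1, 2, 1, 1]
LIMIT = 3

def all_rows(values):
    """ALL: keep every solution, ordered ascending."""
    return sorted(values)

def distinct_rows(values):
    """DISTINCT: keep exactly one copy of each value, ordered ascending."""
    return sorted(set(values))

def loose_rows(values):
    """LOOSE (assumed reading): each value appears 1..n times, n = its ALL count."""
    counts = {v: values.count(v) for v in set(values)}
    keys = sorted(counts)
    outcomes = []
    for picks in product(*(range(1, counts[k] + 1) for k in keys)):
        outcomes.append([k for k, n in zip(keys, picks) for _ in range(n)])
    return outcomes

def choose_rows(values):
    """CHOOSE (assumed reading): either the ALL answer or the DISTINCT answer."""
    return [all_rows(values), distinct_rows(values)]

def sliced(outcomes, limit):
    """Apply LIMIT to each candidate outcome and drop repeated candidates."""
    seen = []
    for rows in outcomes:
        head = rows[:limit]
        if head not in seen:
            seen.append(head)
    return seen

all_result = all_rows(solutions)[:LIMIT]
distinct_result = distinct_rows(solutions)[:LIMIT]
loose_results = sliced(loose_rows(solutions), LIMIT)
choose_results = sliced(choose_rows(solutions), LIMIT)
```

Under those assumptions the sketch yields one answer each for ALL and
DISTINCT, three permitted answers for LOOSE and two for CHOOSE, matching
the columns of the table.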

 ALL:
  most like default SQL semantics (though unlike SQL UNION).
  - most verbose/computationally exhaustive.
  + allows post-processing aggregates.
  + encourages implementors to be ready for aggregate additions to SPARQL.
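
A small illustration of the post-processing point, as a sketch only: a
client that receives the ALL answer (taken before LIMIT) can count
occurrences itself, while the DISTINCT answer has already lost the
multiplicities. The variable names are hypothetical.

```python
from collections import Counter

# Answers for the example data before LIMIT is applied.
rows_under_all = [1, 1, 1, 2]
rows_under_distinct = [1, 2]

# Client-side aggregate: how many solutions carried each ?num value.
counts_from_all = Counter(rows_under_all)
counts_from_distinct = Counter(rows_under_distinct)
```

Here counts_from_all recovers that 1 occurred three times, whereas
counts_from_distinct can only ever report one occurrence per value.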

 DISTINCT:
  very much like SQL DISTINCT.
  + least verbose.
  + clear semantics.
  - requires hashing of sent values (less added cost when used with ORDER BY).
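
The hashing cost can be pictured with the following sketch. It is not
taken from any implementation; it assumes that an ordering over all the
returned variables puts duplicates next to each other, so that only the
previous row needs to be remembered. Both function names are invented.

```python
def distinct_hashed(rows):
    """Unordered input: remember every row already sent."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

def distinct_sorted(rows):
    """Sorted input: duplicates are adjacent, so compare with the previous row only."""
    previous = object()  # sentinel that equals no real row
    for row in rows:
        if row != previous:
            yield row
        previous = row

unordered = [1, 2, 1, 1]
from_hash = list(distinct_hashed(unordered))
from_sort = list(distinct_sorted(sorted(unordered)))
```

The first function's memory grows with the number of distinct rows; the
second holds a single row but is only correct when its input is sorted.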

 LOOSE:
  + good enough for most queries.
  + optimizable for hashing/transmission tradeoffs.
  - contributes to non-portability of queries with slices.
  - could surprise SQL-heads.
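
One way to picture the hashing/transmission tradeoff, purely as an
assumed strategy and not something any implementor has proposed here: the
server remembers only the last few rows it sent, drops a row when it is
still remembered, and otherwise transmits it. The name loose_dedup and
its memory parameter are hypothetical.

```python
from collections import OrderedDict

def loose_dedup(rows, memory=2):
    """Drop a row only if it is among the `memory` most recently remembered rows."""
    recent = OrderedDict()
    for row in rows:
        if row in recent:
            recent.move_to_end(row)  # refresh it; the duplicate is not sent
            continue
        recent[row] = True
        if len(recent) > memory:
            recent.popitem(last=False)  # forget the least recently seen row
        yield row

matches = [1, 2, 1, 1]
no_memory = list(loose_dedup(matches, memory=0))
small_memory = list(loose_dedup(matches, memory=1))
enough_memory = list(loose_dedup(matches, memory=2))
```

With no memory this degenerates to ALL, with enough memory it behaves
like DISTINCT, and in between it sends some duplicates but not all, which
is also why a LIMIT applied afterwards can differ between servers.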

 CHOOSE:
  + slightly clearer semantics than LOOSE
  + slightly more testable than LOOSE?

We need to decide some or all of:
  Which of these do we offer?
  What are the default semantics?
  What would be good keywords for LOOSE and CHOOSE?


The proposals from the meeting *:
    default  keywords
1      ALL   DISTINCT        +1*(AndyS, ericP) -1*(SimonR)=+1
2      ALL   DISTINCT, LOOSE +1*(Souri, ericP) -1*(SimonR) +.9*(SteveH) +.1*(PatH)=+2
3    LOOSE   DISTINCT        +.5*(SimonR) -1*(ericP)=-.5
4    LOOSE   DISTINCT, ALL   +1*(SteveH, ericP)=+2
5 DISTINCT                   +1*(SimonR) -1*(SteveH, ericP)=-1
* taking PatH's "mild preference" as +.1
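
For anyone who wants to re-check the totals, the tallies above can be
recomputed with a short sketch like this one. The weights are copied from
the table (PatH at .1 per the footnote); the dictionary layout is just a
convenience of mine.

```python
from fractions import Fraction

# proposal number -> list of (voter, weight) taken from the table above
votes = {
    1: [("AndyS", "1"), ("ericP", "1"), ("SimonR", "-1")],
    2: [("Souri", "1"), ("ericP", "1"), ("SimonR", "-1"),
        ("SteveH", "0.9"), ("PatH", "0.1")],
    3: [("SimonR", "0.5"), ("ericP", "-1")],
    4: [("SteveH", "1"), ("ericP", "1")],
    5: [("SimonR", "1"), ("SteveH", "-1"), ("ericP", "-1")],
}

# Exact fractions avoid floating-point noise in the .9 + .1 sum.
totals = {
    proposal: sum(Fraction(weight) for _, weight in ballots)
    for proposal, ballots in votes.items()
}
```

The resulting totals agree with the right-hand column of the table.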


Currently, the domain for DISTINCTness is all the returned variables
(a subset of those mentioned in the query pattern). SimonR is
looking for a motivating use case where the domain is a different
set.
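
To show mechanically what a different domain would mean (this is not a
motivating use case, only the mechanics), the sketch below de-duplicates
on one set of variables while returning another. It assumes the blank
node in the example query were replaced by a variable, called s here; the
function distinct_on is hypothetical.

```python
# Bindings for the example data if the subject were bound to a variable s.
bindings = [
    {"s": "x", "num": 1},
    {"s": "x", "num": 2},
    {"s": "y", "num": 1},
    {"s": "z", "num": 1},
]

def distinct_on(rows, domain, returned):
    """Keep the first row for each combination of `domain`; output only `returned`."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[v] for v in domain)
        if key not in seen:
            seen.add(key)
            out.append(tuple(row[v] for v in returned))
    return out

current_domain = distinct_on(bindings, ["num"], ["num"])     # domain = returned variables
pattern_domain = distinct_on(bindings, ["s", "num"], ["num"])  # domain = all pattern variables
```

With the current domain the answer is the DISTINCT one; with the wider
domain every row survives and the answer looks like ALL.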


[ML] http://www.w3.org/2001/sw/DataAccess/tests/#modifer-limit
-- 
-eric

office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +1.857.222.5741

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Sunday, 11 March 2007 18:45:45 UTC