- From: Steve Harris <steve.harris@garlik.com>
- Date: Sun, 16 Oct 2011 21:28:32 +0100
- To: Sandro Hawke <sandro@w3.org>
- Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
On 16 Oct 2011, at 19:26, Sandro Hawke wrote:

> On Sun, 2011-10-16 at 19:13 +0100, Steve Harris wrote:
>> On 15 Oct 2011, at 23:35, Sandro Hawke wrote:
>>
>>> On Sat, 2011-10-15 at 19:35 +0100, Andy Seaborne wrote:
>>>>
>>>> On 15/10/11 19:09, Jeremy Carroll wrote:
>>>>>
>>>>> I think both the Seq and the List constructs present technical issues. Basically it is because both present the possibility of 'bad' data and no clarity about what one should do in the face of it.
>>>>
>>>> +1
>>>>
>>>>> We can easily form ill-formed lists with rdf:first or rdf:rest either missing or multiple.
>>>>> We can easily form ill-formed sequences with duplicate or missing rdf:_2.
>>>>
>>>> although Seqs are very fragile and lists are merely fragile. The duplicate rdf:_2 by merging is really nasty.
>>>>
>>>>> The consumer of such ill-formed data is in a bind. And what's worse is that formally the ill-formed data is not ill-formed, it is just triples.
>>>>>
>>>>> We could label both with a health warning ...
>>>>
>>>> Sandro said that:
>>>>> I think Turtle makes RDF Collections seem quite nice, and hopefully that will quickly set the tone (perhaps with a little help from us) for APIs and SPARQL 1.2 (?) having nice list handling functions that are as efficient as native (non-destructive) list handling functions. (I hope some APIs do this already.)
>>>>
>>>> and the point about Turtle syntax, and the convenience of writing, is important.
>>>>
>>>> Jena has container and collection APIs to make working with containers and collections easier, but the details leak out if you can take a triple view.
>>>
>>> Thinking about it today, I wonder if we can define a "Simple List" (or "Proper List", or something) as a list that can be losslessly transmitted via Turtle's (...) syntax. That means its structure is b-nodes, with no extra arcs, etc. (Interestingly, it also means you can't include rdf:type rdf:List arcs....) Then, we encourage tools to read/write simple lists, and to work with them efficiently.
>>
>> To be honest, I think the RDF Collection structure is just a dead duck.
>>
>> It works OK for lists on the order of 10 items, but is pretty impractical for thousands, or tens of thousands.
>
> My claim is that in the typical or best cases, Simple Lists can be handled as efficiently as any general list structure, using your choice of singly-linked lists, doubly-linked lists, arrays, or possibly even something more complicated.

That sounds good…

> Basically, I'm suggesting systems only turn them into something visible as triples when someone really wants that, and in general, no one should want that very much. I think it will happen in toy systems and

Great, but how would you tell when people want to see the structure?

I don't think it's often useful, except when you're exporting data, and then I expect you'd much rather see it rendered (in Turtle at least) as (1 2 3 …). Which suggests that you shouldn't ever bother with the rdf:first / rdf:rest business.

> debugging systems, where people do ?p ?o queries; but in deployed apps, where the software is going to take some action based on particular predicates, it's not going to be querying for rdf:first or rdf:rest triples.

Handling this for the ?p case is going to be tricky. Not impossible, but tricky.

> Certainly this would be more work for triple stores, but I think it's probably a better option than coming up with something completely new.
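For concreteness, the structures being discussed look roughly like this in Turtle (the ex: names and data are invented for illustration): the (...) shorthand, the rdf:first / rdf:rest chain of b-nodes it abbreviates, and the kind of duplicate-membership damage that merging two descriptions of the same Seq can cause.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ex:  <http://example.org/> .

    # Turtle's list shorthand ...
    ex:alice ex:likes ( ex:bob ex:charlie ) .

    # ... abbreviates this rdf:first / rdf:rest chain of b-nodes:
    ex:alice ex:likes _:l1 .
    _:l1 rdf:first ex:bob ;
         rdf:rest  _:l2 .
    _:l2 rdf:first ex:charlie ;
         rdf:rest  rdf:nil .

    # The fragile Seq case: merging two graphs that each describe ex:seq
    # can leave two rdf:_2 values, so the "second" member is ambiguous.
    ex:seq a rdf:Seq ;
        rdf:_1 ex:bob ;
        rdf:_2 ex:charlie ;
        rdf:_2 ex:dave .

Nothing in RDF itself rules out the two rdf:_2 triples; as Jeremy says above, they are just triples.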
Sure, but if the mechanism is sufficiently complex to implement then a large number of systems won't implement it, and it's not really any better than inventing something new (not that I think we should do that).

>> I would be mildly opposed to anything which promoted its use further.
>>
>>> Simple lists have the advantage over Seq, I believe, that in the face of truth-preserving RDF operations (subsetting, merging, various sound inferences), they never produce wrong data. In the worst case, they no longer provide data -- the simple list is mangled in some way -- but it'll never just tell you the wrong thing. I think this is a big win.
>>
>> True, but detecting if a Collection is mangled is computationally expensive.
>
> I'm picturing Simple List handling being done using native list structures, which can't get mangled. So the only detection is when someone creates a Simple List *not* using provided list machinery, which should be a rare case.

OK.

- Steve

>>>> Ivan:
>>>>> But it is a bit of a problem that SPARQL 1.1 still does not cover list handling fully :-(
>>>>
>>>> SPARQL 1.2 will not solve anything, I'm afraid. SPARQL 1.1 Query has gone as far as it can, except maybe a little extra syntactic sugar with
>>>>
>>>>   { ?list rdf:rest*/rdf:first ?member }
>>>>
>>>> It's much better than handling Seqs.
>>>
>>> I'm trying to brainstorm ways to shoe-horn list handling into SPARQL. I don't know if there's any elegant way, but maybe there's a hack that's not too bad.
>>>
>>> One approach is to update the results format to allow lists of values where it currently allows single values, and then offer some way to signal you want Simple Lists to be returned as list values instead of b-nodes. One way to do that would be a LISTRESULT function that takes a simple list's starting bnode and returns something that the results format serializes as a list. Essentially, it's just a way of saying you want a list result here.
>>>
>>> So...
>>>
>>>   SELECT ?x ?y LISTRESULT(?z) WHERE ...
>>>
>>> would require ?z to be bound to a simple list and would pack the list elements into the result format in a manner specific to that results format (XML, JSON, etc).
>>>
>>> Other, normal list builtins like SUBLIST(list, startpos, stoppos) could be used to make sure the size of the returned list is manageable.
>>>
>>> Another approach, perhaps, would be some kind of dis-aggregator, a pair of builtins that work together to make a list appear like many different results:
>>>
>>> Data:
>>>   eg:Alice eg:likes ( eg:Bob eg:Charlie ).
>>>
>>> Query:
>>>   SELECT ?who ITERINDEX(?list) ITERVALUE(?list)
>>>   WHERE ?who eg:likes ?list.
>>>
>>> would return results:
>>>   eg:Alice 0 eg:Bob
>>>   eg:Alice 1 eg:Charlie
>>>
>>> although not necessarily in that order.
>>>
>>> That's pretty messy to spec and implement, but might be pretty nice to use.
>>>
>>>> SPARQL Update can manipulate lists, but it's ugly:
>>>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2011JanMar/0389.html
>>>>
>>>> The fundamental problem in SPARQL is that any order is lost; so this list access works for some cases, where the order does not matter.
>>>>
>>>> Even if a special order-preserving construct were available, order is lost in the rest of the query. An order-preserving QL would not be SPARQL 1.2 - it would have a completely different basis (e.g. no chance of implementing using hash joins), would be very hard to implement in parallel (see "big data" graph languages), and still does not work when two different ordered subresults need combining.
>>>>
>>>> Fundamentally, there are two problems:
>>>>
>>>> 1/ Encoding in triples
>>>> 2/ Lists aren't the only data structure.
>>>>
>>>> Reification, containers and collections encode data structure in triples, but if the app can see "triples" then this leaks through to the application. It also means there can be the possibility of 'bad' data, as Jeremy says. Seeing the triples is confusing at best.
>>>
>>> Yes, seeing the triples is a problem, but I'm hoping it's not that bad, and that mostly people pull what they want out of a graph and ignore the rest.
>>>
>>>> The structure we have may not say what you want:
>>>>   List(1 2 3) != List(3 2 1)
>>>> but if a list is being used to express an unordered collection, a higher-level convention has to be communicated.
>>>>
>>>> I think the only complete solution will involve putting structural literals into RDF itself, so they are not triple-encoded and can't be 'bad'. When treated as first-class literals with equality rules, accessors, and combining rules, then implementations can store them specially, provide good APIs, and application programmers won't have to learn about the encoding rules.
>>>
>>> That sounds pretty hard. Do you have some design in mind...?
>>>
>>> - Sandro
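To make Andy's rdf:rest*/rdf:first sugar concrete, here is roughly what the full SPARQL 1.1 query would look like against the invented ex: data from the Turtle sketch earlier in this message. It finds the members of the list, but, as Andy notes, the solutions carry no position information, so the list's order is lost.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ex:  <http://example.org/>

    # Walk the rdf:rest chain from the list head and pick off each rdf:first.
    # The result is a plain, unordered set of bindings for ?member.
    SELECT ?member
    WHERE {
      ex:alice ex:likes ?list .
      ?list rdf:rest*/rdf:first ?member .
    }

Positions could only be recovered by counting rdf:rest steps per member, which the property path itself does not report; hence proposals like ITERINDEX above.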
Received on Sunday, 16 October 2011 20:29:16 UTC