Re: Radio station metadata use case from Seaborne, Andy on 2005-04-19 (public-rdf-dawg@w3.org from April to June 2005)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Tue, 19 Apr 2005 12:23:14 +0100
To: Steve Harris <S.W.Harris@ecs.soton.ac.uk>
CC: DAWG public list <public-rdf-dawg@w3.org>
Message-ID: <4264EA22.8040300@hp.com>
Steve Harris wrote:
> On Mon, Apr 18, 2005 at 06:37:11PM +0100, Andy Seaborne wrote:
> 
>>
>>Steve Harris wrote:
>>
>>>Based on my experience of implementing the current editors' WD and helping
>>>build an application.
>>>
>>>The local student radio station (http://www.surgeradio.co.uk) uses RDF to
>>>describe its playlists, handle requests and so on. They use the
>>>Musicbrainz RDF (split into files per artist and disk to make applying
>>>updates more efficient) to talk about released CDs, and some locally
>>>created data (split same way) to talk about white label singles they have
>>>received through the post.
>>>
>>>If/when the white label stuff gets released they send it to musicbrainz
>>>and remove the local copy.
>>>
>>>All the data is "trusted" and so exists in the background graph, but it's
>>>also kept in named graphs to allow provenance queries (MB vs. local, who
>>>wrote it etc.) to be answered.
>>>
>>>My first thought about how to handle this case was to flag the graphs as
>>>being in the background/named/both graph sets which allows me to store
>>>this efficiently, but it makes queries too expensive, and in my current
>>>implementation at least bNodes get shared between the background and named
>>>graphs, which only matters in corner cases, but does change the
>>>semantics.
>>
>>Steve - I don't see what makes the query any more expensive.  Why does 
>>sorting quads (with or without a trusted flag) make this mapping fail:
>>
>>{ ?x ?y ?z } =>             (?x ?y ?z *)     with * for any
>>GRAPH <u> { ?x ?y ?z } =>   (?x ?y ?z <u>)
> 
> 
> This is not an implementation of the specification - it does not allow
> statements to exist in named graphs but not the background graph.

There are several different implementation technologies that people are using 
for SPARQL implementations; SQL, logic engines, rule engines, custom systems 
amongst others.  Not all implementations will be the same.

Some of these systems have a strong emphasis on trust - e.g. TriQL.P where the 
trust model isn't even fixed.  We have to cover a wide range of options for a 
wide range of uses.

When I added back the text into rq23 for FROM/FROM NAMED, I specifically did not 
put in anything about merging as we don't have agreement in that area.

http://lists.w3.org/Archives/Public/public-rdf-dawg/2005AprJun/0090.html

A dataset is a background graph and a number of named graphs.  How those graphs 
come about is outside the spec.

This quad style is an implementation of the specification and it makes 
assumptions outside SPARQL.  Not all implementations have to be the same - some 
will specialise in what they do well.  The design above is specialised to 
efficient implementation for the case of the background graph being the RDF 
merge of the named graphs.
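
A minimal sketch of that quad mapping, in Python rather than any real store's API. The `Quad` type and `match` function are hypothetical names for illustration, and it assumes every stored triple is trusted, so the background graph is exactly the RDF merge of the named graphs:

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Quad(NamedTuple):
    s: str
    p: str
    o: str
    g: str  # name of the graph the triple was loaded from


def match(quads: Iterable[Quad],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None,
          graph: Optional[str] = None) -> Iterator[Tuple[str, str, str]]:
    """Yield matching triples; None acts as a wildcard.

    graph=None plays the role of (?x ?y ?z *): any graph slot matches.
    graph=u plays the role of GRAPH <u> { ... }: only quads in u match.
    """
    seen = set()  # the merge is a set, so a triple is reported once
    for q in quads:
        if graph is not None and q.g != graph:
            continue
        if s is not None and q.s != s:
            continue
        if p is not None and q.p != p:
            continue
        if o is not None and q.o != o:
            continue
        triple = (q.s, q.p, q.o)
        if triple not in seen:
            seen.add(triple)
            yield triple
```

Under that assumption a triple cannot sit in a named graph without also showing up in the background graph, which is exactly the specialisation being described.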

That is not the only scenario for usage - if DAWG were to choose to force that 
implementation, it would be rejecting the requirements of other systems.

>  
> 
>>The bNodes is a free choice.  RDF does not say whether they are same or 
>>different across graphs.  More on this below.
> 
> 
> Uggh. That makes my head hurt. If it is a free choice in RDF it better not
> be in SPARQL otherwise we have the potential for some really confusing
> results when graphs are loaded into both the background and named graphs,
> eg. in lists:
> 
> SELECT ?cdr
> WHERE GRAPH <http://example.com/data.rdf> { :foo rdf:first ?car .  }
> 	?car rdf:rest ?cdr .
> 
> (with apologies if I've forgotten the rdf list syntax)

[[Oddly, I have an experimental grammar that loosens some restrictions and would 
parse that!]]

>  
> 
>>>My final implementation was a naive implementation of what's in the spec,
>>>as I understand it. I used a distinguished graph identifier to
>>>distinguish things in the background graph. I think assertion performance
>>>is bad, but I've not worked on it.
>>>
>>>However, using this implementation I then couldn't remove subsets of the
>>>background graph (eg. locally created graphs that are now redundant).
>>
>>If this (removing named graphs affecting the background graph) is a 
>>requirement, then the background graph must share bNodes (or keep a 
>>mapping) with the named graphs surely?  This seems to be true regardless of 
>>which scheme we are considering.
> 
> 
> Yes, the choice is whether the graph that is used for default answering is
> the same as the one that is used for GRAPH answering or not.
>  
> 
>>>The
>>>named part of the data can be removed easily, by using its graph
>>>identifier, but all triples in the background graph can't be distinguished
>>>in my implementation.
>>
>>Interesting - so if the schema separates the concerns for data management 
>>from the concerns for query then storing 5-tuples (you'd want to normalize 
>>as well):
>>
>>  (<s>  <p>  <o>  URI-or-null   original-named-graph)
>>
>>and doing data management based on slot 5, and query based on slot 4 might 
>>work. BNodes decision permitting.
>>
>>To separate bNodes, then insert a new 5-slot "triple" keeping the 
>>original-named-graph indicator so it can be mass-removed.
> 
> 
> This means we need to go beyond terms in SPARQL to do data management, which
> I don't want to do.

> 
>>>It would be possible to sub-identify the triples in the background graph in
>>>some way, but that identification can't be discovered from SPARQL which
>>>would make extending it to be INSERT/UPDATE in the future painful, and
>>>would complicate the data storage.
>>
>>Seems to me that data management and presented information aren't necessarily 
>>identical so using the same information is likely to lead to trouble 
>>somewhere.  This makes INSERT/UPDATE orthogonal to query.
> 
> 
> They're not necessarily identical, sure, but I would find it mighty
> surprising if they're not.

Since there are all sorts of issues about the management of merged graphs, then 
anything to do with data management must necessarily be outside SPARQL.
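
To illustrate keeping data management separate from query, here is a rough Python sketch of the 5-slot idea quoted above. The `Row` and `Store` names and the `trusted` flag are hypothetical; slot 4 is what a query sees (None standing for the background graph) and slot 5 records where the row came from, so it can be mass-removed:

```python
from typing import Iterable, NamedTuple, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class Row(NamedTuple):
    s: str
    p: str
    o: str
    query_graph: Optional[str]  # slot 4: graph URI, or None for background
    source_graph: str           # slot 5: original named graph (management only)


class Store:
    def __init__(self) -> None:
        self.rows: Set[Row] = set()

    def load(self, source_graph: str, triples: Iterable[Triple],
             trusted: bool = True) -> None:
        """Add triples as a named graph and, if trusted, to the background."""
        for s, p, o in triples:
            self.rows.add(Row(s, p, o, source_graph, source_graph))
            if trusted:
                # Separate background row that still remembers its origin.
                self.rows.add(Row(s, p, o, None, source_graph))

    def query(self, graph: Optional[str] = None) -> Set[Triple]:
        """Query looks only at slot 4."""
        return {(r.s, r.p, r.o) for r in self.rows if r.query_graph == graph}

    def remove_source(self, source_graph: str) -> None:
        """Data management looks only at slot 5."""
        self.rows = {r for r in self.rows if r.source_graph != source_graph}
```

Because each background row carries its own origin, a statement asserted by two sources survives the removal of one of them without an explicit assertion count. How bNodes are labelled across the slots is left open here.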

>  
> 
>>>Another option I considered was to keep a copy of the graph as asserted,
>>>and remove it when requested, but it gets a bit complicated as I have to
>>>keep a count on the number of times any particular statement has been
>>>asserted in the background graph, and I'm concerned about synchronisation
>>>issues.
>>>
>>>The design I posted earlier
>>>(http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JanMar/0440.html)
>>>turns out not to have this problem (though that wasn't what motivated the
>>>design). As all graphs are named the application can do management on
>>>data about individual disks in the background graph.
>>
>>I'm not clear anymore on this - is the distinguished named graph the RDF 
>>merge of some other graphs or not?  This seems to say it is not a copy so, 
>>with shared bNodes, it is the same as your first thought except there the 
>>distinguished graph has a hidden name (not visible to the query).
> 
> 
> There is no distinguished graph per-se (there was in my rq23
> implementation, but that's another issue).

And it is a valid implementation as far as I can see.  It just happens not to be 
the only possibility.

> All there is is a set of named
> graphs, of which a sub-set are used to match triple patterns that don't
> use the GRAPH keyword.

But what happens to bNodes?  3Store allocates global Ids to them (outside URI 
space but they are system-wide unique).  If you want unambiguous handling of 
BNodes, you have to say what happens before I can really answer.

e.g. a weird case:

SELECT ?name
WHERE { GRAPH <g1> { ?x foaf:mbox <mailto:alice@example.org> }
         GRAPH <g2> { ?x foaf:name ?name }
       }

Can smushing make that legal?
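
A small Python sketch of why the bNode decision matters for that query. It does not settle the question; it only contrasts two hypothetical labelling policies (store-wide bNode ids versus ids scoped to each graph), and it assumes both graphs were loaded with the same label for the node, as might happen after smushing on foaf:mbox. The `load` and `names_for_mbox` helpers are invented for illustration:

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]


def load(graph_name: str, triples: Iterable[Triple],
         shared_bnodes: bool) -> List[Quad]:
    """Turn triples into quads, optionally scoping '_:' labels per graph."""
    def term(t: str) -> str:
        if t.startswith("_:") and not shared_bnodes:
            return f"{t}@{graph_name}"  # same label, distinct node per graph
        return t
    return [(term(s), p, term(o), graph_name) for s, p, o in triples]


def names_for_mbox(quads: Iterable[Quad], mbox: str,
                   g1: str = "g1", g2: str = "g2") -> List[str]:
    """Join ?x across GRAPH g1 (foaf:mbox) and GRAPH g2 (foaf:name)."""
    quads = list(quads)
    xs = {s for s, p, o, g in quads
          if g == g1 and p == "foaf:mbox" and o == mbox}
    return sorted(o for s, p, o, g in quads
                  if g == g2 and p == "foaf:name" and s in xs)


G1 = [("_:a", "foaf:mbox", "mailto:alice@example.org")]
G2 = [("_:a", "foaf:name", "Alice")]

shared = load("g1", G1, True) + load("g2", G2, True)
scoped = load("g1", G1, False) + load("g2", G2, False)
```

With store-wide ids the join on ?x finds "Alice"; with per-graph scoping it finds nothing, so the same query over the same data gives different answers depending on a choice RDF itself leaves open.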

	Andy

> 
> - Steve
>
Received on Tuesday, 19 April 2005 11:25:23 UTC