Re: Why skolemization? from Sandro Hawke on 2011-03-27 (semantic-web@w3.org from March 2011)

From: Sandro Hawke <sandro@w3.org>
Date: Sat, 26 Mar 2011 20:23:17 -0400
To: Steve Harris <steve.harris@garlik.com>
Cc: semantic-web@w3.org
Message-ID: <1301185397.3138.3035.camel@waldron>
On Sat, 2011-03-26 at 23:20 +0000, Steve Harris wrote:
> On 2011-03-26, at 17:07, Nathan wrote:
> 
> > Nathan wrote:
> >> Sandro Hawke wrote:
> >>> Skolemization.
> >> Sorry, but can somebody clarify why we, or RDF, needs Skolemization? Is this to cover a data management problem particular to a certain way of storing RDF data?
> > 
> > Just to save a little bit of time, I do understand the problem to some extent (although I'd like it spelled out clearly and people to agree with the problem statement), my primary concerns/thoughts are:
> > 
> > 1) Forcing a solution which doesn't apply across the board, for example to those using flat file storage (human managed), or saving in object tree structures (no bnode identifiers typically, just anonymous nodes/objects). As in, introducing at RDF level to cover a problem which doesn't exist at RDF level.
> 
> Personally I don't really care about making skolemisation mandatory, there might be cases where it's not necessary, or desirable. But e.g. in a triplestore you often end up being forced to do it, for practical reasons, but it's not sanctioned by any RDF specs.
> 
> > 2) Temporal nuances, let's say a bnode is skolemized/named at time1 as "xyz", then removed at time2, then at time4 a new bnode is skolemized/named with "xyz", somebody who only has a serialization from time1 and time4 would consider them the same. (hope I explained that properly)
> 
> I would argue that's as much an error as reusing a URI to represent two different things. The internal bNode skolemisation functions in 4store and 5store don't repeat on any sensible human timescale: they're rolling 62-bit integers internally. It would take over 100 thousand years for one to repeat, even if you minted a million new ones a second.
> 
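The rollover estimate quoted above can be checked with a little arithmetic. This is only a sketch: the function name and the 365.25-day year are assumptions made for illustration, not anything taken from 4store or 5store.

```python
# Rough check of the quoted claim: a 62-bit counter, a million new IDs per second.
# The helper name and the 365.25-day year are illustrative assumptions.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_until_rollover(bits: int, ids_per_second: float) -> float:
    """Years needed to exhaust a counter of `bits` bits at a steady minting rate."""
    return (2 ** bits) / ids_per_second / SECONDS_PER_YEAR


years = years_until_rollover(62, 1_000_000)
print(f"{years:,.0f} years")  # comfortably more than 100 thousand
```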
> > 3) If using any form of URI, then this is still an RDF URI Reference, so why not just use normal (say http) URIs.
> 
> When modelling data you don't always want to assign an HTTP URI to everything.
> 
> e.g. in our application we have to represent every occurrence of stolen credit card information in criminal trading environments, so we have lots of data like [note: not real data, and this isn't the ontology we use]:
> 
> [
>    dc:source <irc://badguys.example/#carders> ;
>    dc:date "2011-03-26T03:23:04Z"^^xsd:dateTime ;
>    garlik:cardno "1234567890123456" ;
>    garlik:ccv2 "1234" ;
>    garlik:postalAddress [
>    ...
>    ]
> ]
> 
> we generate millions of these a month, and they only exist for a short time (for economic and legal reasons we have to delete the structures after some time), so it's pretty undesirable to mint URIs for them all, it's much easier to let the store assign unique IDs for them.
> 
> But, if you do a query like:
> 
> SELECT ?x 
> WHERE { ?x garlik:cardno "1234567890123456" ;
>            dc:source <irc://badguys.example/#carders> }
> 
> You get a bunch of bNodes in the response, which is pretty useless. You can repeat the query pattern to get (probably) the same bindings back, but there's no guarantee that you'll get the same set, and it's a bit of a waste of effort in both client and server.
> 
> As it happens the bNode label you get back can be used to back-generate the skolem value, and transform that into a URI scheme, but it requires store-specific knowledge, and it's not legal by any RDF spec.

This is a really strong use case for Skolemizing, indeed.  When you get
SPARQL results with a bnode in them, yeah, that's when it's clear that
bnodes are a problem.  Thanks.
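To make the problem concrete, here is a minimal sketch of what a store-side fix could look like: rewriting bnode bindings in a SPARQL JSON result set into URIs a client can send back later. The prefix, the function name, and above all the assumption that the bnode label is a stable store-internal ID are hypothetical; in plain SPARQL a bnode label is only meaningful inside the one result set it came from.

```python
# Sketch only: the prefix and function name are hypothetical, and the code
# ASSUMES the store's bnode labels are stable internal IDs (not true in general).
SKOLEM_PREFIX = "http://store.example/=rdfgensym=/"


def skolemize_results(results: dict, prefix: str = SKOLEM_PREFIX) -> dict:
    """Return a copy of a SPARQL JSON result set with bnodes replaced by URIs."""
    new_bindings = []
    for row in results["results"]["bindings"]:
        new_row = {}
        for var, term in row.items():
            if term["type"] == "bnode":
                # Turn the (assumed stable) label into a dereferenceable name.
                new_row[var] = {"type": "uri", "value": prefix + term["value"]}
            else:
                new_row[var] = dict(term)
        new_bindings.append(new_row)
    return {"head": dict(results["head"]), "results": {"bindings": new_bindings}}


sample = {
    "head": {"vars": ["x"]},
    "results": {"bindings": [{"x": {"type": "bnode", "value": "b1f3a"}}]},
}
print(skolemize_results(sample)["results"]["bindings"][0]["x"]["value"])
```

With something like this in place, the ?x values from the query above could be used directly in a follow-up query instead of repeating the whole pattern.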

But I don't understand your aversion to HTTP URIs for Skolem constants.
You suggest that they wouldn't be appropriate because (a) there are lots
of them, (b) they are short lived.  But, what's wrong with using URLs
like this?

        http://garlik.com/=rdfgensym=/6135eb5943eaed2
        
That suffix fits in 64 bits, and if you want, you can recognize the prefix
and turn it back into a 64-bit value on input, for some special
indexing.   What's great about it is that you can pass it on to systems
which don't know about your particular SPARQL endpoint and they can find
all the data again.  Assuming they have permission.  And assuming it
hasn't been purged from the system, for legal reasons or whatever.  And
if it has, you can make the URL 404, or even give some helpful error
information.
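As a rough illustration of the round trip, the sketch below mints such a URL from an internal integer and recovers the integer when the prefix is recognized. The function names and the plain lowercase-hex encoding are assumptions made for the example; only the prefix and the sample URL come from this message.

```python
import string
from typing import Optional

# Prefix taken from the example URL above; everything else is a hypothetical sketch.
PREFIX = "http://garlik.com/=rdfgensym=/"


def mint_skolem_url(internal_id: int) -> str:
    """Encode a store-internal ID (assumed to fit in 64 bits) as a Skolem URL."""
    if not 0 <= internal_id < 2 ** 64:
        raise ValueError("internal ID must fit in 64 bits")
    return f"{PREFIX}{internal_id:x}"


def parse_skolem_url(url: str) -> Optional[int]:
    """Return the internal ID if `url` is one of ours, else None."""
    if not url.startswith(PREFIX):
        return None
    suffix = url[len(PREFIX):]
    # Accept only a non-empty run of hex digits.
    if not suffix or any(c not in string.hexdigits for c in suffix):
        return None
    value = int(suffix, 16)
    return value if value < 2 ** 64 else None


url = "http://garlik.com/=rdfgensym=/6135eb5943eaed2"
internal_id = parse_skolem_url(url)
print(internal_id is not None and mint_skolem_url(internal_id) == url)
```

Any URL that does not carry the prefix is simply treated as an ordinary URI, so other systems can pass these names around without knowing about the encoding.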

It sounds like a pretty good design to me.

      -- Sandro




> > 4) RDF either needs blank nodes, or not, if it does, then blank node identifiers are either needed in serializations or not, and then on the next level we have management of data which includes blank nodes - it would be nice if each of the three levels were cleanly separated and agreements made with respect to each. (general application of separation of concerns to this discussion).
> 
> RDF needs a way to mint one-time unique identifiers (a la AUTO_INCREMENT columns in RDBMSs), but they don't often need to be existential variables. It was a pretty odd decision to define bNodes that way, IMHO.
> 
> > 5) If one were to look at how we name things in RDF, starting from scratch, what would be the "perfect" approach? perhaps identifying this, then seeing if it can be used, or working out steps towards, or incorporating what was learned, would be beneficial. For example I've long thought that names as pairs ( namespace, localname ) would perhaps be an improvement, I'm not suggesting this, but perhaps the ideal fix given a blank sheet of paper should be defined.
> 
> Something with a syntax similar to bNodes (i.e. disjoint with URIs and Literals), but which just instructed the consumer to mint a unique ID for it. This is what the majority of RDF parsers, and triplestores do internally, but then they have to jump through a load of hoops to unwind that on export, often to the annoyance of users, who might like to use the persistent internal ID to refer to it in the future.
> 
> There are lots of cases when trying to represent data with complex structures where you need to label/identify a sub-structure, but don't really want to give it a URI.
> 
> - Steve
>
Received on Sunday, 27 March 2011 00:23:29 UTC