Re: Why skolemization? from Steve Harris on 2011-03-26 (semantic-web@w3.org from March 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Sat, 26 Mar 2011 23:20:38 +0000
To: nathan@webr3.org
Cc: Sandro Hawke <sandro@w3.org>, semantic-web@w3.org
Message-Id: <32AEA461-9184-46FF-AEAD-C36CBAC9C19B@garlik.com>

On 2011-03-26, at 17:07, Nathan wrote:

> Nathan wrote:
>> Sandro Hawke wrote:
>>> Skolemization.
>> Sorry, but can somebody clarify why we, or RDF, needs Skolemization? Is this to cover a data management problem particular to a certain way of storing RDF data?
> 
> Just to save a little bit of time, I do understand the problem to some extent (although I'd like it spelled out clearly and people to agree with the problem statement), my primary concerns/thoughts are:
> 
> 1) Forcing a solution which doesn't apply across the board, for example to those using flat file storage (human managed), or saving in object tree structures (no bnode identifiers typically, just anonymous nodes/objects). As in, introducing at RDF level to cover a problem which doesn't exist at RDF level.

Personally I don't really care about making skolemisation mandatory; there might be cases where it's not necessary, or not desirable. But in a triplestore, for example, you often end up being forced to do it for practical reasons, even though it's not sanctioned by any RDF spec.

> 2) Temporal nuances, let's say a bnode is skolemized/named at time1 as "xyz", then removed at time2, then at time4 a new bnode is skolemized/named with "xyz", somebody who only has a serialization from time1 and time4 would consider them the same. (hope I explained that properly)

I would argue that's as much an error as reusing a URI to represent two different things. The internal bNode skolemisation functions in 4store and 5store don't repeat on any sensible human timescale: they're rolling 62-bit integers internally. It would take over 100 thousand years for one to repeat, even if you minted a million new ones a second.
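
As a rough check on that rollover figure, here is a minimal sketch of the arithmetic. The 62-bit width and the rate of a million IDs per second come from the paragraph above; the 365.25-day year and the function name are assumptions made only for illustration.

```python
# Rough check of the rollover estimate: a 62-bit counter consumed at
# one million new identifiers per second (both figures from the text above).
COUNTER_BITS = 62
MINT_RATE_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: Julian year


def years_until_repeat(bits=COUNTER_BITS, rate=MINT_RATE_PER_SECOND):
    """Years before a counter of `bits` bits wraps at `rate` IDs per second."""
    return (2 ** bits) / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{years_until_repeat():,.0f} years")
```

Under those assumptions the counter takes roughly 146 thousand years to wrap, which is consistent with the "over 100 thousand years" estimate.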

> 3) If using any form of URI, then this is still an RDF URI Reference, so why not just use normal (say http) URIs.

When modelling data you don't always want to assign an HTTP URI to everything.

For example, in our application we have to represent every occurrence of stolen credit card information in criminal trading environments, so we have lots of data like this [note: not real data, and this isn't the ontology we use]:

[
   dc:source <irc://badguys.example/#carders> ;
   dc:date "2011-03-26T03:23:04Z"^^xsd:dateTime ;
   garlik:cardno "1234567890123456" ;
   garlik:ccv2 "1234" ;
   garlik:postalAddress [
   ...
   ]
]

We generate millions of these a month, and they only exist for a short time (for economic and legal reasons we have to delete the structures after some time), so it's pretty undesirable to mint URIs for them all. It's much easier to let the store assign unique IDs for them.

But, if you do a query like:

SELECT ?x 
WHERE { ?x garlik:cardno "1234567890123456" ;
           dc:source <irc://badguys.example/#carders> }

You get a bunch of bNodes in the response, which is pretty useless because you can't refer to them in a later query. You can repeat the query pattern to get (probably) the same bindings back, but there's no guarantee that you'll get the same set, and it's a bit of a waste of effort in both client and server.

As it happens, the bNode label you get back can be used to back-generate the skolem value, which can then be transformed into a URI, but that requires store-specific knowledge, and it's not legal by any RDF spec.
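
To make that round trip concrete, here is a purely hypothetical sketch. The label format (`_:b` followed by a hex integer), the base URI, and both function names are invented for illustration; they are not the format or scheme used by 4store, 5store, or any other real store.

```python
# Hypothetical sketch only: NOT the label format or URI scheme of any real store.
SKOLEM_BASE = "http://store.example/skolem/"  # assumed base URI
LABEL_PREFIX = "_:b"                          # assumed bNode label prefix


def label_to_uri(bnode_label):
    """Turn a label such as '_:b1f4' into a skolem URI."""
    if not bnode_label.startswith(LABEL_PREFIX):
        raise ValueError(f"unrecognised bNode label: {bnode_label!r}")
    # Recover the store's internal integer ID from the hex part of the label.
    internal_id = int(bnode_label[len(LABEL_PREFIX):], 16)
    return f"{SKOLEM_BASE}{internal_id:016x}"


def uri_to_label(uri):
    """Map a skolem URI back to the bNode label it was generated from."""
    if not uri.startswith(SKOLEM_BASE):
        raise ValueError(f"not a skolem URI: {uri!r}")
    internal_id = int(uri[len(SKOLEM_BASE):], 16)
    return f"{LABEL_PREFIX}{internal_id:x}"
```

The point of the sketch is that the client has to know both the label format and the URI scheme, which is exactly the store-specific knowledge mentioned above.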

> 4) RDF either needs blank nodes, or not, if it does, then blank node identifiers are either needed in serializations or not, and then on the next level we have management of data which includes blank nodes - it would be nice if each of the three levels were cleanly separated and agreements made with respect to each. (general application of separation of concerns to this discussion).

RDF needs a way to mint one-time unique identifiers (a la AUTO_INCREMENT columns in RDBMSs), but they don't often need to be existential variables. It was a pretty odd decision to define bNodes that way, IMHO.

> 5) If one were to look at how we name things in RDF, starting from scratch, what would be the "perfect" approach? perhaps identifying this, then seeing if it can be used, or working out steps towards, or incorporating what was learned, would be beneficial. For example I've long thought that names as pairs ( namespace, localname ) would perhaps be an improvement, I'm not suggesting this, but perhaps the ideal fix given a blank sheet of paper should be defined.

Something with a syntax similar to bNodes (i.e. disjoint with URIs and Literals), but which just instructed the consumer to mint a unique ID for it. This is what the majority of RDF parsers and triplestores do internally, but then they have to jump through a load of hoops to unwind that on export, often to the annoyance of users, who might like to use the persistent internal ID to refer to it in the future.
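
The "mint a unique ID on import" behaviour can be sketched roughly as follows. This is a minimal illustration, not how any particular parser or triplestore does it; the class name, the method name, and the use of a simple in-memory counter are all assumptions.

```python
import itertools


class BNodeMinter:
    """Assigns a store-wide unique integer to each blank node label.

    Labels are scoped to the document they appear in, so the same label
    in two different documents is given two different IDs.
    """

    def __init__(self):
        self._counter = itertools.count(1)  # stand-in for a rolling integer
        self._assigned = {}                 # (document, label) -> minted ID

    def mint(self, document, label):
        """Return the ID for `label` in `document`, minting one if needed."""
        key = (document, label)
        if key not in self._assigned:
            self._assigned[key] = next(self._counter)
        return self._assigned[key]


if __name__ == "__main__":
    minter = BNodeMinter()
    first = minter.mint("cards-march.ttl", "_:card")
    again = minter.mint("cards-march.ttl", "_:card")
    other = minter.mint("cards-april.ttl", "_:card")
    print(first == again, first != other)
```

Exporting the minted ID directly, rather than unwinding it back into a document-scoped label, is what would let users keep referring to the same node later.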

There are lots of cases when trying to represent data with complex structures where you need to label/identify a sub-structure, but don't really want to give it a URI.

- Steve

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Saturday, 26 March 2011 23:21:19 UTC