Re: Why skolemization? from Sandro Hawke on 2011-03-27 (semantic-web@w3.org from March 2011)

From: Sandro Hawke <sandro@w3.org>
Date: Sun, 27 Mar 2011 08:47:31 -0400
To: Steve Harris <steve.harris@garlik.com>
Cc: semantic-web@w3.org
Message-ID: <1301230051.28102.38.camel@waldron>
On Sun, 2011-03-27 at 02:22 +0100, Steve Harris wrote:
> On 2011-03-27, at 00:23, Sandro Hawke wrote:
> > On Sat, 2011-03-26 at 23:20 +0000, Steve Harris wrote:
> >> On 2011-03-26, at 17:07, Nathan wrote:
> >> 
> >>> Nathan wrote:
> >>>> Sandro Hawke wrote:
> >>>>> Skolemization.
> >>>> Sorry, but can somebody clarify why we, or RDF, needs Skolemization? Is this to cover a data management problem particular to a certain way of storing RDF data?
> >>> 
> >>> Just to save a little bit of time, I do understand the problem to some extent (although I'd like it spelled out clearly and people to agree with the problem statement), my primary concerns/thoughts are:
> >>> 
> >>> 1) Forcing a solution which doesn't apply across the board, for example to those using flat file storage (human managed), or saving in object tree structures (no bnode identifiers typically, just anonymous nodes/objects). As in, introducing at RDF level to cover a problem which doesn't exist at RDF level.
> >> 
> >> Personally I don't really care about making skolemisation mandatory, there might be cases where it's not necessary, or desirable. But e.g. in a triplestore you often end up being forced to do it, for practical reasons, but it's not sanctioned by any RDF specs.
> >> 
> >>> 2) Temporal nuances, let's say a bnode is skolemized/named at time1 as "xyz", then removed at time2, then at time4 a new bnode is skolemized/named with "xyz", somebody who only has a serialization from time1 and time4 would consider them the same. (hope I explained that properly)
> >> 
> >> I would argue that's as much an error as reusing a URI to represent two different things. The internal bNode skolemisation functions in 4store and 5store don't repeat in a sensible human timescale - they're rolling 62bit integers internally. Over 100 thousand years for it to repeat, if you minted a million new ones a second. 
> >> 
> >>> 3) If using any form of URI, then this is still an RDF URI Reference, so why not just use normal (say http) URIs.
> >> 
> >> When modelling data you don't always want to assign a HTTP URI to everything.
> >> 
> >> e.g. in our application we have to represent every occurrence of stolen credit card information in criminal trading environments, so we have lots of data like [note: not real data, and this isn't the ontology we use]:
> >> 
> >> [
> >>   dc:source <irc://badguys.example/#carders> ;
> >>   dc:date "2011-03-26T03:23:04Z"^^xsd:dateTime ;
> >>   garlik:cardno "1234567890123456" ;
> >>   garlik:ccv2 "1234" ;
> >>   garlik:postalAddress [
> >>   ...
> >>   ]
> >> ]
> >> 
> >> we generate millions of these a month, and they only exist for a short time (for economic and legal reasons we have to delete the structures after some time), so it's pretty undesirable to mint URIs for them all, it's much easier to let the store assign unique IDs for them.
> >> 
> >> But, if you do a query like:
> >> 
> >> SELECT ?x 
> >> WHERE { ?x garlik:cardno "1234567890123456" ;
> >>           dc:source <irc://badguys.example/#carders> }
> >> 
> >> You get a bunch of bNodes in the response which is pretty useless. You can repeat the query pattern to get (probably) the same bindings back, but there's no guarantees that you'll get the same set, and it's a bit of a waste of effort in both client and server.
> >> 
> >> As it happens the bNode label you get back can be used to back-generate the skolem value, and transform that into a URI scheme, but it requires store-specific knowledge, and it's not legal by any RDF spec.
> > 
> > This is a really strong use case for Skolemizing, indeed.  When you get
> > SPARQL results with a bnode in them, yeah, that's when it's clear that
> > bnodes are a problem.  Thanks.
> > 
> > But I don't understand your aversion to HTTP URIs for Skolem constants.
> > You suggest that they wouldn't be appropriate because (a) there are lots
> > of them, (b) they are short lived.  But, what's wrong with using URLs
> > like this?
> > 
> >        http://garlik.com/=rdfgensym=/6135eb5943eaed2
> 
> Nothing at all in principle. I think there's an expectation that HTTP URIs should be long lived, Cool URIs and the like.
> 
> If you load the following document, delete it, then load it again:
> 
> _:x a :Thing .
> 
> You will end up generating two different skolem constants for the bNode, in Nstore at least. 

I mentioned this elsewhere in this thread as the most interesting/hard
technical problem here.  I think of it mostly as making bnodes scope to
the g-box.

If I say
        store.load("http://example.org/g1")
        store.load("http://example.org/g2")
and g1 and g2 happen to return the same g-text containing bnodes, maybe:
        _:x foaf:knows _:y
then yeah, we'll have to Skolemize them differently.

But if I say:
        store.load("http://example.org/g1")
and then repeat it:
        store.load("http://example.org/g1")
and I get the same g-text, I think it's appropriate for the store to
use the same Skolem constants.    If I get the same g-text with some
more g-text-code appended, I'd also like to treat it as the same.  If a
few triples are missing, I'd also like to treat it the same.   So, how
do you do this, and where do you draw the line?  I'm not sure yet.
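
A minimal sketch of the g-box scoping idea, in Python. It assumes the
blank nodes carry labels in the g-text and derives each Skolem constant
by hashing the g-box URL together with the label. The `skolemize`
helper, the choice of SHA-256, and the prefix layout are illustrative
assumptions, not what any particular store does.

```python
import hashlib

# Hypothetical prefix, modelled on the =rdfgensym= example URL in this thread.
SKOLEM_PREFIX = "http://example.org/=rdfgensym=/"

def skolemize(gbox_url: str, bnode_label: str) -> str:
    """Derive a Skolem URI from the g-box URL and a bnode label in its g-text."""
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
    key = f"{gbox_url}\0{bnode_label}".encode("utf-8")
    # Keep 16 hex digits (64 bits) of the digest as the suffix.
    return SKOLEM_PREFIX + hashlib.sha256(key).hexdigest()[:16]

g1_first = skolemize("http://example.org/g1", "x")
g1_again = skolemize("http://example.org/g1", "x")  # same g-box loaded again
g2_first = skolemize("http://example.org/g2", "x")  # same g-text, other g-box
```

Here `g1_first` and `g1_again` are identical, while `g2_first` differs
even though the label is the same. This only covers the labelled case;
it says nothing about g-text whose labels change between loads.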

Some ideas:

      * If the blank node is labeled in the g-text, as it would have to
        be in N-Triples, and might be in the other RDF serializations,
        then just use that labeling.   (But maybe we can be more
        aggressive than that - even if the label isn't the same, maybe we
        can treat it as the same?)
      * If it has the same arcs to non-blank nodes, treat it as the same
      * Find whatever labeling produces a minimal number of changes, in
        terms of adding & removing triples
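
The second idea above could be sketched roughly as follows, in Python.
This assumes triples are plain (subject, predicate, object) string
tuples and that blank nodes are the terms starting with "_:"; the
function names are made up for illustration. It only pairs blank nodes
whose arcs to non-blank nodes match exactly and unambiguously, so it
does not handle the "few triples missing" case.

```python
from collections import defaultdict

def is_bnode(term: str) -> bool:
    return term.startswith("_:")

def signatures(triples):
    """Map each blank node to the set of its arcs to or from non-blank nodes."""
    sigs = defaultdict(set)
    for s, p, o in triples:
        if is_bnode(s) and not is_bnode(o):
            sigs[s].add(("out", p, o))
        if is_bnode(o) and not is_bnode(s):
            sigs[o].add(("in", p, s))
    return {b: frozenset(arcs) for b, arcs in sigs.items()}

def by_signature(triples):
    """Group blank node labels by signature."""
    groups = defaultdict(list)
    for bnode, sig in signatures(triples).items():
        groups[sig].append(bnode)
    return groups

def match_bnodes(old_triples, new_triples):
    """Return {new_label: old_label} for signatures unique on both sides."""
    old, new = by_signature(old_triples), by_signature(new_triples)
    return {
        labels[0]: old[sig][0]
        for sig, labels in new.items()
        if len(labels) == 1 and len(old.get(sig, [])) == 1
    }

old_load = [
    ("_:a", "foaf:name", '"Alice"'),
    ("_:a", "foaf:knows", "_:b"),
    ("_:b", "foaf:name", '"Bob"'),
]
new_load = [
    ("_:n1", "foaf:name", '"Bob"'),
    ("_:n2", "foaf:name", '"Alice"'),
    ("_:n2", "foaf:knows", "_:n1"),
]
mapping = match_bnodes(old_load, new_load)
```

With these two loads, `mapping` pairs `_:n2` with `_:a` and `_:n1` with
`_:b`, so the store could reuse the earlier Skolem constants. Blank
nodes that share a signature, or that only connect to other blank
nodes, are left unmatched.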

I need to think about this more, unless someone already knows the
answer.

I guess one of the reasons to indicate which URIs are generated Skolem
constants is if this algorithm turns out to have significant failure
modes.

> That doesn't really sit well with HTTP URIs, for me. There's no technical issue, but if it was a different scheme you could set the expectation that the lifetime was just that of the enclosing document.
> 
> It will be difficult to enforce graph scope if it's just a HTTP scheme, as you may have no practical way to identify bNodes skolemised by other systems, I'm not yet bought into some magic substring that indicates skolemisation has taken place. That's maybe not an issue though, as it would cease to "be" a bNode once it was skolemised.
> 
> > That's a 64 bit suffix, and if you want, you can recognize the prefix
> > and turn it back into a 64-bit value on input, for some special
> > indexing.   What's great about it is that you can pass it on to systems
> > which don't know about your particular SPARQL endpoint and they can find
> > all the data again.  Assuming they have permission.  And assuming it
> > hasn't been purged from the system, for legal reasons or whatever.  And
> > if it has, you can make the URL 404, or even give some helpful error
> > information.
> 
> We'd probably put a store-specific UUID in there as well, e.g. http://bnode.4store.org/e19863a0-580b-11e0-b8af-0800200c9a66/12345678 - though that's a bit of an eyeful. We won't be offering http://bnode.4store.org/ as a public redirection service though :) the hosting bills would be sizeable. I guess we could let store operators specify a skolem URI prefix, so they could make it dereferenceable if it was possible for their data... maybe I'm coming round to the idea.
> 
> For practical reasons it's good if the store can identify bNodes that it minted itself, they can be compressed more effectively. That shouldn't have any bearing on the standard though, other than not ruling it out.
> 
> > It sounds like a pretty good design to me.
> 
> Me too. I currently have a mild preference for a distinct URI scheme, but I'll sleep on it. HTTP URI skolem constants would definitely be an improvement over what we've got now.
> 
> Perhaps systems which have no practical way to make the skolem constants dereferenceable could use one scheme, and ones which do, another?

Well, yeah, I figured the system doing the generation could freely do
either:

        http://example.org/=rdfgensym=/668a93dc-e478-4c47-af45-f062b449cd21

or

        tag:example.org,2011:=rdfgensym=/668a93dc-e478-4c47-af45-f062b449cd21
        
... based on whether it wants to support dereferencing or not.
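
As a rough Python sketch of that choice: the generator picks the http:
form when it can serve the URIs and the tag: form when it cannot. The
`mint_skolem_uri` name and its parameters are hypothetical, and the
random UUID suffix is just one way to get uniqueness.

```python
import uuid

def mint_skolem_uri(authority: str, dereferenceable: bool, year: int = 2011) -> str:
    """Mint a fresh Skolem URI in either the http: or the tag: scheme."""
    suffix = f"=rdfgensym=/{uuid.uuid4()}"
    if dereferenceable:
        # The generating system is prepared to answer GETs for this URL.
        return f"http://{authority}/{suffix}"
    # tag: URIs carry an authority and a date but are not meant to be fetched.
    return f"tag:{authority},{year}:{suffix}"

http_form = mint_skolem_uri("example.org", dereferenceable=True)
tag_form = mint_skolem_uri("example.org", dereferenceable=False)
```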

   -- Sandro

> - Steve
> 
> >>> 4) RDF either needs blank nodes, or not, if it does, then blank node identifiers are either needed in serializations or not, and then on the next level we have management of data which includes blank nodes - it would be nice if each of the three levels were cleanly separated and agreements made with respect to each. (general application of separation of concerns to this discussion).
> >> 
> >> RDF needs a way to mint onetime unique identifiers (a la AUTO_INCREMENT columns in RDBMS'), but they don't often need to be existential variables. It was a pretty odd decision to define bNodes that way, IMHO.
> >> 
> >>> 5) If one were to look at how we name things in RDF, starting from scratch, what would be the "perfect" approach? perhaps identifying this, then seeing if it can be used, or working out steps towards, or incorporating what was learned, would be beneficial. For example I've long thought that names as pairs ( namespace, localname ) would perhaps be an improvement, I'm not suggesting this, but perhaps the ideal fix given a blank sheet of paper should be defined.
> >> 
> >> Something with a syntax similar to bNodes (i.e. disjoint with URIs and Literals), but which just instructed the consumer to mint a unique ID for it. This is what the majority of RDF parsers, and triplestores do internally, but then they have to jump through a load of hoops to unwind that on export, often to the annoyance of users, who might like to use the persistent internal ID to refer to it in the future.
> >> 
> >> There are lots of cases when trying to represent data with complex structures where you need to label/identify a sub-structure, but don't really want to give it a URI.
> >> 
> >> - Steve
> >> 
> > 
> > 
> 
> -- 
> Steve Harris, CTO, Garlik Limited
> 1-3 Halford Road, Richmond, TW10 6AW, UK
> +44 20 8439 8203  http://www.garlik.com/
> Registered in England and Wales 535 7233 VAT # 849 0517 11
> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
> 
>
Received on Sunday, 27 March 2011 12:47:40 UTC