Re: Why skolemization?

On 2011-03-27, at 13:47, Sandro Hawke wrote:

> On Sun, 2011-03-27 at 02:22 +0100, Steve Harris wrote:
>> On 2011-03-27, at 00:23, Sandro Hawke wrote:
>>> On Sat, 2011-03-26 at 23:20 +0000, Steve Harris wrote:

[snip]

>>>> On 2011-03-26, at 17:07, Nathan wrote:
>>> But I don't understand your aversion to HTTP URIs for Skolem constants.
>>> You suggest that they wouldn't be appropriate because (a) there are lots
>>> of them, (b) they are short lived.  But, what's wrong with using URLs
>>> like this?
>>> 
>>>       http://garlik.com/=rdfgensym=/6135eb5943eaed2
>> 
>> Nothing at all in principle. I think there's an expectation that HTTP URIs should be long lived, Cool URIs and the like.
>> 
>> If you load the following document, delete it, then load it again:
>> 
>> _:x a :Thing .
>> 
>> You will end up generating two different skolem constants for the bNode, in Nstore at least. 
> 
> I mentioned this elsewhere in this thread as the most interesting/hard
> technical problem here.  I think of it mostly as making bnodes scope to
> the g-box.
> 
> If I say
>        store.load("http://example.org/g1")
>        store.load("http://example.org/g2")
> and g1 and g2 happen to return the same g-text containing bnodes, maybe:
>        _:x foaf:knows _:y
> then yeah, we'll have to Skolemize them differently.
> 
> But if I say:
>        store.load("http://example.org/g1")
> and then repeat it:
>        store.load("http://example.org/g1")
> and I get the same g-text, I think it's appropriate for the store to
> use the same Skolem constants.    If I get the same g-text with some
> more g-text-code appended, I'd also like to treat it as the same.  If a
> few triples are missing, I'd also like to treat it the same.   So, how
> do you do this, and where do you draw the line?  I'm not sure yet.

Yeah... my guess is that there are all kinds of tricky corner cases around this though.

It's probably quite computationally expensive too. I can imagine algorithms that would work for graphs of 100 triples, with 10 bNodes, but not 10 billion triples with 1 billion bNodes.
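
As a point of reference, the cheapest version of the load/reload behaviour described above can be sketched as follows. It is only an illustration, under the assumption that the g-text carries bNode labels and the parser exposes them (not every store does). The names skolemize_label and load are hypothetical, and the prefix is just the example URL from earlier in the thread.

```python
import hashlib

# Hypothetical prefix, borrowed from the example URL earlier in the thread.
SKOLEM_PREFIX = "http://garlik.com/=rdfgensym=/"

def skolemize_label(graph_uri: str, label: str) -> str:
    """Derive a skolem URI from the g-box URI plus the bNode label in the g-text."""
    digest = hashlib.sha256(f"{graph_uri}\x00{label}".encode("utf-8")).hexdigest()
    return SKOLEM_PREFIX + digest[:16]  # 16 hex digits = 64-bit suffix

def load(graph_uri, triples):
    """Replace labelled bNodes ('_:x') with skolem URIs scoped to graph_uri."""
    def term(t):
        return skolemize_label(graph_uri, t[2:]) if t.startswith("_:") else t
    return [tuple(term(t) for t in triple) for triple in triples]

g_text = [("_:x", "foaf:knows", "_:y")]
first = load("http://example.org/g1", g_text)
again = load("http://example.org/g1", g_text)  # same g-box, same constants
other = load("http://example.org/g2", g_text)  # different g-box, different constants
```

This handles only the exact-label case. It says nothing about g-text whose labels change between loads, which is where the hard cases start.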

> Some ideas:
> 
>      * If the blank node is labeled in the g-text, as it would have to
>        be in N-Triples, and might be in the other RDF serializations,
>        then just use that labeling.   (But maybe we can be more
>        aggressive than that - even if the label is not the same, maybe
>        we can treat it as the same?)

Not all stores preserve the label; the Nstores don't. Keeping the label hanging about would make them less efficient.

>      * If it has the same arcs to non-blank nodes, treat it as the same

That's the kind of thing that would be really expensive to detect on large graphs.

>      * Find whatever labeling produces a minimal number of changes, in
>        terms of adding & removing triples

That's definitely going to be expensive on large graphs.
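
To make the "same arcs to non-blank nodes" idea concrete, here is a rough sketch that gives each bNode a signature computed from its arcs to and from non-blank nodes. It is an assumption-laden toy, not something any store discussed here implements: terms are plain strings, bNodes are written "_:label", and arc_signatures is a hypothetical name.

```python
import hashlib
from collections import defaultdict

def is_bnode(term: str) -> bool:
    return term.startswith("_:")

def arc_signatures(triples):
    """Map each bNode to a hash of its arcs to and from non-blank nodes."""
    arcs = defaultdict(set)
    for s, p, o in triples:
        if is_bnode(s) and not is_bnode(o):
            arcs[s].add(("out", p, o))
        if is_bnode(o) and not is_bnode(s):
            arcs[o].add(("in", p, s))
    # Sort the arcs so the signature does not depend on triple order.
    return {
        node: hashlib.sha256(repr(sorted(a)).encode("utf-8")).hexdigest()[:16]
        for node, a in arcs.items()
    }

old = [("_:a", "rdf:type", ":Thing"), ("_:a", "foaf:name", '"Al"')]
new = [("_:b", "foaf:name", '"Al"'), ("_:b", "rdf:type", ":Thing")]
# Different labels and triple order, same arcs, so the signatures match.
same = list(arc_signatures(old).values()) == list(arc_signatures(new).values())
```

Even this toy holds every bNode's arc set in memory, two bNodes with identical arcs get the same signature, and a bNode connected only to other bNodes gets no signature at all.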

> I need to think about this more, unless someone already knows the
> answer.
> 
> I guess one of the reasons to indicate which URIs are generated Skolem
> constants is if this algorithm turns out to have significant failure
> modes.

Honestly, I don't see us implementing anything like this in the Nstores.

>> That doesn't really sit well with HTTP URIs, for me. There's no technical issue, but if it was a different scheme you could set the expectation that the lifetime was just that of the enclosing document.
>> 
>> It will be difficult to enforce graph scope if it's just an HTTP scheme, as you may have no practical way to identify bNodes skolemised by other systems. I'm not yet bought into some magic substring that indicates skolemisation has taken place. That's maybe not an issue though, as it would cease to "be" a bNode once it was skolemised.
>> 
>>> That's a 64 bit suffix, and if you want, you can recognize the prefix
>>> and turn it back into a 64-bit value on input, for some special
>>> indexing.   What's great about it is that you can pass it on to systems
>>> which don't know about your particular SPARQL endpoint and they can find
>>> all the data again.  Assuming they have permission.  And assuming it
>>> hasn't been purged from the system, for legal reasons or whatever.  And
>>> if it has, you can make the URL 404, or even give some helpful error
>>> information.
>> 
>> We'd probably put a store-specific UUID in there as well, e.g. http://bnode.4store.org/e19863a0-580b-11e0-b8af-0800200c9a66/12345678 - though that's a bit of an eyeful. We won't be offering http://bnode.4store.org/ as a public redirection service though :) The hosting bills would be sizeable. I guess we could let store operators specify a skolem URI prefix, so they could make it dereferenceable if it was possible for their data... maybe I'm coming round to the idea.
>> 
>> For practical reasons it's good if the store can identify bNodes that it minted itself, they can be compressed more effectively. That shouldn't have any bearing on the standard though, other than not ruling it out.
>> 
>>> It sounds like a pretty good design to me.
>> 
>> Me too. I currently have a mild preference for a distinct URI scheme, but I'll sleep on it. HTTP URI skolem constants would definitely be an improvement over what we've got now.
>> 
>> Perhaps systems which have no practical way to make the skolem constants dereferenceable could use one scheme, and ones which do, another?
> 
> Well, yeah, I figured the system doing the generation could freely do
> either:
> 
>        http://example.org/=rdfgensym=/668a93dc-e478-4c47-af45-f062b449cd21
> 
> or
> 
>        tag:example.org,2011:=rdfgensym=/668a93dc-e478-4c47-af45-f062b449cd21
> 
> ... based on whether it wants to support dereference or not.

That would be fine; I thought you were advocating always using HTTP URIs.
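
For illustration, the two forms quoted above could be minted by something like the sketch below. The function names, the dereferenceable flag, and the fixed year in the tag: URI are assumptions made for the example; only the "=rdfgensym=" marker and the two URI shapes come from the thread.

```python
import uuid

# Marker substring taken from the example URIs in the thread.
MARKER = "=rdfgensym=/"

def mint_skolem(authority: str, dereferenceable: bool, year: int = 2011) -> str:
    """Mint an http: skolem URI if it can be dereferenced, else a tag: URI."""
    suffix = str(uuid.uuid4())
    if dereferenceable:
        return f"http://{authority}/{MARKER}{suffix}"
    return f"tag:{authority},{year}:{MARKER}{suffix}"

def is_skolem(uri: str) -> bool:
    """Recognise a generated constant purely by the marker substring."""
    return MARKER in uri

http_form = mint_skolem("example.org", dereferenceable=True)
tag_form = mint_skolem("example.org", dereferenceable=False)
```

Note that is_skolem depends entirely on the marker substring appearing in the URI.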

Magic URI substrings still don't quite sit well with me though.

- Steve

>>>>> 4) RDF either needs blank nodes, or not, if it does, then blank node identifiers are either needed in serializations or not, and then on the next level we have management of data which includes blank nodes - it would be nice if each of the three levels were cleanly separated and agreements made with respect to each. (general application of separation of concerns to this discussion).
>>>> 
>>>> RDF needs a way to mint one-time unique identifiers (a la AUTO_INCREMENT columns in RDBMSs), but they don't often need to be existential variables. It was a pretty odd decision to define bNodes that way, IMHO.
>>>> 
>>>>> 5) If one were to look at how we name things in RDF, starting from scratch, what would be the "perfect" approach? perhaps identifying this, then seeing if it can be used, or working out steps towards, or incorporating what was learned, would be beneficial. For example I've long thought that names as pairs ( namespace, localname ) would perhaps be an improvement, I'm not suggesting this, but perhaps the ideal fix given a blank sheet of paper should be defined.
>>>> 
>>>> Something with a syntax similar to bNodes (i.e. disjoint with URIs and Literals), but which just instructed the consumer to mint a unique ID for it. This is what the majority of RDF parsers, and triplestores do internally, but then they have to jump through a load of hoops to unwind that on export, often to the annoyance of users, who might like to use the persistent internal ID to refer to it in the future.
>>>> 
>>>> There are lots of cases when trying to represent data with complex structures where you need to label/identify a sub-structure, but don't really want to give it a URI.
>>>> 
>>>> - Steve

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

Received on Sunday, 27 March 2011 17:16:31 UTC