Re: Where to put the knowledge you add from Kingsley Idehen on 2011-10-12 (public-lod@w3.org from October 2011)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Wed, 12 Oct 2011 09:32:34 -0400
To: public-lod@w3.org
Message-ID: <4E9596F2.1080207@openlinksw.com>
On 10/12/11 8:49 AM, glenn mcdonald wrote:
> I agree with this entirely, and it's why I keep insisting that for 
> most purposes datasets should be expressed using local identifiers, 
> with all external linkages called out explicitly and/or externally. 
> owl:sameAs and the use of other people's identifiers for your own 
> nodes are equally dangerous. If I'm asserting that Brussels is the 
> capital of Belgium, I'm saying that my notion of Brussels is my notion 
> of "capital" of my notion of Belgium. I am the authority for that 
> assertion. Saying that my notion of Brussels, "capital" or Belgium 
> correspond with anybody else's notion of anything are separate 
> assertions, for which I do not have the same authority.
>
> For that matter, the proper interpretation of "correspond" depends on 
> the purpose: for some things, treating "correspond" as owl:sameAs may 
> be exactly right, and for some it might be utterly unacceptable. And 
> it's much easier to map a "corresponds" property to owl:sameAs if you 
> want to than to rewrite an entire dataset to undo the misapplication 
> of IDs or owl:sameAs.
>
> Think global, assert local.

Glenn / Hugh,

A data space admin (or authorized curator) can think globally and assert 
locally in the realm of Linked Data by doing the following:

1. Partition Datasets by Named Graph IRI
2. Make the main Dataset (e.g. DBpedia) the default graph for a given 
Linked Data Space (e.g. DBpedia and its SPARQL endpoint).
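
To make the two steps above concrete, here is a minimal sketch in Python 
of a toy quad store. It is not how Virtuoso or the DBpedia endpoint is 
implemented. The links graph IRI and the example.org node are made up 
for illustration (the actual NYT triples are not reproduced in this 
thread), and the query is a simplified variant of Hugh's, not a copy of it.

```python
# Sketch only: a toy quad store, not Virtuoso's or DBpedia's actual layout.
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DBO_CAPITAL = "http://dbpedia.org/ontology/capital"
DBR = "http://dbpedia.org/resource/"

DEFAULT_GRAPH = "http://dbpedia.org"              # graph IRI named in this thread
LINKS_GRAPH = "http://example.org/graph/links"    # hypothetical graph IRI
HUB = "http://example.org/id/some-country-node"   # hypothetical linking node

# Step 1: partition by named graph IRI (graph IRI -> set of triples).
store = defaultdict(set)


def load(graph_iri, triples):
    """Load triples into exactly one named graph."""
    store[graph_iri].update(triples)


def capitals_with_same_as(country, graphs):
    """Capitals of `country`, following owl:sameAs hubs found in `graphs`."""
    triples = set().union(*(store[g] for g in graphs))
    hubs = {s for (s, p, o) in triples if p == OWL_SAME_AS and o == country}
    countries = {country} | {
        o for (s, p, o) in triples if p == OWL_SAME_AS and s in hubs
    }
    return {o for (s, p, o) in triples if p == DBO_CAPITAL and s in countries}


# Step 2: the main dataset goes into the default graph.
load(DEFAULT_GRAPH, {
    (DBR + "Belgium", DBO_CAPITAL, DBR + "City_of_Brussels"),
    (DBR + "Lesotho", DBO_CAPITAL, DBR + "Maseru"),
})

# Linkage data goes into its own graph. These two links are invented to
# mimic the kind of bad owl:sameAs pair that produces Hugh's result.
load(LINKS_GRAPH, {
    (HUB, OWL_SAME_AS, DBR + "Belgium"),
    (HUB, OWL_SAME_AS, DBR + "Lesotho"),
})

default_only = capitals_with_same_as(DBR + "Belgium", [DEFAULT_GRAPH])
with_links = capitals_with_same_as(DBR + "Belgium", [DEFAULT_GRAPH, LINKS_GRAPH])

print(sorted(default_only))
print(sorted(with_links))
```

Scoped to the default graph alone, the toy query returns only 
City_of_Brussels. Maseru shows up only when the caller explicitly opts 
in to the links graph, which is the point of the partitioning.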

What went wrong here?

Hugh: yesterday, in our private exchange, I indicated to you that we 
(OpenLink Software) loaded the NYT dataset into the <http://dbpedia.org> 
Graph IRI, which is also the default graph of the DBpedia SPARQL 
endpoint. It should have been loaded into its own Named Graph with 
its own Graph IRI. After further investigation, that wasn't 100% 
accurate. Here's what happened, and it boils down to confusion about 
what constitutes the DBpedia 3.7 dataset:

1. http://wiki.dbpedia.org/Downloads37 -- there are many datasets on 
that page, but we loaded the lot (as has been the case in the past) into 
the graph IRI <http://dbpedia.org>

2. Then when I ran my simple check via: 
http://dbpedia.org/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdata.nytimes.com%2F60370132632367982721 
-- I assumed an errant load, which isn't the case since the NYT dataset 
was part of the post-final-qa payload (a tarball from our colleagues at 
Freie) which we loaded into the <http://dbpedia.org> graph.

Fix options:

1. We can easily remove the errant triples -- but we will need a list so 
we do this one time

2. Get NYT to fix their dataset once and for all; otherwise it will be 
quarantined in its own named graph and we'll keep a marker in place on it 
re. future loads until it's fixed.
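
As a rough illustration of the two fix options, here is a small Python 
sketch over a dict-of-sets store. The quarantine graph IRI, the 
"example-id" NYT subjects, and the helper names are all hypothetical. 
Only the <http://dbpedia.org> graph IRI and the http://data.nytimes.com/ 
prefix come from this thread, and this is not the actual procedure we 
would run on the server.

```python
# Sketch only: illustrative helpers, not an actual Virtuoso procedure.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DBO_CAPITAL = "http://dbpedia.org/ontology/capital"
DBR = "http://dbpedia.org/resource/"

DEFAULT_GRAPH = "http://dbpedia.org"
QUARANTINE_GRAPH = "http://example.org/graph/nyt-quarantine"  # hypothetical
NYT_PREFIX = "http://data.nytimes.com/"

# Invented triples standing in for the real payload.
CAPITAL = (DBR + "Belgium", DBO_CAPITAL, DBR + "City_of_Brussels")
NYT_1 = (NYT_PREFIX + "example-id-1", OWL_SAME_AS, DBR + "Belgium")
NYT_2 = (NYT_PREFIX + "example-id-2", OWL_SAME_AS, DBR + "Lesotho")


def fresh_store():
    """Everything mixed into the default graph, as in the current load."""
    return {DEFAULT_GRAPH: {CAPITAL, NYT_1, NYT_2}}


def remove_errant(store, graph_iri, errant):
    """Option 1: delete a supplied list of errant triples in one pass."""
    found = store.get(graph_iri, set()) & set(errant)
    store[graph_iri] = store.get(graph_iri, set()) - found
    return found


def quarantine(store, from_graph, to_graph, belongs_to_dataset, on_hold):
    """Option 2: move a dataset into its own named graph and mark it."""
    moved = {t for t in store.get(from_graph, set()) if belongs_to_dataset(t)}
    store[from_graph] = store.get(from_graph, set()) - moved
    store.setdefault(to_graph, set()).update(moved)
    on_hold.add(to_graph)  # marker consulted before any future load
    return moved


def is_nyt(triple):
    """Treat a triple as NYT data when its subject is an NYT URI."""
    return triple[0].startswith(NYT_PREFIX)


# Option 1 needs the explicit list; only what is listed gets removed.
store_a = fresh_store()
removed = remove_errant(store_a, DEFAULT_GRAPH, [NYT_1])

# Option 2 needs no list; the whole dataset is set aside until fixed.
store_b = fresh_store()
on_hold = set()
moved = quarantine(store_b, DEFAULT_GRAPH, QUARANTINE_GRAPH, is_nyt, on_hold)

print(len(removed), len(moved), sorted(on_hold))
```

The trade-off is visible even in the toy: option 1 leaves the rest of 
the NYT data in the default graph, so the list has to be complete, 
while option 2 pulls the whole dataset out of default-graph queries.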

Glenn:
Now, if you go through the archives of this mailing list, you'll see 
earlier posts where I pointed out this pattern to Hugh (maybe a year 
or two ago). As is really the case most of the time, your concerns are 
factored into what we do; I just need to find the right language for 
articulating that to you :-)


Kingsley
>
> glenn
>
>
> On Wed, Oct 12, 2011 at 7:55 AM, Hugh Glaser <hg@ecs.soton.ac.uk> wrote:
>
>
>     Hi.
>
>     I have argued for a long time that the linkage data (in particular
>     owl:sameAs and similar links) should not usually be mixed with the
>     knowledge being published.
>
>     Thus, for example as I discussed with Evan for the NYTimes site a
>     while ago, it is not a good thing to put the owl:sameAs links
>     (which were produced by a relatively unskilled individual over a
>     short period of time) at the same status as the other data, which
>     has been curated over decades by expert reporters.
>
>     These sameAs links have potentially very different trust,
>     provenance, licence, and possibly other non-functional attributes
>     from the substantive data.
>     Clearly they have different trust and provenance, but licence may
>     well be different, as the NYT may want people to take the triples
>     away to bring traffic to their site, while keeping the other
>     triples under more restricted licence.
>
>     Which brings me to an example of where things have recently gone
>     badly wrong.
>     I have reported a bug to the dbpedia team wherein the URIs for
>     countries have become deeply intertwingled.
>     Example queries are at the end of this message - they have to
>     explicitly do the owl:sameAs because the store does not do
>     owl:sameAs inference, but the outcome is that I can validly infer
>     answers such as "Maseru is the capital of Belgium".
>
>     Of course, mistakes happen, so I am not having a specific go at
>     dbpedia, which I still think is wonderful.
>
>     But the outcome is that I get very bad data from dbpedia.org
>     unexpectedly, which means I (and presumably anyone else) can't
>     reliably use dbpedia.org at all (because I use an inference
>     engine when I cache the data).
>     Had the dbpedia.org site simply stuck to the behaviour I was
>     sort of expecting of publishing data from wikipedia (possibly
>     publishing the linkage data elsewhere) I would have been in a
>     better position.
>
>     One of the issues here is to realise when we are actually adding
>     knowledge to a triplification process.
>     It is clear when things like owl:sameAs are added that knowledge
>     is being added.
>     However, people probably consider it less clear if URIs from
>     dbpedia or elsewhere are directly used that they are adding their
>     own knowledge.
>     In a similar way, such use introduces knowledge which may have
>     very different trust and provenance from the data being triplified.
>
>     Is this a good way to do things?
>
>     I would say not.
>     I have used a wide variety of Linked Data sources, and have found
>     problems with almost every one of them (possibly every significant
>     one).
>     The problems frequently relate to the extra knowledge that the
>     triplification process has introduced.
>     If only I could be given the data without, then I would not have
>     to reject the dataset.
>
>     Thanks for reading this far.
>     Best
>     Hugh
>
>     Query:
>     SELECT DISTINCT ?capital WHERE {
>      ?s owl:sameAs <http://dbpedia.org/resource/Belgium> .
>      ?s owl:sameAs ?country .
>      ?country <http://dbpedia.org/ontology/capital> ?capital .
>     }
>
>     As a URI:
>     http://dbpedia.org/snorql/?query=SELECT+DISTINCT+%3Fcapital+WHERE+%7B%0D%0A+%3Fs+owl%3AsameAs+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FBelgium%3E+.%0D%0A+%3Fs+owl%3AsameAs+%3Fcountry+.%0D%0A+%3Fcountry+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2Fcapital%3E+%3Fcapital+.%0D%0A%7D%0D%0A
>
>     Output:
>     capital
>     http://dbpedia.org/resource/City_of_Brussels
>     http://dbpedia.org/resource/Maseru
>
>
>     --
>     Hugh Glaser,
>                  Web and Internet Science
>                  Electronics and Computer Science,
>                  University of Southampton,
>                  Southampton SO17 1BJ
>     Work: +44 23 8059 3670, Fax: +44 23 8059 3045
>     Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
>     http://www.ecs.soton.ac.uk/~hg/
>
>
>


-- 

Regards,

Kingsley Idehen 
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Wednesday, 12 October 2011 13:32:59 UTC