Re: Dealing with distributed nature of Linked Data and SPARQL from Paul Houle on 2016-06-08 (public-lod@w3.org from June 2016)

From: Paul Houle <ontology2@gmail.com>
Date: Wed, 8 Jun 2016 14:11:23 -0400
To: Rob Davidson <rob.les.davidson@gmail.com>
Cc: "Gray, Alasdair J G" <A.J.G.Gray@hw.ac.uk>, Martynas Jusevičius <martynas@graphity.org>, public-lod <public-lod@w3.org>, "public-declarative-apps@w3.org" <public-declarative-apps@w3.org>, James Anderson <james@dydra.com>, Arto Bendiken <arto@dydra.com>
Message-ID: <CAE__kdSSQDN3kURonWEV4ugAp7bytPXAMkZ+T2j5UQ8ZMHWMhg@mail.gmail.com>

You've got it!

What matters is what your system believes is owl:sameAs from its own
viewpoint, which could depend on who you trust to say owl:sameAs. If
you are worried about "inference crashes", pruning this data is the place to
start.
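
A minimal sketch of that pruning step, assuming the statements arrive as
quads whose fourth element names the source graph. The graph names and the
TRUSTED_SOURCES set are hypothetical, and plain tuples stand in for
whatever RDF library is actually used:

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical: the graph names (sources) this system trusts for identity claims.
TRUSTED_SOURCES = frozenset({"http://example.org/graph/curated"})


def prune_same_as(quads, trusted=TRUSTED_SOURCES):
    """Keep every quad except owl:sameAs statements from untrusted graphs."""
    kept = []
    for s, p, o, g in quads:
        if p == OWL_SAME_AS and g not in trusted:
            continue  # drop identity claims from sources we do not trust
        kept.append((s, p, o, g))
    return kept


# Illustrative data only: (subject, predicate, object, graph)
quads = [
    ("http://example.org/a", OWL_SAME_AS, "http://example.org/b",
     "http://example.org/graph/curated"),
    ("http://example.org/a", OWL_SAME_AS, "http://example.org/c",
     "http://example.org/graph/crawl"),
    ("http://example.org/a", "http://xmlns.com/foaf/0.1/name", "A",
     "http://example.org/graph/crawl"),
]
pruned = prune_same_as(quads)
```

Only the owl:sameAs statement from the untrusted graph is dropped; other
statements from that graph pass through, since the concern here is
identity claims specifically.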

You might want to apply algorithm X to a graph, but data Y fails to have
the property Z that X needs to succeed. It is a general problem if you are
sending a product downstream.

A processing module can massage a dataset so that the output graph Y always
has property Z, or it can fail and scream bloody murder if Z does not hold.
It can also emit warning messages that you could use to sweep for bad spots.
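
As a rough sketch of such a module, take "property Z" to be "every subject
has a foaf:name". That choice, and all the names below, are assumptions
made for illustration. In strict mode the module fails loudly; otherwise it
logs a warning per bad subject and passes only the conforming part
downstream:

```python
import logging

log = logging.getLogger("graph-check")

FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


class PropertyViolation(Exception):
    """Raised when the input graph lacks the property the consumer needs."""


def find_subjects_missing(triples, required_predicate):
    """Return, sorted, the subjects that have no value for required_predicate."""
    subjects = {s for s, _, _ in triples}
    satisfied = {s for s, p, _ in triples if p == required_predicate}
    return sorted(subjects - satisfied)


def ensure_property(triples, required_predicate, strict=True):
    """Return triples whose subjects satisfy the property.

    strict=True  -> raise PropertyViolation if any subject fails.
    strict=False -> log a warning for each failing subject and drop its triples.
    """
    bad = find_subjects_missing(triples, required_predicate)
    if bad and strict:
        raise PropertyViolation(
            f"{len(bad)} subject(s) lack {required_predicate}: {bad}"
        )
    for s in bad:
        log.warning("dropping %s: no %s", s, required_predicate)
    bad_set = set(bad)
    return [t for t in triples if t[0] not in bad_set]


# Illustrative data only: (subject, predicate, object)
triples = [
    ("http://example.org/alice", FOAF_NAME, "Alice"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/mbox", "mailto:alice@example.org"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/mbox", "mailto:bob@example.org"),
]
clean = ensure_property(triples, FOAF_NAME, strict=False)
```

The warnings collected in lenient mode are the "sweep for bad spots" list;
strict mode suits a pipeline where a bad input should stop the run.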


On Wed, Jun 8, 2016 at 1:50 PM, Rob Davidson <rob.les.davidson@gmail.com>
wrote:

> I'm not sure if I'm following exactly, so bear with me...
>
> If we have the same entity served up by two different sources then we
> might expect in an ideal world that there would be an owl:sameAs or
> skos:exactMatch linking the two.
>
> If we have the same entity served by the same provider but via two
> different endpoints then we might expect something a bit like a
> dcat:distribution link relating the two.
>
> Of course we might not have these specific links but I'm just trying to
> define the likely scenarios/use-cases.
>
> In either case, it's possible that the descriptions would be out of date
> and/or contradictory - this might cause inference crashes or simply be
> confusing if we tried to merge them too closely.
>
> Prioritising description fields based on the distribution method seems a
> little naive in that I might run either endpoint for a while, realise my
> users prefer the alternative and thus change technology in a direction
> unique to my users - not in a predictable fashion.
>
> So the only way I can see around this is to pool the descriptions but have
> them distinguished using the other metadata that indicates they come from
> different endpoints/sources/authors - keeping the descriptions on different
> graphs I suppose.
>
> On 8 June 2016 at 14:52, Paul Houle <ontology2@gmail.com> wrote:
>
>> The vanilla RDF answer is that the data gathering module ought to pack
>> all of the graphs it got into named graphs that are part of a data set and
>> then pass that towards the consumer.
>>
>> You can union the named graphs for a primitive but effective kind of
>> "merge", or put in some module downstream that composites the graphs in some
>> arbitrary manner, such as something that converts statements about people
>> to the foaf: vocabulary to produce enough of a graph to pipe downstream
>> to a foaf: consumer, for instance.
>>
>> The named graphs give you sufficient anchor points to fill up another
>> dataset with metadata about what happened during processing, so you
>> can follow "who is responsible for fact X?" past the initial data
>> transformations.
>>
>> On Wed, Jun 8, 2016 at 8:29 AM, Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk>
>> wrote:
>>
>>> Hi
>>>
>>> Option 3 seems sensible, particularly if you keep them in separate
>>> graphs.
>>>
>>> However, shouldn’t you consider the provenance of the sources and
>>> prioritise them by how recently they were updated?
>>>
>>> Alasdair
>>>
>>> On 8 Jun 2016, at 13:06, Martynas Jusevičius <martynas@graphity.org>
>>> wrote:
>>>
>>> Hey all,
>>>
>>> we are developing software that consumes data both from Linked Data
>>> and SPARQL endpoints.
>>>
>>> Most of the time, these technologies complement each other. We've come
>>> across an issue though, which occurs in situations where an RDF
>>> description of the same resource is available using both of them.
>>>
>>> Let's take a resource http://data.semanticweb.org/person/andy-seaborne
>>> as an example. Its RDF description is available in at least 2
>>> locations:
>>> - on a SPARQL endpoint:
>>>
>>> http://xmllondon.com/sparql?query=DESCRIBE%20%3Chttp%3A%2F%2Fdata.semanticweb.org%2Fperson%2Fandy-seaborne%3E
>>> - as Linked Data: http://data.semanticweb.org/person/andy-seaborne/rdf
>>>
>>> These descriptions could be identical (I haven't checked), but it is
>>> more likely than not that they're out of sync, complementary, or
>>> possibly even contradicting each other, if reasoning is considered.
>>>
>>> If a software agent has access to both the SPARQL endpoint and Linked
>>> Data resource, what should it consider as the resource description?
>>> There are at least 3 options:
>>> 1. prioritize SPARQL description over Linked Data
>>> 2. prioritize Linked Data description over SPARQL
>>> 3. merge both descriptions
>>>
>>> I am leaning towards #3 as the sensible solution. But then I think the
>>> end-user should be informed which part of the description came from
>>> which source. This would be problematic if the descriptions are
>>> triples only, but should be doable with quads. That leads to another
>>> problem however, that both LD and SPARQL responses are under-specified
>>> in terms of quads.
>>>
>>> What do you think? Maybe this is a well-known issue, in which case
>>> please enlighten me with some articles :)
>>>
>>>
>>> Martynas
>>> atomgraph.com
>>> @atomgraphhq
>>>
>>>
>>> Alasdair J G Gray
>>> Fellow of the Higher Education Academy
>>> Assistant Professor in Computer Science,
>>> School of Mathematical and Computer Sciences
>>> (Athena SWAN Bronze Award)
>>> Heriot-Watt University, Edinburgh UK.
>>>
>>> Email: A.J.G.Gray@hw.ac.uk
>>> Web: http://www.macs.hw.ac.uk/~ajg33
>>> ORCID: http://orcid.org/0000-0002-5711-4872
>>> Office: Earl Mountbatten Building 1.39
>>> Twitter: @gray_alasdair
>>>
>>>
>>
>>
>>
>> --
>> Paul Houle
>>
>> *Applying Schemas for Natural Language Processing, Distributed Systems,
>> Classification and Text Mining and Data Lakes*
>>
>> (607) 539 6254    paul.houle on Skype   ontology2@gmail.com
>>
>> :BaseKB -- Query Freebase Data With SPARQL
>> http://basekb.com/gold/
>>
>> Legal Entity Identifier Lookup
>> https://legalentityidentifier.info/lei/lookup/
>> <http://legalentityidentifier.info/lei/lookup/>
>>
>> Join our Data Lakes group on LinkedIn
>> https://www.linkedin.com/grp/home?gid=8267275
>>
>>
>


-- 
Paul Houle

Received on Wednesday, 8 June 2016 18:11:53 UTC