Re: RDF graph merging: How useful is it really? (was Re: Blank Nodes Re: Toward easier RDF: a proposal) from Paul Tyson on 2018-12-06 (semantic-web@w3.org from December 2018)

From: Paul Tyson <phtyson@sbcglobal.net>
Date: Thu, 06 Dec 2018 08:15:52 -0600
To: Hugh Glaser <hugh@glasers.org>
Cc: David Booth <david@dbooth.org>, semantic-web@w3.org
Message-ID: <1544105752.1575.10.camel@sbcglobal.net>
On Mon, 2018-12-03 at 10:51 +0000, Hugh Glaser wrote:
> Hi,
> 
> It does depend on what you are doing.
> If you want to reflect all the data from the sources in the app or whatever, then this may be the best way.
> Especially if there is at least a little commonality in the schema.
> 
> But if you only want fragments of each source, and their schemas differ greatly, I like to gather the selection and transformation in one place, and then have the knowledge I build the front end from clearly delineated, possibly in a different store.
> And of course the full source data may be commercially or personally sensitive, so it is very nice to have confidence that you are avoiding exposing the sensitive data because you can examine the lifting and convince yourself that the new store does not even contain it (and you can even run checks, possibly automated, to keep things that way.)
> 
> I’m afraid I don’t quite get how your queries can easily work over disparate datasets.
> Just finding the label to use for a URI is a nightmare.
> The datasets I see are likely to have rdfs:label, dc(t):title, skos:(pref)Label, bibo:xxx and a bunch I can’t recall, just for a paper title, for example.
> And the one of those I want will be different for different source datasets (a couple may both have rdfs:label, but I would choose dc:title from one and skos:prefLabel from another because of the way they have been built).
> How do you cope with this?

I'm some years past developing this, so don't recall all such details,
but my apps did not depend too much on rdfs:label. But if I did need
some uniformity in that area, I would probably write some rules to
generate the uniform predicates and implement them with SPARQL INSERT or
CONSTRUCT operations.
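
As a rough illustration of the kind of rule described above (a sketch
under stated assumptions, not the original implementation, which would
have been written as SPARQL INSERT or CONSTRUCT): the Python below
models triples as plain tuples tagged with their source dataset and
restates each source's preferred label under one uniform predicate. The
source names, the ex:label predicate, the sample data and the preference
table are all hypothetical.

```python
# Hypothetical sketch: each item is (subject, predicate, object, source).
# The label predicates are ones mentioned in the thread; the source names,
# ex:label and the sample data are invented for illustration only.

UNIFORM_LABEL = "ex:label"  # hypothetical application-specific predicate

# Which label predicate to trust for each source dataset (assumed values).
PREFERRED_LABEL = {
    "sourceA": "dc:title",
    "sourceB": "skos:prefLabel",
}


def uniform_label_triples(quads):
    """Return new quads restating each source's preferred label under the
    uniform predicate (playing the role of a CONSTRUCT query)."""
    new = set()
    for s, p, o, src in quads:
        if PREFERRED_LABEL.get(src) == p:
            new.add((s, UNIFORM_LABEL, o, src))
    return new


data = {
    ("ex:paper1", "rdfs:label", "paper1.pdf", "sourceA"),
    ("ex:paper1", "dc:title", "Toward Easier RDF", "sourceA"),
    ("ex:paper2", "rdfs:label", "untitled", "sourceB"),
    ("ex:paper2", "skos:prefLabel", "Graph Merging in Practice", "sourceB"),
}

# Playing the role of INSERT: keep the originals and add the derived triples.
enriched = data | uniform_label_triples(data)
```

Applications can then query ex:label alone. Because the rule is keyed on
the source, an rdfs:label that exists in a dataset but is not the
preferred one there is simply never copied across.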

As for the general problem of making queries across a heterogeneous
dataset, that does require understanding of the different schemas in
play. Since I had written the r2rml transformation specs for each
source, I knew them pretty well. I was a one-man team, so skimped on the
documentation, but it would not have been too hard to generate schema
documentation.

> Do you add the triples for your preferred label predicate in your RIF driven thingy? If so, what predicate will you be choosing to use, and how will you ignore that predicate if it appears in other datasets with data you don’t want? Or maybe you create an all-new predicate of your own for the labels?
> 
> Or do you embrace all the different predicates from the source as options in your realtime sparql queries?
> 
> I have a feeling that the use cases you are dealing with mean that my questions aren’t really relevant to you, but they may explain why I am currently doing it another way, for the stuff I am building.
> 

Perhaps. I didn't even think of my effort as "graph merging". It was
just an effective way to get a large subset of enterprise data behind a
sparql endpoint, which turned out to enable some useful applications.

Regards,
--Paul

> Best
> Hugh
> 
> > On 1 Dec 2018, at 21:28, Paul Tyson <phtyson@sbcglobal.net> wrote:
> > 
> > On Wed, 2018-11-28 at 21:58 +0000, Hugh Glaser wrote:
> >> Interesting.
> >> 
> >> This may be slightly off direct topic, but it is about how developers (me, in this case) do things, so possibly relevant.
> >> 
> >> I have recently moved to a slightly different way of doing things, in the context of building sites based on RDF data from a variety of sources.
> > 
> > Hugh, this looks similar to the approach I tumbled to some years ago. I
> > used to believe in the "unified schema" approach, or a hub-and-spoke
> > universal transformation system. Then I decided to take the enterprise
> > data as it was. This is not only easier, but pragmatic, because there's
> > always a reason--good, bad, or indifferent--why we find the data as it is.
> > Our job is not to reason why, or sanitize or improve it. The enterprise
> > just needs to make good use of it.
> > 
> > In my case the source was in SQL, so RDB2RDF fit the bill, applying as
> > little intelligence as possible, using the native table and field names
> > for uris in the default format prescribed by R2RML.
> > 
> > Where my approach differs from yours is that I don't do "lifting" per
> > se, but instead use SPARQL CONSTRUCT (mostly driven from RIF source) to
> > "enrich" the dataset by adding useful triples where needed to connect
> > disparate data sources or serve some application purpose.
> > 
> > To support web applications, I prefer realtime sparql queries, returned
> > as SRX, pipelined through xslt into HTML in a sort of linked data
> > approach. Another approach that I prototyped was client-side processing
> > with AJAX sparql to json to html. But all of it is built on the native
> > "messy" data from disparate sources. So what you save in data
> > transformations you can spend on delivering it usefully to consumers.
> > Yes of course you are still "transforming" the ugly data, but you do it
> > on a micro scale, flexibly and purposefully, rather than globally and
> > generally.
> > 
> > Regards,
> > --Paul
> >> 
> >> I used to process the stuff coming in, from csv, PDO, RDF or whatever, into the RDF I wanted as I imported it.
> >> Now, I am experimenting with doing it differently.
> >> I simply, and of course can quite quickly, convert the input into RDF using the most naive RDF structure possible.
> >> (At its simplest, for csv I would use a (constructed) URI for each row and a new predicate for each column, with the cell contents as object.)
> >> That is, the RDF is intended to capture all the source data, and only the source data.
> >> I stuff it in a Linked Data-enabled SPARQL endpoint (so that I have resolvable URIs for the source data records).
> >> If it is already in RDF, it may be that I can simply use the source servers themselves for this stage, if they provide the right services, and I am not breaking their terms of use by doing a fair chunk of querying.
> >> 
> >> Then I create a process I also call “lifting” - it is different in detail to yours, but performs a similar function.
> >> I lift the data in the primitive RDF into the RDF that I want.
> >> And then I put it into one of my nice, clean stores from which the sites will be built.
> >> (The actual organisation of store granularity depends on source data size and other things.)
> >> 
> >> This seems to be great.
> >> My clean store has links to the source store records, which gives great provenance.
> >> I can also put links the other way, if I am actually using the source store for other things, which happens.
> >> I always know exactly what data I have acquired, and there is nothing (or little) hidden in the acquisition process.
> >> Any transformations are gathered into the one place of the lifting spec.
> >> I don’t have to look at bloody csv files or HTML source or whatever so much to work out what real RDF I want:- at that stage I look at my source RDF version using Linked Data (with a SPARQL endpoint as well) - what could be nicer than that? :-)
> >> Separation of concerns: I do the acquisition, and that is pretty much done; I can then experiment with exactly what RDF I want, without revisiting the source transformation.
> >> I can ignore any source data I don’t want. So, for example, with WikiData I can lose the fancy predicates and just keep the Direct ones, and for many sources I can discard all the variants of label predicates.
> >> 
> >> Perhaps the biggest thing is:
> >> The transformations can use all the knowledge I have been given. Very often when importing, you get things as records, or pages or whatever, about a single resource. But each record will have IDs for other things that are referenced elsewhere in the source dataset. If you have processed all the source data into a graph, you may have a lot more information about that resource and related resources to make better decisions about sameAsness, bNode identifiers etc.
> >> Basically, RDF is a great resource for doing data cleaning! :-)
> >> Get everything into RDF as soon as possible - then you can really think about it.
> >> 
> >> (Sorry if I reiterate stuff from your presentation, David, but I can’t access it at the moment.)
> >> 
> >>> On 28 Nov 2018, at 17:04, David Booth <david@dbooth.org> wrote:
> >>> 
> >>> On 11/28/18 9:15 AM, Hugh Glaser wrote:
> >>>> RDF -> RDF [translation] is hugely important for building
> >>>> stuff, to remove stuff, or convert into preferred ontologies.
> >>> 
> >>> Agreed.  In my experience it's needed in almost every RDF
> >>> application.
> >>> 
> >>>> . .  If there were good tools to do this (or even one :-),
> >>>> or maybe there is), that integrated with what people use,
> >>>> would that be useful?
> >>> 
> >>> Yes!   I have often used SPARQL to perform RDF-->RDF
> >>> translation, though we also experimented with ShExMap and
> >>> JavaScript in a previous project.
> >>> 
> >>>> That would encourage a library of transformation specs,
> >>>> such as dc->dct, xxx->skos etc.
> >>> 
> >>> We also experimented with the idea of creating a mapping hub
> >>> for sharing translation rules.  It was agnostic about the
> >>> "rules" language (including ShExMap and JavaScript), used
> >>> github for storing/sharing the rules themselves, and provided
> >>> a front-end for categorizing/finding existing translation
> >>> rules.  The idea is described on slide 53 (also attached):
> >>> http://tinyurl.com/YosemiteRoadmap20150709slides
> >>> 
> >>> We also built a rough POC (but don't expect it to be fully
> >>> functional):
> >>> https://mappinghub.github.io/
> >>> I still think this mapping hub idea has a *lot* of merit.
> >>> 
> >>> David Booth
> >>> <slide53.pdf>
> >> 
> >> 
> > 
> > 
>
Received on Thursday, 6 December 2018 14:16:27 UTC