Re: RDF graph merging: How useful is it really? (was Re: Blank Nodes Re: Toward easier RDF: a proposal)

From: Paul Tyson <phtyson@sbcglobal.net>
Date: Sat, 01 Dec 2018 15:28:46 -0600
Message-ID: <1543699726.1680.15.camel@sbcglobal.net>
To: Hugh Glaser <hugh@glasers.org>
Cc: David Booth <david@dbooth.org>, semantic-web@w3.org

On Wed, 2018-11-28 at 21:58 +0000, Hugh Glaser wrote:
> Interesting.
> 
> This may be slightly off direct topic, but it is about how developers (me, in this case) do things, so possibly relevant.
> 
> I have moved to a slightly different way of doing things recently.
> In the context of building sites based on RDF data from a variety of sources.

Hugh, this looks similar to the approach I tumbled to some years ago. I
used to believe in the "unified schema" approach, or a hub-and-spoke
universal transformation system. Then I decided to take the enterprise
data as it was. This is not only easier, but pragmatic, because there's
always a reason--good, bad, or indifferent--why we find the data as it is.
Our job is not to reason why, or sanitize or improve it. The enterprise
just needs to make good use of it.

In my case the source was in SQL, so RDB2RDF fit the bill, applying as
little intelligence as possible, using the native table and field names
for URIs in the default format prescribed by R2RML.

Where my approach differs from yours is that I don't do "lifting" per
se, but instead use SPARQL CONSTRUCT (mostly driven from RIF source) to
"enrich" the dataset by adding useful triples where needed to connect
disparate data sources or serve some application purpose.
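
As a rough illustration of this kind of enrichment (not the RIF-driven
setup itself), here is a minimal sketch in plain Python that mimics what
a CONSTRUCT rule does: match a pattern across two naively mapped sources
and add a connecting triple. The namespaces, table and column names, and
the sameWorkerAs predicate are all hypothetical.

```python
# Hypothetical triples as they might come out of a naive relational mapping.
# All URIs and names here are invented for illustration.
HR = "http://example.com/hr/"
ERP = "http://example.com/erp/"
EX = "http://example.com/enriched/"

triples = {
    (HR + "EMP/1", HR + "EMP#BADGE_NO", "A-100"),
    (HR + "EMP/2", HR + "EMP#BADGE_NO", "A-200"),
    (ERP + "WORKER/77", ERP + "WORKER#BADGE", "A-100"),
}

# Roughly equivalent SPARQL, for comparison only (PREFIX declarations omitted):
#   CONSTRUCT { ?e ex:sameWorkerAs ?w }
#   WHERE     { ?e hr:BADGE_NO ?b . ?w erp:BADGE ?b }


def enrich(graph):
    """Return new triples linking HR employees to ERP workers by badge number."""
    # Index ERP workers by badge value so the join is a dictionary lookup.
    by_badge = {}
    for s, p, o in graph:
        if p == ERP + "WORKER#BADGE":
            by_badge.setdefault(o, []).append(s)
    added = set()
    for s, p, o in graph:
        if p == HR + "EMP#BADGE_NO":
            for worker in by_badge.get(o, []):
                added.add((s, EX + "sameWorkerAs", worker))
    return added


new_triples = enrich(triples)
enriched = triples | new_triples  # the source triples stay untouched
```

The point of the sketch is that the source triples are never rewritten;
the rule only adds triples alongside them.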

To support web applications, I prefer real-time SPARQL queries, returned
as SRX, pipelined through XSLT into HTML in a sort of linked data
approach. Another approach that I prototyped was client-side processing
with AJAX: SPARQL to JSON to HTML. But all of it is built on the native
"messy" data from disparate sources. So what you save in data
transformations you can spend on delivering it usefully to consumers.
Yes of course you are still "transforming" the ugly data, but you do it
on a micro scale, flexibly and purposefully, rather than globally and
generally.
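
To make the SRX-to-HTML step concrete, here is a minimal sketch that
renders a SPARQL XML results document as an HTML table. It uses Python's
xml.etree.ElementTree in place of XSLT, and the sample result document,
variable names, and URIs are invented for illustration; a real pipeline
would fetch the SRX from an endpoint.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace of the W3C SPARQL Query Results XML Format.
SRX_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# A small hand-written result document standing in for an endpoint response.
SAMPLE_SRX = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="part"/><variable name="label"/></head>
  <results>
    <result>
      <binding name="part"><uri>http://example.com/part/1</uri></binding>
      <binding name="label"><literal>Bracket &amp; bolt</literal></binding>
    </result>
  </results>
</sparql>"""


def srx_to_html(srx_text):
    """Render SPARQL XML results as an HTML table; URIs become links."""
    root = ET.fromstring(srx_text)
    names = [v.get("name") for v in root.findall("sr:head/sr:variable", SRX_NS)]
    rows = ["<tr>" + "".join(f"<th>{escape(n)}</th>" for n in names) + "</tr>"]
    for result in root.findall("sr:results/sr:result", SRX_NS):
        cells = {}
        for b in result.findall("sr:binding", SRX_NS):
            uri = b.find("sr:uri", SRX_NS)
            if uri is not None:
                href = escape(uri.text or "", quote=True)
                cells[b.get("name")] = f'<a href="{href}">{href}</a>'
            else:
                value = b.find("sr:literal", SRX_NS)
                if value is None:
                    value = b.find("sr:bnode", SRX_NS)
                text = value.text if value is not None and value.text else ""
                cells[b.get("name")] = escape(text)
        # Unbound variables produce an empty cell.
        rows.append(
            "<tr>" + "".join(f"<td>{cells.get(n, '')}</td>" for n in names) + "</tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


html_table = srx_to_html(SAMPLE_SRX)
```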

Regards,
--Paul
> 
> I used to process the stuff coming in, from csv, PDO, RDF or whatever, into the RDF I wanted as I imported it.
> Now, I am experimenting with doing it differently.
> I simply, and of course can quite quickly, convert the input into RDF using the most naive RDF structure possible.
> (At its simplest, for csv I would use a (constructed) URI for each row and a new predicate for each column, with the cell contents as object.)
> That is, the RDF is intended to capture all the source data, and only the source data.
> I stuff it in a Linked Data-enabled SPARQL endpoint (so that I have resolvable URIs for the source data records).
> If it is already in RDF, it may be that I can simply use the source servers themselves for this stage, if they provide the right services, and I am not breaking their terms of use doing a fair chunk of querying.
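
A minimal sketch of the naive CSV conversion described above, assuming a
header row: one constructed URI per row, one predicate per column, and
the cell contents as a plain literal, written out as N-Triples. The base
URI, the row/N and #column URI patterns, and the sample data are
hypothetical choices, not anything prescribed in this thread.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/source/people"  # hypothetical base URI


def _literal(text):
    """Escape a cell value as an N-Triples string literal."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def csv_to_ntriples(csv_text, base=BASE):
    """One subject URI per row, one predicate per column, cell text as object."""
    lines = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{base}/row/{row_number}>"
        for column, cell in row.items():
            if column is None or cell is None or cell == "":
                continue  # ragged rows and empty cells yield no triple
            predicate = f"<{base}#{quote(column.strip(), safe='')}>"
            lines.append(f"{subject} {predicate} {_literal(cell)} .")
    return "\n".join(lines)


SAMPLE = 'name,city\nAda,London\nBob,"New ""York"""\n'
ntriples = csv_to_ntriples(SAMPLE)
```

Nothing is interpreted or cleaned at this stage; every non-empty cell
becomes exactly one triple, so the output captures the source data and
only the source data.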
> 
> Then I create a process I also call “lifting” - it is different in detail to yours, but performs a similar function.
> I lift the data in the primitive RDF into the RDF that I want.
> And then I put it into one of my nice, clean stores from which the sites will be built.
> (The actual organisation of store granularity depends on source data size and other things.)
> 
> This seems to be great.
> My clean store has links to the source store records, which gives great provenance.
> I can also put links the other way, if I am actually using the source store for other things, which happens.
> I always know exactly what data I have acquired, and there is nothing (or little) hidden in the acquisition process.
> Any transformations are gathered into the one place of the lifting spec.
> I don’t have to look at bloody csv files or HTML source or whatever so much to work out what real RDF I want:- at that stage I look at my source RDF version using Linked Data (with a SPARQL endpoint as well) - what could be nicer than that? :-)
> Separation of concerns: I do the acquisition, and that is pretty much done; I can then experiment with exactly what RDF I want, without revisiting the source transformation.
> I can ignore any source data I don’t want. So, for example, with WikiData I can lose the fancy predicates and just keep the Direct ones, and for many sources I can discard all the variants of label predicates.
> 
> Perhaps the biggest thing is:
> The transformations can use all the knowledge I have been given. Very often when importing, you get things as records, or pages or whatever, about a single resource. But each record will have IDs for other things that are referenced elsewhere in the source dataset. If you have processed all the source data into a graph, you may have a lot more information about that resource and related resources to make better decisions about sameAsness, bNode identifiers etc.
> Basically, RDF is a great resource for doing data cleaning! :-)
> Get everything into RDF as soon as possible - then you can really think about it.
> 
> (Sorry if I reiterate stuff from your presentation, David, but I can’t access it at the moment.)
> 
> > On 28 Nov 2018, at 17:04, David Booth <david@dbooth.org> wrote:
> > 
> > On 11/28/18 9:15 AM, Hugh Glaser wrote:
> > > RDF -> RDF [translation] is hugely important for building
> > > stuff, to remove stuff, or convert into preferred ontologies.
> > 
> > Agreed.  In my experience it's needed in almost every RDF
> > application.
> > 
> > > . .  If there were good tools to do this (or even one :-),
> > > or maybe there is), that integrated with what people use,
> > > would that be useful?
> > 
> > Yes!   I have often used SPARQL to perform RDF-->RDF
> > translation, though we also experimented with ShExMap and
> > JavaScript in a previous project.
> > 
> > > That would encourage a library of transformation specs,
> > > such as dc->dct, xxx->skos etc.
> > 
> > We also experimented with the idea of creating a mapping hub
> > for sharing translation rules.  It was agnostic about the
> > "rules" language (including ShExMap and JavaScript), used
> > github for storing/sharing the rules themselves, and provided
> > a front-end for categorizing/finding existing translation
> > rules.  The idea is described on slide 53 (also attached):
> > http://tinyurl.com/YosemiteRoadmap20150709slides
> > 
> > We also built a rough POC (but don't expect it to be fully
> > functional):
> > https://mappinghub.github.io/
> > I still think this mapping hub idea has a *lot* of merit.
> > 
> > David Booth
> > <slide53.pdf>
> 
> 
Received on Saturday, 1 December 2018 21:29:15 UTC
