Re: Best way for exposing Linked Open Data. Wrapper vs scrape

Great questions (and answers).
Easy way:
Suck it all into a store and republish.
But you didn't want it easy, or you wouldn't have asked here :-)

It seems from your question that you are sort of in the role of technology provider for these users, so I'll assume that you have quite a lot of control over their systems.
There are of course technical publishing issues, which others are addressing, as well as the socio-technical, as has been pointed out.
I think the "correct" thing to do is to publish as closely as possible to the source, pushing out to the data providers.
A few reasons spring to mind:
It is more robust - there is no single point of failure.
The dataset published can stay up to date.
The users will start to engage with LD, which is great; in fact, the LD publishing can then become part of their workflow. This means they have more awareness of the fact that LD is available for their datasets, and they may even start to consume their own datasets, which is the point at which a dataset stops being vulnerable to rot.
If you build apps to consume (which I assume you actually want, otherwise why do this?), then you will provide a good monitor and check that things are properly webby; in fact, this is a really good "dog-food" reason for doing it as separate publishing.
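One way to "dog-food" the published data and check that things are properly webby is a small script that dereferences a sample of the published URIs with an RDF Accept header. A minimal sketch in Python; the function names and the particular set of media types are my own choices, not something from this thread:

```python
import urllib.request

# Media types we treat as evidence that a URI dereferences to RDF.
RDF_TYPES = {
    "text/turtle",
    "application/rdf+xml",
    "application/ld+json",
    "application/n-triples",
}

def is_rdf_content_type(content_type):
    """Return True if an HTTP Content-Type header names an RDF serialization."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() in RDF_TYPES

def check_uri(uri, timeout=10):
    """Dereference a URI asking for Turtle; return (HTTP status, looks-like-RDF)."""
    req = urllib.request.Request(uri, headers={"Accept": "text/turtle"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, is_rdf_content_type(resp.headers.get("Content-Type", ""))
```

Run periodically over each dataset's URIs, this doubles as the monitor mentioned above: a user's publishing endpoint going stale shows up as failed dereferences or non-RDF responses.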
As my colleague Dave De Roure and I often say, let's put the Web in Semantic Web.
Of course it will probably be more work for you to set up, and more work in the short term, but in the long run you may well find that you can disengage from some users' activities, reducing your workload.

One thing to consider from the start is what URIs to use.
This is often the one thing that makes other decisions moot.
Do your users or you care about what URI bases are used?
If no-one does, then you can do it the easy way, by publishing at a base of your choice.
If someone cares (and I think people should), then it gets more complicated.
Should the users' datasets be published on a base owned by the users, such as http://data.userdomain.com?
This might be a good thing to do.
However, if you are scraping and centrally republishing, it then means that you need to have control over their routing, and you might need fancy tricks, such as redirects or a reverse proxy, to cope.
On the other hand, it may be that your users are not able to actually have a decent base for their URIs (for example because of company restrictions), and in that case you clearly lean towards scraping and publishing.

So, perhaps surprisingly, often the first thing to think about is the URIs, and then the social structures and technology follow.
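Whichever base wins, it helps if the URI-minting logic is one small, swappable function, so the base can change (your domain vs. something like http://data.userdomain.com) without touching the rest of the pipeline. A hedged sketch; the function name, path layout, and example values are illustrative only:

```python
from urllib.parse import quote

def mint_uri(base, dataset, local_id):
    """Build a stable URI for a record under a configurable base,
    percent-encoding the local identifier so any key is safe to use."""
    return "{}/{}/{}".format(
        base.rstrip("/"),
        quote(dataset, safe=""),
        quote(str(local_id), safe=""),
    )

# e.g. mint_uri("http://data.userdomain.com", "crops", "wheat 01")
#      -> "http://data.userdomain.com/crops/wheat%2001"
```

The design point is simply that the base is a parameter, not a constant: if a user later acquires a decent base of their own, only this call site changes, and the old base can be redirected.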

Best
Hugh

On 28 May 2013, at 10:22, Luca Matteis <lmatteis@gmail.com> wrote:

> Thanks, Jürgen. Are you at #eswc2013? Maybe we can talk about this face to face :-)
> But anyway my two points were related to (i) letting my users do the work of publishing LOD or (ii) doing the work myself by aggregating their data.
> 
> Cheers,
> Luca
> 
> 
> On Tue, May 28, 2013 at 11:07 AM, Jürgen Jakobitsch SWC <j.jakobitsch@semantic-web.at> wrote:
> :-) Experience shows that the technical aspect of your endeavor is
> probably the simplest, and you'll have a lot of time to think about it
> until every group settles on a URI pattern and the vocabularies to be
> used, unless you go North Korean and impose such things...
> When you have a couple of datasets, the probability of one single
> solution that fits all parties is very low.
> Such decisions depend on a lot of non-technical factors, like willingness
> to move to the RDF/SemWeb/Linked Data world, and whether there are
> current workflows that groups of people are already using.
> 
> Technically it depends on things like dataset size and use cases: is it
> enough to simply make this data dereferenceable, or is there a need to
> make the data queryable? (And if so, what kinds of queries? Certain
> parts, such as LIMIT and TOP, are quite difficult to implement in some
> cases when translating SPARQL to SQL.)
> 
> I guess the => fastest <= (not necessarily the best) way would be to
> create dumps (custom scripts, RDB2RDF) and put these into Virtuoso or
> a triple store of your choice, in combination with tools like
> "pubby" [2]. Then use "limes" [1] or another tool to create links to other
> LOD sources. That way, a change in people's behaviour is not a
> requirement for success.
> 
> wkr jürgen
> 
> [1] http://aksw.org/Projects/LIMES.html
> [2] http://wifo5-03.informatik.uni-mannheim.de/pubby/
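The "custom scripts" route Jürgen mentions can start very small: a script that turns a CSV dump into Turtle for loading into the triple store. A toy sketch, not an endorsement of skipping proper RDB2RDF mapping; the base URI, vocabulary path, and function name are all illustrative:

```python
import csv
import io
from urllib.parse import quote

def csv_to_turtle(csv_text, base, dataset, key_column):
    """Convert a CSV table to naive Turtle: one subject per row,
    one predicate per column, every value a plain string literal."""
    base = base.rstrip("/")
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = "<{}/{}/{}>".format(base, dataset, quote(row[key_column], safe=""))
        for column, value in row.items():
            predicate = "<{}/vocab/{}>".format(base, quote(column, safe=""))
            # Escape embedded quotes so the literal stays valid Turtle.
            lines.append('{} {} "{}" .'.format(subject, predicate, value.replace('"', '\\"')))
    return "\n".join(lines)
```

A real pipeline would type the literals, reuse existing vocabularies, and lean on an RDB2RDF tool for the relational sources, but the shape of the dump step is roughly this.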
> 
> On Tue, 2013-05-28 at 10:18 +0200, Luca Matteis wrote:
> > Here's my scenario: I have several different datasets. Most in MySQL
> > databases. Some in PostgreSQL. Others in MS Access. Many in CSV. Each
> > one of these datasets is maintained by its own group of people.
> >
> >
> > Now, my end goal is to have all these datasets published as 5-star
> > Linked Open Data. But I am in doubt between these two solutions:
> >
> >
> > 1) Give a generic wrapper tool to each of these groups of people, that
> > would basically convert their datasets to RDF, and allow them to
> > publish this data as LOD automatically. This tool would allow them to
> > publish LOD on their own, using their own server (does such a generic
> > tool even exist? Can it even be built?).
> >
> >
> > 2) Scrape these datasets, which are at times simply published on the
> > Web as HTML paginated tables, or published as dumps on their server,
> > for example a .CSV dump of their entire database. Then I would
> > aggregate all these various datasets myself, and publish them as
> > Linked Data.
> >
> >
> > Pros and cons for each of these methods? Any other ideas?
> >
> >
> > Thanks!
> 
> --
> | Jürgen Jakobitsch,
> | Software Developer
> | Semantic Web Company GmbH
> | Mariahilfer Straße 70 / Neubaugasse 1, Top 8
> | A - 1070 Wien, Austria
> | Mob +43 676 62 12 710 | Fax +43.1.402 12 35 - 22
> 
> COMPANY INFORMATION
> | web       : http://www.semantic-web.at/
> | foaf      : http://company.semantic-web.at/person/juergen_jakobitsch
> PERSONAL INFORMATION
> | web       : http://www.turnguard.com
> | foaf      : http://www.turnguard.com/turnguard
> | g+        : https://plus.google.com/111233759991616358206/posts
> | skype     : jakobitsch-punkt
> | xmlns:tg  = "http://www.turnguard.com/turnguard#"
> 
> 

Received on Tuesday, 28 May 2013 12:11:30 UTC