
Re: Best way for exposing Linked Open Data. Wrapper vs scrape

From: Jeff Mixter <jeffmixter@gmail.com>
Date: Wed, 29 May 2013 09:58:10 -0400
Message-ID: <CAC=429AtDOjRd-WU8__KS5wpGmMNnhRTRnjz2huUtk09iJi5Uw@mail.gmail.com>
To: Hugh Glaser <hg@ecs.soton.ac.uk>
Cc: Luca Matteis <lmatteis@gmail.com>, Linked Data community <public-lod@w3.org>
I second Alfredo's suggestion of using d2rq for the relational database
datasets.  Additionally, if you have any CSV datasets, I would recommend
just converting them into XML using a simple script (I have used a 20-line
Python script in the past) and then using XSLT to convert the XML into
RDF/XML.  From there you can use Jena to convert the RDF/XML into whatever
RDF serialization you need.
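
As a rough illustration, the CSV-to-XML step can be as small as the sketch
below (file names and the element-naming rule are made up, and it assumes
the CSV has a usable header row):

import csv
import xml.etree.ElementTree as ET

# made-up input/output file names; adjust to your dataset
root = ET.Element("records")
with open("dataset.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # assumes column headings work as XML element names
            tag = column.strip().replace(" ", "_")
            ET.SubElement(record, tag).text = value
ET.ElementTree(root).write("dataset.xml", encoding="utf-8",
                           xml_declaration=True)

An XSLT stylesheet then maps those elements onto your vocabulary to produce
RDF/XML, and Jena (for example its riot command-line tool) can re-serialize
the result as Turtle, N-Triples or whatever you need.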

Jeff Mixter


On Tue, May 28, 2013 at 8:09 AM, Hugh Glaser <hg@ecs.soton.ac.uk> wrote:

> Great questions (and answers).
> Easy way:
> Suck it all into a store and republish.
> But you didn't want it easy, or you wouldn't have asked here :-)
>
> It seems from your question that you are sort of in the role of technology
> provider for these users, so I'll assume that you have quite a lot of
> control over their systems.
> There are of course technical publishing issues, which others are
> addressing, as well as socio-technical ones, as has been pointed out.
> I think the "correct" thing to do is to publish as closely as possible to
> the source, pushing out to the data providers.
> A few reasons spring to mind:
> It is more robust - there is no single point of failure.
> The dataset published can stay up to date.
> The users will start to engage with LD, which is great; in fact, the LD
> publishing can then become part of their workflow. This means they become
> more aware that the LD is available for their datasets, and they may even
> start to consume their own dataset, which is the point at which the dataset
> will no longer be vulnerable to rotting.
> If you build apps to consume (which I assume you actually want, otherwise
> why do this?), then those apps give you a good way to monitor and check
> that things are properly webby; in fact, this is a really good "dog-food"
> reason for doing the publishing separately.
> As my colleague Dave De Roure and I often say, let's put the Web in
> Semantic Web.
> Of course it will probably be more work for you to set up, and more work
> in the short term, but in the long run you may well find that you can
> disengage from some users' activities, reducing your work.
>
> One thing to consider from the start is: what URIs?
> This is often the one thing that makes other decisions moot.
> Do your users or you care about what URI bases are used?
> If no-one does, then you can do it the easy way, by publishing at a base
> of your choice.
> If someone cares (and I think people should), then it gets more
> complicated.
> Should the users' datasets be published on a base owned by the users, such
> as http://data.userdomain.com?
> This might be a good thing to do.
> However, if you are scraping and centrally republishing, it then means
> that you need to have control over their routing, and you might need to be
> able to do fancy things to cope.
> On the other hand, it may be that your users are not able to actually have
> a decent base for their URIs (for example because of company restrictions),
> and in that case you clearly lean towards scraping and publishing.
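>
> To make that concrete: one of the simpler "fancy things" is for the users'
> domain to keep ownership of the URIs but just forward every request to
> wherever you actually republish. A minimal sketch (host name and port below
> are made up):
>
> from http.server import BaseHTTPRequestHandler, HTTPServer
>
> CENTRAL = "http://lod.central-publisher.example/userdomain"  # made up
>
> class Forward(BaseHTTPRequestHandler):
>     def do_GET(self):
>         # 303-redirect requests for http://data.userdomain.com/... to the
>         # description served by the central publisher
>         self.send_response(303)
>         self.send_header("Location", CENTRAL + self.path)
>         self.end_headers()
>
> if __name__ == "__main__":
>     HTTPServer(("", 8080), Forward).serve_forever()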
>
> So, perhaps surprisingly, often the first thing to think about is the
> URIs, and then the social structures and technology follow.
>
> Best
> Hugh
>
> On 28 May 2013, at 10:22, Luca Matteis <lmatteis@gmail.com> wrote:
>
> > Thanks, Jürgen. Are you at #eswc2013? Maybe we can talk about this face
> > to face :-)
> > But anyway, my two points were (i) letting my users do the work of
> > publishing LOD or (ii) doing the work myself by aggregating their data.
> >
> > Cheers,
> > Luca
> >
> >
> > On Tue, May 28, 2013 at 11:07 AM, Jürgen Jakobitsch SWC <j.jakobitsch@semantic-web.at> wrote:
> > :-) experience shows that the technical aspect of your endeavor is
> > probably the simplest part, and you'll have a lot of time to think about
> > it until every group settles on a uri pattern and the vocabularies to be
> > used, unless you go north-korean and impose such things...
> > when you have a couple of datasets, the probability of one single
> > solution that fits all parties is very low.
> > such decisions depend on a lot of non-technical factors, like willingness
> > to move to the rdf/semweb/linkeddata world, or whether there are current
> > workflows that groups of people are using.
> >
> > technically it depends on things like dataset size and use cases (is it
> > enough to simply make this data dereferenceable, or is there a need to
> > make the data queryable? and if so, what kinds of queries? some parts of
> > sparql-to-sql translation are quite difficult to implement, e.g. limit
> > and top in certain cases)
> >
> > i guess the => fastest <= (not necessarily the best) way would be to
> > create dumps (custom scripts, rdb2rdf) and put these into a virtuoso or
> > a triple store of your choice in combination with tools like
> > "pubby" [2]. then use "limes" [1] or another tool to create links to
> > other lod sources. that way a change in people's behaviour is not a
> > requirement for success.
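> >
> > as a rough sketch of the load step (endpoint url and graph name below
> > are just placeholders; many stores, virtuoso included, expose the sparql
> > 1.1 graph store protocol for this):
> >
> > import requests  # assumes the dump is already serialized as turtle
> >
> > ENDPOINT = "http://localhost:8890/sparql-graph-crud"   # placeholder
> > GRAPH = "http://data.example.org/graph/mydataset"      # placeholder
> >
> > with open("mydataset.ttl", "rb") as f:
> >     r = requests.put(ENDPOINT,
> >                      params={"graph": GRAPH},
> >                      data=f,
> >                      headers={"Content-Type": "text/turtle"})
> > r.raise_for_status()
> > print("dump loaded into", GRAPH)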
> >
> > wkr jürgen
> >
> > [1] http://aksw.org/Projects/LIMES.html
> > [2] http://wifo5-03.informatik.uni-mannheim.de/pubby/
> >
> > On Tue, 2013-05-28 at 10:18 +0200, Luca Matteis wrote:
> > > Here's my scenario: I have several different datasets. Most in MySQL
> > > databases. Some in PostgreSQL. Others in MS Access. Many in CSV. Each
> > > one of these datasets is maintained by its own group of people.
> > >
> > >
> > > Now, my end goal is to have all these datasets published as 5-star
> > > Linked Open Data. But I am undecided between these two solutions:
> > >
> > >
> > > 1) Give a generic wrapper tool to each of these groups of people, that
> > > would basically convert their datasets to RDF, and allow them to
> > > publish this data as LOD automatically. This tool would allow them to
> > > publish LOD on their own, using their own server (does such a generic
> > > tool even exist? Can it even be built?).
> > >
> > >
> > > 2) Scrape these datasets, which are at times simply published on the
> > > Web as paginated HTML tables, or published as dumps on their server,
> > > for example a .CSV dump of their entire database. Then I would
> > > aggregate all these various datasets myself, and publish them as
> > > Linked Data.
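> > >
> > > (For illustration, the kind of per-dataset conversion I would end up
> > > writing in case 2 might look roughly like the sketch below; the column
> > > names, vocabulary and URIs are invented.)
> > >
> > > import csv
> > > from rdflib import Graph, Literal, Namespace
> > > from rdflib.namespace import RDF
> > >
> > > EX = Namespace("http://data.example.org/id/")   # invented base URI
> > > SCHEMA = Namespace("http://schema.org/")
> > >
> > > g = Graph()
> > > with open("scraped_dump.csv", newline="", encoding="utf-8") as f:
> > >     for row in csv.DictReader(f):               # invented columns
> > >         subject = EX[row["id"]]
> > >         g.add((subject, RDF.type, SCHEMA.Thing))
> > >         g.add((subject, SCHEMA.name, Literal(row["name"])))
> > > g.serialize(destination="dataset.ttl", format="turtle")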
> > >
> > >
> > > Pros and cons for each of these methods? Any other ideas?
> > >
> > >
> > > Thanks!
> >
> > --
> > | Jürgen Jakobitsch,
> > | Software Developer
> > | Semantic Web Company GmbH
> > | Mariahilfer Straße 70 / Neubaugasse 1, Top 8
> > | A - 1070 Wien, Austria
> > | Mob +43 676 62 12 710 | Fax +43.1.402 12 35 - 22
> >
> > COMPANY INFORMATION
> > | web       : http://www.semantic-web.at/
> > | foaf      : http://company.semantic-web.at/person/juergen_jakobitsch
> > PERSONAL INFORMATION
> > | web       : http://www.turnguard.com
> > | foaf      : http://www.turnguard.com/turnguard
> > | g+        : https://plus.google.com/111233759991616358206/posts
> > | skype     : jakobitsch-punkt
> > | xmlns:tg  = "http://www.turnguard.com/turnguard#"
> >
> >
>
>
>


-- 
Jeff Mixter
jeffmixter@gmail.com
440-773-9079