- From: Jeff Mixter <jeffmixter@gmail.com>
- Date: Wed, 29 May 2013 09:58:10 -0400
- To: Hugh Glaser <hg@ecs.soton.ac.uk>
- Cc: Luca Matteis <lmatteis@gmail.com>, Linked Data community <public-lod@w3.org>
- Message-ID: <CAC=429AtDOjRd-WU8__KS5wpGmMNnhRTRnjz2huUtk09iJi5Uw@mail.gmail.com>
I second Alfredo's suggestion of using D2RQ for the relational database
datasets. Additionally, if you have any CSV datasets, I would recommend
converting them into XML using a simple script (I have used a 20-line Python
script in the past) and then using XSLT to convert the XML into RDF/XML. From
there you can use Jena to convert the RDF/XML into whatever RDF serialization
you need.

Jeff Mixter
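A minimal sketch of the kind of 20-line CSV-to-XML script described above,
assuming a CSV file with a header row; the file names and element names here
are illustrative, not from the thread:

```python
# Rough CSV-to-XML converter of the kind mentioned above. The file names
# ("input.csv", "output.xml") and the element names ("rows", "row") are
# assumptions; adapt them to the dataset at hand.
import csv
import re
import xml.etree.ElementTree as ET

def csv_to_xml(csv_path, xml_path):
    root = ET.Element("rows")
    with open(csv_path, newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            row = ET.SubElement(root, "row")
            for header, value in record.items():
                # Derive a well-formed element name from each column header.
                tag = re.sub(r"[^A-Za-z0-9_]", "_", header.strip()) or "field"
                if tag[0].isdigit():
                    tag = "_" + tag
                ET.SubElement(row, tag).text = value
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)

csv_to_xml("input.csv", "output.xml")
```

The resulting XML can then be fed to an XSLT stylesheet that emits RDF/XML,
which Jena in turn can reserialize as Turtle, N-Triples, or whatever is
needed.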
On Tue, May 28, 2013 at 8:09 AM, Hugh Glaser <hg@ecs.soton.ac.uk> wrote:

> Great questions (and answers).
> Easy way:
> Suck it all into a store and republish.
> But you didn't want it easy, or you wouldn't have asked here :-)
>
> It seems from your question that you are sort of in the role of
> technology provider for these users, so I'll assume that you have quite
> a lot of control over their systems.
> There are of course technical publishing issues, which others are
> addressing, as well as the socio-technical ones, as has been pointed out.
> I think the "correct" thing to do is to publish as closely as possible
> to the source, pushing out to the data providers.
> A few reasons spring to mind:
> It is more robust - there is no single point of failure.
> The published dataset can stay up to date.
> The users will start to engage with LD, which is great; in fact, the LD
> publishing can then become part of their workflow. This means they have
> more awareness of the fact that the LD is available for their datasets,
> and they may even start to consume their own dataset, which is the point
> at which the dataset will no longer be vulnerable to rotting.
> If you build apps to consume (which I assume you actually want,
> otherwise why do this?), then you will provide a good monitor and check
> that things are properly webby; in fact, this is a really good
> "dog-food" reason for doing it as separate publishing.
> As my colleague Dave De Roure and I often say, let's put the Web in
> Semantic Web.
> Of course it will probably be more work for you to set up, and also in
> the short term, but in the long run you may well find that you can
> disengage from some users' activities, reducing your work.
>
> One thing to consider from the start is what URIs to use.
> This is often the one thing that makes other decisions moot.
> Do your users or you care about what URI bases are used?
> If no-one does, then you can do it the easy way, by publishing at a base
> of your choice.
> If someone cares (and I think people should), then it gets more
> complicated.
> Should the users' datasets be published on a base owned by the users,
> such as http://data.userdomain.com?
> This might be a good thing to do.
> However, if you are scraping and centrally republishing, it then means
> that you need to have control over their routing, and you might need to
> be able to do fancy things to cope.
> On the other hand, it may be that your users are not able to have a
> decent base for their URIs (for example because of company
> restrictions), and in that case you clearly lean towards scraping and
> publishing.
>
> So, perhaps surprisingly, often the first thing to think about is the
> URIs, and then the social structures and technology follow.
>
> Best
> Hugh
>
> On 28 May 2013, at 10:22, Luca Matteis <lmatteis@gmail.com> wrote:
>
> > Thanks, Jürgen. Are you at #eswc2013? Maybe we can talk about this
> > face to face :-)
> > But anyway, my two points were related to (i) letting my users do the
> > work of publishing LOD or (ii) doing the work myself by aggregating
> > their data.
> >
> > Cheers,
> > Luca
> >
> > On Tue, May 28, 2013 at 11:07 AM, Jürgen Jakobitsch SWC
> > <j.jakobitsch@semantic-web.at> wrote:
> > :-) Experience shows that the technical aspect of your endeavour is
> > probably the simplest, and you'll have a lot of time to think about it
> > until every group settles on a URI pattern and the vocabularies to be
> > used, unless you go North Korean and impose such things... When you
> > have a couple of datasets, the probability of one single solution that
> > fits all parties is very low.
> > Such decisions depend on a lot of non-technical factors, like
> > willingness to move to the rdf/semweb/linkeddata world, and whether
> > there are current workflows that groups of people are using.
> >
> > Technically it depends on things like dataset size and use cases: is
> > it enough to simply make this data dereferenceable, or is there a need
> > to make the data queryable, and with what kinds of queries? (Certain
> > parts are quite difficult to implement when mapping SPARQL to SQL -
> > LIMIT and TOP in certain cases.)
> >
> > I guess the => fastest <= (not necessarily the best) way would be to
> > create dumps (custom scripts, rdb2rdf) [a sketch of such a dump script
> > follows the quoted thread below] and put these into Virtuoso or a
> > triple store of your choice, in combination with tools like
> > "pubby" [2]. Then use "limes" [1] or another tool to create links to
> > other LOD sources. That way a change in people's behaviour is not a
> > requirement for success.
> >
> > wkr jürgen
> >
> > [1] http://aksw.org/Projects/LIMES.html
> > [2] http://wifo5-03.informatik.uni-mannheim.de/pubby/
> >
> > On Tue, 2013-05-28 at 10:18 +0200, Luca Matteis wrote:
> > > Here's my scenario: I have several different datasets. Most are in
> > > MySQL databases, some in PostgreSQL, others in MS Access, and many
> > > in CSV. Each one of these datasets is maintained by its own group of
> > > people.
> > >
> > > Now, my end goal is to have all these datasets published as 5-star
> > > Linked Open Data. But I am torn between these two solutions:
> > >
> > > 1) Give a generic wrapper tool to each of these groups of people
> > > that would basically convert their datasets to RDF and allow them to
> > > publish this data as LOD automatically. This tool would let them
> > > publish LOD on their own, using their own server (does such a
> > > generic tool even exist? Can it even be built?).
> > >
> > > 2) Scrape these datasets, which are at times simply published on the
> > > Web as paginated HTML tables, or published as dumps on their server,
> > > for example a .CSV dump of their entire database. Then I would
> > > aggregate all these various datasets myself and publish them as
> > > Linked Data.
> > >
> > > Pros and cons for each of these methods? Any other ideas?
> > >
> > > Thanks!
> >
> > --
> > | Jürgen Jakobitsch,
> > | Software Developer
> > | Semantic Web Company GmbH
> > | Mariahilfer Straße 70 / Neubaugasse 1, Top 8
> > | A - 1070 Wien, Austria
> > | Mob +43 676 62 12 710 | Fax +43.1.402 12 35 - 22
> >
> > COMPANY INFORMATION
> > | web   : http://www.semantic-web.at/
> > | foaf  : http://company.semantic-web.at/person/juergen_jakobitsch
> > PERSONAL INFORMATION
> > | web   : http://www.turnguard.com
> > | foaf  : http://www.turnguard.com/turnguard
> > | g+    : https://plus.google.com/111233759991616358206/posts
> > | skype : jakobitsch-punkt
> > | xmlns:tg = "http://www.turnguard.com/turnguard#"

--
Jeff Mixter
jeffmixter@gmail.com
440-773-9079
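For the dump-based route Jürgen outlines, a minimal sketch of a custom dump
script, assuming Python with rdflib and a CSV export of one dataset; the base
URI, property namespace, and file names are illustrative assumptions, not
from the thread:

```python
# Build an RDF dump from a CSV export. The URI bases and file names below
# are assumptions for illustration only.
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

BASE = Namespace("http://data.example.org/dataset/")   # assumed URI base
PROP = Namespace("http://data.example.org/property/")  # assumed vocabulary

g = Graph()
with open("export.csv", newline="", encoding="utf-8") as f:  # assumed export
    for i, record in enumerate(csv.DictReader(f)):
        subject = BASE[f"record/{i}"]  # mint one URI per row
        g.add((subject, RDF.type, PROP.Record))
        for header, value in record.items():
            if value:
                prop = PROP[header.strip().replace(" ", "_")]
                g.add((subject, prop, Literal(value)))

# Write an N-Triples dump that Virtuoso (or any other store) can bulk-load;
# a front end such as Pubby can then make the minted URIs dereferenceable.
g.serialize(destination="dump.nt", format="nt")
```

Note that choosing BASE here is exactly the URI-base decision Hugh raises
above: if the users should own their identifiers, that constant would point
at a domain they control instead.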
Received on Wednesday, 29 May 2013 13:58:42 UTC