Re: Best practice for exposing proprietary databases or services as SPARQL endpoints

Martin,
Thank you for the question; both it and the answers are very informative for
me. I am currently examining RDF as an alternative or addition to a solution
we are building. The architecture of that solution may be of interest with
respect to some of your requirements.

2. Scalability - the SPARQL endpoint must handle tens of thousands of
requests per hour
3. Resource management for the endpoint - it must be possible to protect the
SPARQL endpoint from costly queries and return just a subset or refuse a
query
4. Resource management for the underlying RDBMS or Web services - it must be
possible to protect the original RDBMS and involved Web services from
excessive traffic (both willful ("Semantic DDoS") and unintentional (PhD
students' Python scripts gone wild)).

2. Scalability - this is being tackled using a high-availability pattern.
There will be one read/write master instance and several read-only slave
instances. Each instance is otherwise a clone deployed in its own VM. The
solution is modularised; communication between modules is internal HTTP, and
communication between instances is mediated behind a Varnish security layer.
3. Resource management for the endpoint - Varnish also handles Edge Side
Includes, acting as a cache. A TTL is placed in the headers of all served
content, and the design relies on new queries being able to assemble all or
part of their content from the cache. Queries are JSON, and their atomic
elements have IDs, which facilitates this.
4. Resource management for the underlying RDBMS or Web services - as in 3.,
plus:

   1. Careful design of the JSON queries. I am not sure how this might
   translate to SPARQL, but the JSON queries allow a meta query which returns
   what is available to be queried in a domain. Within a domain you cannot
   simply query for everything; if you try (with something ending in /*), a
   sensible subset is returned, depending on the path leading to the /*.
   2. Web services - this is different and would depend on the service and
   what you need from it. We have the problem of dynamic data and have not yet
   decided what combination of large versus small queries, storing results in
   the database, and letting results live in the Varnish cache layer will be
   optimal. There is also the possibility of using the Hibernate cache here.
   We have had several discussions and come up with tentative solutions, but I
   think that in the end we will develop a solution within an architecture and
   development process flexible enough to let us introduce other approaches as
   we understand the implications better. Here I am talking about performance
   and availability to the client, plus good citizenship w.r.t. the web
   services.
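To make the caching idea in point 3 concrete, here is a rough sketch in
Python of how fragments with TTLs can let later queries be assembled from
the cache. Everything here (the names make_fragment, assemble, the in-memory
CACHE dict) is invented for illustration; our actual layer is Varnish with
ESI, not application code.

```python
# Illustrative sketch only: an in-memory stand-in for a Varnish-style
# fragment cache. Each atomic JSON element has an ID; responses carry a
# TTL header, and later queries reuse any unexpired fragments.
import json
import time

CACHE = {}  # fragment-id -> (expires_at, payload)

def make_fragment(fragment_id, payload, ttl=60):
    """Serve a JSON fragment with a TTL header, caching it under its ID."""
    headers = {"Cache-Control": f"max-age={ttl}", "X-Fragment-Id": fragment_id}
    CACHE[fragment_id] = (time.time() + ttl, payload)
    return headers, json.dumps(payload)

def assemble(fragment_ids):
    """Assemble a response from cached fragments; report the misses,
    which are the only parts that need to hit the backend RDBMS."""
    hits, misses = {}, []
    now = time.time()
    for fid in fragment_ids:
        entry = CACHE.get(fid)
        if entry and entry[0] > now:
            hits[fid] = entry[1]
        else:
            misses.append(fid)
    return hits, misses
```

The point is that the backend only ever sees the misses, which is what
protects the RDBMS and web services from repeated identical traffic.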
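The "sensible subset" rule for wildcard queries in point 4.1 could be
sketched roughly as below. The paths and field names are entirely invented
for illustration; only the shape of the rule (a /* query never returns
everything, just a path-dependent default subset) reflects our design.

```python
# Illustrative sketch: a query ending in /* is answered with a sensible
# default subset chosen by the path leading up to the /*, never with
# everything. Paths and field names here are hypothetical.
DEFAULT_SUBSETS = {
    "catalogue": ["id", "name"],                    # shallow listing only
    "catalogue/products": ["id", "name", "price"],  # a fuller product view
}

def resolve_query(path):
    """Map a query path to the fields that will actually be returned."""
    if path.endswith("/*"):
        parent = path[:-2]
        # Unknown wildcard paths fall back to the narrowest subset.
        return DEFAULT_SUBSETS.get(parent, ["id"])
    # An explicit field request returns just that field.
    return [path.rsplit("/", 1)[-1]]
```

This is one way the endpoint can refuse to answer "give me everything"
while still returning something useful, which was the behaviour Martin
asked for in requirement 3.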

I realise none of this is SPARQL-world specific. I can't help with that
side, and don't know how well these ideas might transfer over.
Meanwhile I remain interested in this area and in automatic RDF creation (or
in understanding RDF/OWL far better so that I can hand-craft!), because our
solution has some inflexibility and requires a good deal of developer
effort.

HTH.

Best,
Adam

On 15 December 2010 01:21, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:

> Dear all:
> Are there really no experiences beyond academic research regarding this
> task? I had assumed it was a pretty standard requirement...
>
> Best
>
> Martin
>
> On 11.12.2010, at 09:33, Martin Hepp wrote:
>
>  Dear all:
>>
>> There are many different ways of exposing existing relational databases as
>> SPARQL, e.g. as summarized by [1], namely Virtuoso's RDF Views, D2RQ, and
>> Triplify.
>>
>> I am looking for best practices / recommendations for the following
>> scenario:
>>
>> 1. There is a large and highly dynamic product or services database; part
>> of the data (e.g. prices) may even come from external Web services (think of
>> airfare, hotel prices).
>> 2. I want to make this accessible as a SPARQL endpoint using GoodRelations
>> and FOAF.
>> 3. The mapping from the original data structures to the proper RDF must be
>> hand-crafted anyway, so automation of this process is not important.
>> 4. Creating RDF dumps is not feasible due to
>>
>> - the dynamics of the data
>> - the combinatorial complexity (not all combinations may be materialized
>> in the database; think of product variants).
>>
>> Key requirements for me are:
>>
>> 1. Maturity of the software (alpha / beta releases are no option)
>> 2. Scalability - the SPARQL endpoint must handle tens of thousands of
>> requests per hour
>> 3. Resource management for the endpoint - it must be possible to protect
>> the SPARQL endpoint from costly queries and return just a subset or refuse a
>> query
>> 4. Resource management for the underlying RDBMS or Web services - it must
>> be possible to protect the original RDBMS and involved Web services from
>> excessive traffic (both willful ("Semantic DDoS") and unintentional (PhD
>> students' Python scripts gone wild)).
>>
>> What would you recommend? My main point is really: Which tools /
>> architecture would you recommend if failure is not an option?
>>
>> Thanks for any opinions!
>>
>>
>> Best
>>
>> Martin
>>
>> [1] A Survey of Current Approaches for Mapping of Relational Databases to
>> RDF (PDF), Satya S. Sahoo, Wolfgang Halb, Sebastian Hellmann, Kingsley
>> Idehen, Ted Thibodeau Jr, Sören Auer, Juan Sequeda, Ahmed Ezzat, 2009-01-31.
>> http://www.w3.org/2005/Incubator/rdb2rdf/RDB2RDF_SurveyReport.pdf
>>
>> --------------------------------------------------------
>> martin hepp
>> e-business & web science research group
>> universitaet der bundeswehr muenchen
>>
>> e-mail:  hepp@ebusiness-unibw.org
>> phone:   +49-(0)89-6004-4217
>> fax:     +49-(0)89-6004-4620
>> www:     http://www.unibw.de/ebusiness/ (group)
>>        http://www.heppnetz.de/ (personal)
>> skype:   mfhepp
>> twitter: mfhepp
>>
>> Check out GoodRelations for E-Commerce on the Web of Linked Data!
>> =================================================================
>> * Project Main Page: http://purl.org/goodrelations/
>> * Quickstart Guide for Developers: http://bit.ly/quickstart4gr
>> * Vocabulary Reference: http://purl.org/goodrelations/v1
>> * Developer's Wiki: http://www.ebusiness-unibw.org/wiki/GoodRelations
>> * Examples: http://bit.ly/cookbook4gr
>> * Presentations: http://bit.ly/grtalks
>> * Videos: http://bit.ly/grvideos
>>
>>
>>
>
>

Received on Friday, 17 December 2010 11:21:35 UTC