Fwd: Question about Semantic Web from Jerven Bolleman on 2013-06-04 (public-lod@w3.org from June 2013)

From: Jerven Bolleman <me@jerven.eu>
Date: Tue, 4 Jun 2013 18:10:03 +0200
To: HCLS <public-semweb-lifesci@w3.org>, public-lod@w3.org
Message-ID: <CAHM_hUP2RGrwO-6f5eiFuBt91o6WbUM7u1EzXJFrX=NjiOC6gA@mail.gmail.com>
3rd try now without attachments as some SPAM filter seems to reject this
mail.
Hope not everyone gets 4 copies now :(

Regards,
Jerven


---------- Forwarded message ----------
From: Jerven Bolleman <me@jerven.eu>
Date: Mon, Jun 3, 2013 at 9:57 PM
Subject: Fwd: Question about Semantic Web
To: public-semweb-lifesci@w3.org


Dear all,

In my role as a UniProt developer I was asked a question about why use
SPARQL+RDF. I thought it could be interesting for others on this list as
well.

Regards,
Jerven

Hi Chris,

Thank you for your compliment. I will be giving another talk about this
at the BioHackathon 2013
(http://2013.biohackathon.org/documents/symposium). I hope this will
also be made available on YouTube by the kind DBCLS.


                                ***
                The following is my personal opinion only!
                I wear some rose tinted glasses in relation
                to SPARQL. But that is just my blood from
                banging my head on the relational/flat file
                walls.
                                ***


I understand the NCBI policy makers. They have heard many of the
promised benefits of the semweb before. Use ASN.1, it's such a great
standard. Oh sorry, nearly nobody uses ASN.1, use this XML thing
instead, it will be so easy to query your data with XPath. In the
meantime most users use the flat-file GenBank or MEDLINE files... And
you can't really deprecate a format once published (at least not
without an outcry).

When I started at UniProt just over 5 years ago I thought the same. Oh
great file format number 8 [1], do we really need another one? (I can
already hear the sigh coming from some of the experienced NCBI developers).
Today I say yes and the RDF one is the future of the UniProt formats
(far future, but future nonetheless).

Yet you must understand that using SPARQL or SQL is not an interesting
change in terms of biological science. There is theoretically nothing
possible using SPARQL that is not possible using SQL etc... or even clay
tablets. The only thing that changes is the number of slaves, oops I
mean PhD students, that are needed to get a result. I claim SPARQL+RDF
is more economically efficient in the aggregate than SQL, which is why
I support this move.

These are the same reasons that programmers mostly moved from C/Fortran
to Java or Perl and then in part to Ruby and Python. It is really hard
to make the argument that moving to Perl from C was necessary for
science reasons. The clear truth is that not needing to worry about
memory allocations or basic data structures allowed many more programs
to be created. Sure, you lose some efficiency at the CPU level but you
gain a massive efficiency at the programmer level. This is great
because the programmer is getting more and more expensive every year
while a CPU tick is decreasing in price all the time.

Back to the NCBI, where large databases keep on growing in size and,
even worse, complexity. It is financially impossible for a small lab to
fully integrate the knowledge contained in RefSeq or UniProt into their
own data infrastructure, especially if we include the need to keep
their data up to date. Just understand that these databases are nearly
terabytes in size when uncompressed and stored in a relational database
and have 100+ interlinked tables. And these are just 2 of the large-ish
public databases. Even if you think this work is trivial, why would the
NIH, and all the other funding agencies, pay hundreds of small labs to
do this work over and over again if they could fund 2 SPARQL endpoints
that all of their users could use? Is this not a form of useful cloud
computing?

But of course you could say: just make your SQL database available, like
UCSC does for their genome browser. Many bioinformaticians would cheer
this on. Yet there is one thing SPARQL has that SQL does not. SPARQL is
practically standardized; SQL is theoretically standardized. See the
differences between DB2, Virtuoso, Vertica, Oracle, MySQL and PostgreSQL
in practical terms. Is it "show tables" or "SELECT table_name FROM
user_tables"? Oh, it was LIST TABLES ;(.
Many SQL vendors don't even commit to supporting the ISO SQL standards.

Compare this to the SPARQL world. IBM and Oracle both fully support
SPARQL 1 (Oracle even using 2 databases! Spatial and NoSQL) as well as
Yarcdata (Cray), BigData (Systap), Virtuoso, Apache software foundation,
Sesame (2), Ontotext, Clark&Parsia, Markdata and many more I can't think
of. And for each of them the equivalent of show tables is
"SELECT DISTINCT ?type WHERE { ?s a ?type }". In the 5 years since the
standardization we actually have a lot of products that support the
whole of the SPARQL standard, something that the SQL world has not
managed in 21! I expect that of the above list at least 8 will be fully
SPARQL 1.1 compliant by the end of summer.
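
To make the "same query everywhere" point concrete, here is a minimal
sketch of how the show-tables style query would be sent to two stores.
The query is written with the braces that SPARQL uses for graph
patterns. The endpoint URLs are placeholders (assumptions, not real
services); the only thing taken from the standard is that a SPARQL
Protocol GET request carries the query text in the "query" parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# SPARQL counterpart of "show tables": every class that has instances.
LIST_TYPES_QUERY = "SELECT DISTINCT ?type WHERE { ?s a ?type }"

# Hypothetical endpoints, standing in for stores from different vendors.
ENDPOINTS = [
    "http://example.org/store-a/sparql",
    "http://example.org/store-b/sparql",
]


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL with the query in 'query'."""
    return endpoint + "?" + urlencode({"query": query})


# The same query text is reused unchanged for every endpoint.
urls = [sparql_get_url(endpoint, LIST_TYPES_QUERY) for endpoint in ENDPOINTS]

# Round trip: decoding the URL gives back exactly the query we sent.
decoded = parse_qs(urlsplit(urls[0]).query)["query"][0]

for url in urls:
    print(url)
```

Nothing in the request depends on which product answers it; only the
host part of the URL changes.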

This means that a choice for a SPARQL database by the NCBI does not
favor any database company. Also, one team may have certain requirements
of their datastore that others do not. Yet all of the datastores present
the exact same API to your users: SPARQL. Which means that if RefSeq
needs solution A then the PubMed team can use solution B without
negatively impacting your querying users.

Lastly, as my included presentation shows, the final killer feature is
the SERVICE keyword. Need to do analytical queries over two databases?
No need to download all the data, just use their SPARQL endpoints and
federated queries. In this case we used 2 different SPARQL solutions:
UniProt uses OWLIM and, I think, ALLIE and ChEMBL use Virtuoso. The
same works for querying between UniProt and Nature citation data even
though their endpoint is using software from The Stationery Office
(5Store, hah I could think of one more).
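
As a rough illustration of what such a federated query looks like, the
sketch below only builds the query text. SERVICE is the SPARQL 1.1
Federated Query keyword; the remote endpoint URL and the
example.org class are made up for the example and are not the real
UniProt, ALLIE or ChEMBL vocabularies.

```python
def federated_query(remote_endpoint: str) -> str:
    """Return a SPARQL query that joins local data with a remote endpoint.

    The outer pattern is evaluated by the endpoint the query is sent to;
    the SERVICE block is forwarded to `remote_endpoint`.
    """
    return f"""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?protein ?label
WHERE {{
  # Local part: a hypothetical class, assumed for illustration only.
  ?protein a <http://example.org/Protein> .
  # Remote part: evaluated by the other endpoint, joined on ?protein.
  SERVICE <{remote_endpoint}> {{
    ?protein rdfs:label ?label .
  }}
}}
LIMIT 10
"""


# Hypothetical remote endpoint URL.
query = federated_query("http://example.org/other/sparql")
print(query)
```

The join happens on the shared variable (?protein here), so neither
side has to download the other's data first.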

Then what about the popularity of XML or RDF? RDF for UniProt is close
to matching the popularity of XML and might have exceeded it (I will
have to look at the latest logs). While the SPARQL endpoint only gets
3500 queries a day, it has not been advertised or even linked from the
main uniprot.org website. This will stay this way for as long as the
SPARQL endpoint is in beta (as the hardware started throwing RAM errors
last week it might be a while ;( ). Yet those queries are answered, and
most of them could not be answered with our full text indexes on
uniprot.org or even our production SQL databases. Most importantly, the
SPARQL endpoint saved my bacon when a SAB member needed some very
specific data pronto.

Of course you have one important question and that is what does it cost
to provide a SPARQL endpoint? This is a good and valid question. The
answer of course depends...

On a greenfield project I think that, given comparable experience among
your staff, RDF+SPARQL is cheaper and more performant than a SQL
approach.

Why is SPARQL cheaper than SQL when starting from scratch?
1. The graph nature of a SPARQL endpoint allows you to use it as a
key-value store for your data at the same time as using it for your
complex searches.
2. JSON-LD and SPARQL/JSON give you a cacheable API for your
Web2.0/Ajax website to use without custom programmer development.
3. You do not need to design a separate data interchange format, you can
just use RDF.
4. Competitive tendering: moving from one SPARQL endpoint software to
another is a matter of days, i.e. you get the same answer and the only
difference is the speed at which you get it.
   Even using JPA or Hibernate, evaluating many SQL stores is not that easy!
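
To give an idea of point 2: every endpoint that returns SPARQL/JSON
uses the same W3C result layout ("head"/"vars" and
"results"/"bindings"), so a client only needs a few lines to read it.
The sketch below is an assumption-laden illustration: the sample
document and its example.org URIs are invented, only the layout comes
from the standard.

```python
import json

# Invented sample in the standard SPARQL JSON results layout.
SAMPLE = """{
  "head": {"vars": ["protein", "name"]},
  "results": {"bindings": [
    {"protein": {"type": "uri", "value": "http://example.org/protein/P1"},
     "name": {"type": "literal", "value": "Example protein"}},
    {"protein": {"type": "uri", "value": "http://example.org/protein/P2"}}
  ]}
}"""


def rows(document: str) -> list:
    """Flatten a SPARQL JSON result into a list of {variable: value}.

    Variables that are unbound in a solution are simply absent from its
    binding object; they are reported here as None.
    """
    data = json.loads(document)
    variables = data["head"]["vars"]
    return [
        {var: binding[var]["value"] if var in binding else None
         for var in variables}
        for binding in data["results"]["bindings"]
    ]


table = rows(SAMPLE)
for row in table:
    print(row)
```

Because the layout is identical across stores, this client code does
not change when the software behind the endpoint does.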

Of course greenfield programming is rare and won't be the case for most
projects at the NCBI. Yet even for old projects providing SPARQL/RDF can
be worth it. Firstly, it's not that expensive to provide RDF alongside
your existing XML. One intern can make a great XSLT in a few months.
You can make your SQL database available via a SPARQL mapper. Even
writing a SPARQL wrapper against CSV files is easy (days of work for a
good programmer).
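
The simplest version of the CSV case is not even a wrapper but a plain
conversion: turn each row into triples and load them into a store. The
sketch below does only that, writing N-Triples lines. The base URI, the
record/column naming scheme and the sample data are all assumptions for
illustration; a real conversion would reuse an existing vocabulary and
would type its literals.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace


def escape_literal(text: str) -> str:
    """Escape the characters that N-Triples requires in string literals."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def csv_to_ntriples(csv_text: str, id_column: str) -> list:
    """Emit one triple per non-empty cell, keyed by the id column."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}record/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            # Skip the key itself, empty cells and surplus unnamed fields.
            if column is None or column == id_column or not value:
                continue
            predicate = f"<{BASE}column/{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Invented sample data.
triples = csv_to_ntriples("id,name\nP1,Kinase A\nP2,\n", id_column="id")
for line in triples:
    print(line)
```

The direct, one-cell-one-triple mapping is deliberate: it converts the
serialization without redesigning the data model.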

There are risks and costs involved in starting down the semantic web
road. The first risk is to introduce more semantics than your data has,
i.e. instead of converting from one serialization (e.g. ASN.1) to RDF
you try to redesign your whole data model. The second risk is that you
assume you can throw out your old infrastructure once the SPARQL based
one is live. Assuming you can easily replace years of IT infrastructure
in e.g. GenBank with a year of work on a SPARQL endpoint is false. I
think it is relatively cheap to complement the existing infrastructure
with simple direct RDF and SPARQL. The reality is that a format once
published needs to be supported for a long time.

Will the choice for SPARQL affect all your users in their day to day
work? No, it's just a nicer pipette for the data analysts. They are
still going to complain about your data modeling and the bizarre
exceptions from 1981 that were never fixed, that their queries are too
slow and that your documentation is useless. We are dealing with humans
here; we can make things easier but they will still be struggling with
the really hard parts of data quality.

To conclude:
1. SPARQL is just cheaper for users than SQL or traditional solutions
(if those solutions don't exist yet).
2. The ideal SPARQL world is closer to data heaven than the ideal SQL world.

Hope you can use something of this long mail ;)

Regards,
Jerven Bolleman


1. Fasta, Fasta (canonical), flat-file, gff3, xml, CSV,  excel, list

http://www.slideshare.net/jervenbolleman/uni-protsparqlcloud

On 01/06/13 17:44, Maloney, Christopher (NIH/NLM/NCBI) [C] wrote:
> Hi, Jerven,
>
> Peter Cock pointed me in your direction.  I watched a video of a
> presentation you gave at the BioHackathon 2011
> (http://www.youtube.com/watch?v=AczWuWc4ua0) and it was very good.
> Thanks for making that available.
>
> I work for NCBI, and we have been looking into the possibility of
> providing more data in RDF format from some of our resources.  Our
> PubChem group has already begun (here, for example:
> https://pubchem.ncbi.nlm.nih.gov/rest/rdf/substance/SID2244).
>
> I am new to Semantic Web technologies, but have been trying to educate
> myself.  One question that seems to come up often is, are there examples
> of real, tangible benefits from these systems?  The policy-makers here
> are, in general, ruthlessly practical, and do not usually commit to
> something new unless it can be demonstrated clearly that our end-users
> will benefit.
>
> So, I am wondering if you can say that the Uniprot RDF deployment has
> produced such benefits, and if you have done any type of evaluations to
> demonstrate these?  Or, if you know of other bioinformatics or
> publishing projects out on the Internet that you could say have produced
> real value for end-users, above and beyond what might be achievable
> through more traditional technologies?
>
> Thanks for your time!
>
> Chris Maloney
>
> NIH/NLM/NCBI (Contractor)
>
> Building 45, 5AN.24D-22
>
> 301-594-2842
>


--
-------------------------------------------------------------------
 Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
 SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
 CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
 1211 Geneve 4,
 Switzerland     www.isb-sib.ch - www.uniprot.org
 Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------










-- 
Jerven Bolleman
me@jerven.eu
Received on Tuesday, 4 June 2013 16:10:40 UTC