- From: Jerven Bolleman <me@jerven.eu>
- Date: Tue, 4 Jun 2013 18:10:03 +0200
- To: HCLS <public-semweb-lifesci@w3.org>, public-lod@w3.org
- Message-ID: <CAHM_hUP2RGrwO-6f5eiFuBt91o6WbUM7u1EzXJFrX=NjiOC6gA@mail.gmail.com>
3rd try now without attachments, as some SPAM filter seems to reject this mail. Hope not everyone gets 4 copies now :( Regards, Jerven

---------- Forwarded message ----------
From: Jerven Bolleman <me@jerven.eu>
Date: Mon, Jun 3, 2013 at 9:57 PM
Subject: Fwd: Question about Semantic Web
To: public-semweb-lifesci@w3.org

Dear all,

In my role as a UniProt developer I was asked why one should use SPARQL+RDF. I thought the answer could be interesting for others on this list as well.

Regards,
Jerven

Hi Chris,

Thank you for your compliment. I will be giving another talk about this at the BioHackathon 2013 (http://2013.biohackathon.org/documents/symposium). I hope this will also be made available on YouTube by the kind DBCLS.

*** The following is my personal opinion only! I wear some rose-tinted glasses in relation to SPARQL. But that is just the blood from banging my head against the relational/flat-file walls. ***

I understand the NCBI policy makers. They have heard many of the benefits of the semweb before. "Use ASN.1, it's such a great standard." "Oh sorry, nearly nobody uses ASN.1; use this XML thing instead, it will be so easy to query your data with XPath." In the meantime most users use the flat-file GenBank or MEDLINE files... And you can't really deprecate a format once published (at least not without an outcry).

When I started at UniProt just over 5 years ago I thought the same. Oh great, file format number 8 [1], do we really need another one? (I can already hear the sigh coming from some of the experienced NCBI developers.) Today I say yes, and the RDF one is the future of the UniProt formats (far future, but future nonetheless).

Yet you must understand that using SPARQL or SQL is not an interesting change in terms of biological science. There is theoretically nothing possible using SPARQL that is not possible using SQL etc... or even clay tablets. The only thing that changes is the number of slaves, oops I mean PhD students, that are needed to get a result.
I claim SPARQL+RDF is more economically efficient in the aggregate than SQL, which is why I support this move. It is for the same reasons that programmers mostly moved from C/Fortran to Java or Perl, and then in part to Ruby and Python. It is really hard to argue that moving from C to Perl was necessary for science reasons. The clear truth is that not needing to worry about memory allocation or basic data structures allowed many more programs to be created. Sure, you lose some efficiency at the CPU level, but you gain massive efficiency at the programmer level. This is great, because the programmer gets more expensive every year while CPU ticks keep decreasing in price.

Back to the NCBI, where large databases keep growing in size and, even worse, in complexity. It is financially impossible for a small lab to fully integrate the knowledge contained in RefSeq or UniProt into their own data infrastructure, especially if we include the need to keep that data up to date. Just understand that these databases are nearly terabytes in size when uncompressed and stored in a relational database, with 100+ interlinked tables. And this is just 2 of the large-ish public databases. Even if you think this work is trivial, why would the NIH pay hundreds of small labs to do this work over and over again? And not just the NIH but all the other funding agencies as well, when they could fund 2 SPARQL endpoints that all of their users could use? Is this not a form of useful cloud computing?

But of course you could say: just make your SQL database available, like UCSC does for their genome browser. Many bioinformaticians would cheer this on. Yet there is one thing SPARQL has that SQL does not: SPARQL is practically standardized, while SQL is only theoretically standardized. See the differences between DB2, Virtuoso, Vertica, Oracle, MySQL and PostgreSQL in practical terms. Is it "show tables" or "SELECT table_name FROM user_tables"? Oh, it was LIST TABLES ;(.
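By contrast, the SPARQL counterpart of "show tables" is the same query on every compliant endpoint. A minimal version:

```sparql
# List every class (rdf:type value) used in the store,
# the SPARQL analogue of SQL's "show tables".
# Works unchanged on any SPARQL-compliant endpoint.
SELECT DISTINCT ?type
WHERE { ?s a ?type }
```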
Many SQL vendors don't even commit to supporting the ISO SQL standards. Compare this to the SPARQL world. IBM and Oracle both fully support SPARQL 1 (Oracle even in 2 databases! Spatial and NoSQL), as do YarcData (Cray), BigData (Systap), Virtuoso, the Apache Software Foundation, Sesame (2), Ontotext, Clark & Parsia, Markdata and many more I can't think of. And for each of them the "show tables" equivalent is the same: "SELECT DISTINCT ?type WHERE { ?s a ?type }". In the 5 years since standardization we actually have a lot of products that support the whole of the SPARQL standard, something the SQL world has not managed in 21! I expect that of the above list at least 8 will be fully SPARQL 1.1 compliant by the end of summer.

This means that a choice for a SPARQL database by the NCBI does not favor any database company. Also, one team may have certain requirements of their datastore that others do not. Yet all of the datastores present the exact same API to your users: SPARQL. Which means that if RefSeq needs solution A, then the PubMed team can use solution B without negatively impacting your querying users.

Lastly, as my included presentation shows, the final killer feature is the SERVICE keyword. Need to do analytical queries over two databases? No need to download all the data; just use their SPARQL endpoints and federated queries. In this case we used 2 different SPARQL solutions: UniProt using OWLIM, and I think ALLIE and ChEMBL using Virtuoso. The same works for querying between UniProt and Nature citation data, even though their endpoint is using software from The Stationery Office (5Store, hah I could think of one more).

Then what about the popularity of XML versus RDF? RDF for UniProt is close to matching the popularity of XML and might have exceeded it (I will have to look at the latest logs). The SPARQL endpoint only gets about 3,500 queries a day, but it has not been advertised or even linked from the main uniprot.org website.
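A minimal sketch of such a federated query using the SPARQL 1.1 SERVICE keyword. The remote endpoint URL and the triple pattern inside SERVICE are illustrative placeholders, not the actual UniProt or ChEMBL schemas; only the up: namespace is the real UniProt core vocabulary:

```sparql
PREFIX up: <http://purl.uniprot.org/core/>

# Join local UniProt proteins against a second, remote endpoint in a
# single query. <http://example.org/sparql> and the ?p pattern are
# placeholders for illustration only.
SELECT ?protein ?remoteFact
WHERE {
  ?protein a up:Protein .
  SERVICE <http://example.org/sparql> {
    ?remoteFact ?p ?protein .
  }
}
```

The point is that the client never downloads either dataset: each endpoint answers its own part of the pattern and the results are joined for you.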
This will stay this way for as long as the SPARQL endpoint is in beta (and as the hardware started throwing RAM errors last week, it might be a while ;( ). Yet those queries are answered, and most of them could not be answered with our full-text indexes on uniprot.org or even our production SQL databases. Most importantly, the SPARQL endpoint saved my bacon when an SAB member needed some very specific data pronto.

Of course you have one important question, and that is: what does it cost to provide a SPARQL endpoint? This is a good and valid question. The answer, of course, depends... On a greenfield project, given comparable experience among your staff, I think RDF+SPARQL is cheaper and more performant than a SQL approach.

Why is SPARQL cheaper than SQL when starting from scratch?
1. The graph nature of a SPARQL endpoint allows you to use it as a key-value store for your data at the same time as using it for your complex searches.
2. JSON-LD and SPARQL/JSON give you a cacheable API for your Web 2.0/Ajax website to use without custom programmer development.
3. You do not need to design a separate data interchange format; you can just use RDF.
4. Competitive tendering: moving from one SPARQL endpoint software to another is days of work, i.e. you get the same answer, the only difference is the speed at which you get it. Even using JPA or Hibernate, evaluating many SQL stores is not that easy!

Of course greenfield programming is rare and won't be the case for most projects at the NCBI. Yet even for old projects, providing SPARQL/RDF can be worth it. Firstly, it's not that expensive to provide RDF besides your existing XML: one intern can make a great XSLT in a few months. You can make your SQL database available via a SPARQL mapper. Even writing a SPARQL wrapper against CSV files is easy (days of work for a good programmer).

There are risks and costs involved in starting down the semantic web road. The first risk is to introduce more semantics than your data actually has, i.e.
instead of converting from one serialization (e.g. ASN.1) to RDF, you try to redesign your whole data model. The second risk is assuming you can throw out your old infrastructure once the SPARQL-based one is live. Assuming you can easily replace years of IT infrastructure in e.g. GenBank with a year of work on a SPARQL endpoint is false. I think it is relatively cheap to complement the existing infrastructure with simple, direct RDF and SPARQL. The reality is that a format, once published, needs to be supported for a long time.

Will the choice for SPARQL affect all your users in their day-to-day work? No, it's just a nicer pipette for the data analysts. They are still going to complain about your data modeling, the bizarre exceptions from 1981 that were never fixed, that their queries are too slow, and that your documentation is useless. We are dealing with humans here; we can make things easier, but they will still be struggling with the really hard parts of data quality.

To conclude:
1. SPARQL is just cheaper for users than SQL or traditional solutions (if those solutions don't exist yet).
2. The ideal SPARQL world is closer to data heaven than the ideal SQL world.

Hope you can use something of this long mail ;)

Regards,
Jerven Bolleman

1. Fasta, Fasta (canonical), flat-file, gff3, xml, CSV, excel, list

http://www.slideshare.net/jervenbolleman/uni-protsparqlcloud

On 01/06/13 17:44, Maloney, Christopher (NIH/NLM/NCBI) [C] wrote:
> Hi, Jerven,
>
> Peter Cock pointed me in your direction. I watched a video of a
> presentation you gave at the BioHackathon 2011
> (http://www.youtube.com/watch?v=AczWuWc4ua0) and it was very good.
> Thanks for making that available.
>
> I work for NCBI, and we have been looking into the possibility of
> providing more data in RDF format from some of our resources. Our
> PubChem group has already begun (here, for example:
> https://pubchem.ncbi.nlm.nih.gov/rest/rdf/substance/SID2244).
>
> I am new to Semantic Web technologies, but have been trying to educate
> myself. One question that seems to come up often is: are there examples
> of real, tangible benefits from these systems? The policy-makers here
> are, in general, ruthlessly practical, and do not usually commit to
> something new unless it can be demonstrated clearly that our end-users
> will benefit.
>
> So, I am wondering if you can say that the UniProt RDF deployment has
> produced such benefits, and if you have done any type of evaluation to
> demonstrate these? Or, if you know of other bioinformatics or
> publishing projects out on the Internet that you could say have produced
> real value for end-users, above and beyond what might be achievable
> through more traditional technologies?
>
> Thanks for your time!
>
> Chris Maloney
> NIH/NLM/NCBI (Contractor)
> Building 45, 5AN.24D-22
> 301-594-2842

--
-------------------------------------------------------------------
Jerven Bolleman                          Jerven.Bolleman@isb-sib.ch
SIB Swiss Institute of Bioinformatics    Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1                 Fax: +41 (0)22 379 58 58
1211 Geneve 4, Switzerland               www.isb-sib.ch - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------

--
Jerven Bolleman
me@jerven.eu
Received on Tuesday, 4 June 2013 16:10:40 UTC