Slinging 'plus size' RDF (was Re: Storing RDF in a relational database) from Bernadette Hyland on 2016-11-03 (semantic-web@w3.org from November 2016)

From: Bernadette Hyland <bhyland@3roundstones.com>
Date: Thu, 3 Nov 2016 13:11:38 +1000
To: Andrew Woods <awoods@duraspace.org>
Cc: "Li, Ai-jun" <Ai-jun.Li@morganstanley.com>, "semantic-web@w3.org" <semantic-web@w3.org>, Linked Data Community <public-lod@w3.org>
Message-Id: <4E7D00F6-C3E6-4672-B5D9-39A2EA44DFEF@3roundstones.com>
Hi Andrew,
I share this with two caveats: first, every app has unique requirements, and second, we all tend to use the technologies we’re familiar with.

In our case, we focus on linked data modeling and app development using Callimachus Enterprise.[1] Our team has OpenRDF Sesame chops, so we often use that store. Callimachus (OSS or Enterprise) is fanatical about RDF/SPARQL 1.1 compliance, and that is really the important part IMHO.

Back to your question about slinging larger RDF bulk data: recently, we needed to work with a data download of PubChem RDF, which weighs in at a hefty 99B triples with a download size of about 40GB. The PubChem data stewards recommend 64GB RAM and 500GB disk for the database.[2]

We thought we might blow the gaskets on OpenRDF Sesame, so we opted for OpenLink Software's Virtuoso.[3] We installed Virtuoso on an AWS large instance to manipulate the 99B triples down to the more manageable dataset of around 6B triples (chemical synonyms and descriptors) that we needed. It worked very well.
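
To illustrate the kind of reduction described above (keeping only the triples for selected predicates), here is a minimal Python sketch of a streaming N-Triples filter. It is an illustration under assumptions, not the Virtuoso-based process described in this message: the function name and the example.org predicate IRIs are hypothetical, and it assumes the dump is in N-Triples with one triple per line.

```python
from typing import Iterable, Iterator, Set


def filter_ntriples(lines: Iterable[str], keep_predicates: Set[str]) -> Iterator[str]:
    """Yield only the N-Triples lines whose predicate IRI is in keep_predicates.

    Lines are processed one at a time, so memory use stays flat
    regardless of how large the input is.
    """
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comments.
        if not stripped or stripped.startswith("#"):
            continue
        # Subject and predicate never contain whitespace in N-Triples,
        # so the first two whitespace-separated tokens are safe to split off.
        parts = stripped.split(None, 2)
        if len(parts) < 3:
            continue  # Malformed line; ignore it in this sketch.
        predicate = parts[1]
        if predicate.startswith("<") and predicate.endswith(">"):
            if predicate[1:-1] in keep_predicates:
                yield stripped


# Hypothetical sample data; real PubChem predicates differ.
sample = [
    '<http://example.org/c1> <http://example.org/synonym> "aspirin" .',
    '<http://example.org/c1> <http://example.org/mass> "180.16" .',
    "# a comment line",
    '<http://example.org/c2> <http://example.org/synonym> "acetylsalicylic acid" .',
]
kept = list(filter_ntriples(sample, {"http://example.org/synonym"}))
print(len(kept))
```

For a real dump, the same function could be fed an open file handle (for example one returned by `gzip.open(path, "rt")`) instead of a list.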

That said, if we need to scale up or our client has a preference for a specific triple store, we develop the UI layer using the Callimachus Web application server, which speaks to any SPARQL 1.1 compliant triple store on the server, e.g., MarkLogic, Ontotext GraphDB, others.

We’ll typically prototype with OpenRDF Sesame, because we know it well, and then scale as required. FWIW, it took one developer < 1 day to integrate with MarkLogic and GraphDB, because both of these database vendors are good about SPARQL 1.1 compliance.
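
The reason standards compliance makes swapping stores cheap can be sketched in a few lines of Python: under the SPARQL 1.1 Protocol, a query is an HTTP POST with a URL-encoded `query` parameter, so only the endpoint URL changes between stores. This is a sketch under assumptions, not Callimachus code: the function name and the endpoint URLs below are hypothetical placeholders, and real deployments will differ in paths and authentication.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint_url: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol query request (POST, URL-encoded form)."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            # Ask for the standard JSON results format.
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


# Hypothetical endpoint URLs; substitute the ones for your own stores.
ENDPOINTS = {
    "store-a": "http://localhost:8080/store-a/sparql",
    "store-b": "http://localhost:7200/store-b/sparql",
}

QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

# The same query is packaged identically for every store.
requests_by_store = {
    name: build_sparql_request(url, QUERY) for name, url in ENDPOINTS.items()
}
for name, req in requests_by_store.items():
    print(name, req.get_method(), req.full_url)
```

Passing one of these requests to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a live store.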

Note: We have no commercial relationship with any graph database company; in fact, we're database agnostic.

In summary, we use Callimachus Enterprise to create applications using HTML5/CSS3, building named queries using SPARQL 1.1. For some apps, we’ll split up the data onto multiple OpenRDF Sesame instances, as required. If the customer wants to use / pay for a license to another SPARQL 1.1 compliant persistent store, we’re all over it.

Bottom line: If you go with vendors that make good on RDF/SPARQL 1.1 standards compliance, you can sling some pretty hefty RDF and build nice UIs on top quickly.

Anyone doing Linked Data beyond the prototyping phase is using some combination of OSS + commercially licensed products for the Web server/UI and persistent store layers. 

Hope that helps.

Cheers,

Bernadette Hyland
bhyland@3roundstones.com ||  Skype BernHyland  

[1] http://callimachusproject.org/

[2] https://pubchem.ncbi.nlm.nih.gov/rdf/#table2

[3] https://virtuoso.openlinksw.com/dataspace/doc/(NULL)/wiki/Main/


> On Nov 3, 2016, at 00:38, Andrew Woods <awoods@duraspace.org> wrote:
> 
> Hello Bernadette,
> Would you be willing to share the name of the triplestore implementation you are using to store 99B triples?
> Thanks,
> Andrew Woods
> 
> On Wed, Nov 2, 2016 at 10:24 AM, Bernadette Hyland <bhyland@3roundstones.com> wrote:
> Hi Ai-jun,
> Not sure that storing RDF triples in a relational database is novel, at least not in 2016. And 300M isn’t a big number in the world of graph databases. For example, we’re working with a linked data repository, PubChem, with 99B triples, and linking it to a subset of environmental linked open data. The point is that graph databases are a useful tool for specific jobs, just as RDBMSs are great for other jobs.
> 
> More importantly, getting triples out in a speedy manner, using a standard query language, and building a nice UI, is the part many people in the linked data community have spent 10+ years getting right.
> 
> Just my 2 cents.
> 
> Cheers,
> 
> Bernadette Hyland
> CEO, 3 Round Stones, Inc.
> 
> 
> 
>> On Nov 2, 2016, at 04:11, Li, Ai-jun <Ai-jun.Li@morganstanley.com> wrote:
>> 
>> 
>> I came across a very old request for comments on storing RDF data in a relational database (http://infolab.stanford.edu/~melnik/rdf/db.html). I was unable to find any newer discussion on this. We implemented a very innovative way of storing linked graph data in Sybase many years ago, and the system is still being used today. The system is storing the equivalent of over 300 million triples and is scalable for much more. We’d be happy to share our approach if this is something the community is still interested in (will need to get the firm’s approval, obviously).
>>  
>> Thanks,
>> Ai-jun Li   
>> Morgan Stanley | Enterprise Infrastructure   
>> 1 New York Plaza, 16th Floor | New York, NY  10004   
>> Phone: +1 646 536-0765
>> Ai-jun.Li@morganstanley.com
> 
>
Received on Thursday, 3 November 2016 03:12:12 UTC