Re: Slinging 'plus size' RDF (was Re: Storing RDF in a relational database) from Simon Spero on 2016-11-03 (semantic-web@w3.org from November 2016)

From: Simon Spero <sesuncedu@gmail.com>
Date: Wed, 2 Nov 2016 23:59:26 -0400
To: Bernadette Hyland <bhyland@3roundstones.com>
Cc: "Li, Ai-jun" <Ai-jun.Li@morganstanley.com>, semantic-web@w3.org, Andrew Woods <awoods@duraspace.org>, Linked Data Community <public-lod@w3.org>
Message-ID: <CADE8KM7_dXcczdkoexHSDCSG+nYLEoByGKX210io0JjOTu4VQw@mail.gmail.com>

1. Virtuoso is an SQL database designed to support SPARQL & RDF. If the
underlying dataset has a schema that is mostly regular, using tables can be a
big performance win over the straight triple store. Or it can be much
worse. Also, cold vs. warm caches require care when benchmarking (this
applies to just about every RDF store).

2. PubChem is quite a lovely dataset to work with when you only want some
of it (especially for non-bio work :)
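
A minimal sketch of the cold vs. warm point in item 1, assuming the query
is wrapped in a zero-argument callable. The names time_cold_and_warm and
make_fake_query are hypothetical and not part of any RDF store's API; the
fake query only simulates a cache so the sketch runs without a server. The
first call counts as "cold" only if the store (and ideally the OS page
cache) was freshly restarted before the run.

```python
import statistics
import time
from typing import Callable, Dict


def time_cold_and_warm(run_query: Callable[[], object],
                       warm_runs: int = 5) -> Dict[str, float]:
    """Time the first call separately from the median of repeated calls."""
    if warm_runs < 1:
        raise ValueError("warm_runs must be at least 1")

    # First call: only truly "cold" if the store was restarted beforehand.
    start = time.perf_counter()
    run_query()
    cold = time.perf_counter() - start

    # Later calls hit whatever caches the first call populated.
    warm = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        run_query()
        warm.append(time.perf_counter() - start)

    return {"cold_s": cold, "warm_median_s": statistics.median(warm)}


def make_fake_query(first_delay: float = 0.05) -> Callable[[], int]:
    """Stand-in for a real SPARQL call: slow once, then served from a dict."""
    cache: Dict[str, int] = {}

    def run() -> int:
        if "result" not in cache:
            time.sleep(first_delay)  # simulated disk read on a cold cache
            cache["result"] = 42
        return cache["result"]

    return run


timings = time_cold_and_warm(make_fake_query(), warm_runs=3)
print(timings)
```

Reporting the two numbers separately, rather than averaging them, keeps a
single slow first run from distorting the comparison between stores.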

Simon

On Nov 2, 2016 11:23 PM, "Bernadette Hyland" <bhyland@3roundstones.com>
wrote:

> Hi Andrew,
> I share this with two caveats: first, every app has unique requirements, and
> second, we all have a tendency to use technologies with which we’re
> familiar.
>
> In our case, we focus on linked data modeling and app development using
> Callimachus Enterprise.[1] Our team has OpenRDF Sesame chops, so we often
> use that store. Callimachus (OSS or Enterprise) is fanatical about
> RDF/SPARQL 1.1 compliance, and that is really the important part IMHO.
>
> Back to your question about slinging larger RDF bulk data - Recently, we
> needed to work with a data download (PubChem RDF weighs in at a hefty 99B
> triples), with a download size of about 40GB. The PubChem data stewards
> recommend that the database needs 64GB RAM and 500GB disk.[2]
>
> We thought we might blow the gaskets on OpenRDF Sesame, so we opted for
> OpenLink Software's Virtuoso.[3] We installed Virtuoso on an AWS large
> instance to reduce the 99B triples to the more manageable dataset
> of around 6B triples (chemical synonyms and descriptors) that we needed.
> It worked very well.
>
> That said, if we need to scale up or our client has a preference for a
> specific triple store, we develop the UI layer using the Callimachus Web
> application server, which speaks to any SPARQL 1.1 compliant triple store
> on the server, e.g., MarkLogic, Ontotext GraphDB, others.
>
> We’ll typically prototype with OpenRDF Sesame, because we know it well,
> and then scale as required. FWIW, it took one developer < 1 day to
> integrate with MarkLogic and GraphDB, because both of these database
> vendors are good about SPARQL 1.1 compliance.
>
> Note: We have no commercial relationship with any graph database company;
> in fact, we're database agnostic.
>
> In summary, we use Callimachus Enterprise to create applications using
> HTML5/CSS3, building named queries using SPARQL 1.1. For some apps, we’ll
> split up the data onto multiple OpenRDF Sesame instances, as required. If
> the customer wants to use / pay for a license to another SPARQL 1.1
> compliant persistent store, we’re all over it.
>
> Bottom line: If you go with vendors that make good on RDF/SPARQL 1.1
> standards compliance, you can sling some pretty hefty RDF and build nice
> UIs on top quickly.
>
> Anyone doing Linked Data beyond the prototyping phase is using some
> combination of OSS + commercially licensed products for the Web server/UI
> and persistent store layers.
>
> Hope that helps.
>
> Cheers,
>
> Bernadette Hyland
> bhyland@3roundstones.com ||  Skype BernHyland
>
> [1] http://callimachusproject.org/
>
> [2] https://pubchem.ncbi.nlm.nih.gov/rdf/#table2
>
> [3] https://virtuoso.openlinksw.com/dataspace/doc/(NULL)/wiki/Main/
>
>
> On Nov 3, 2016, at 00:38, Andrew Woods <awoods@duraspace.org> wrote:
>
> Hello Bernadette,
> Would you be willing to share the name of the triplestore implementation
> you are using to store 99B triples?
> Thanks,
> Andrew Woods
>
> On Wed, Nov 2, 2016 at 10:24 AM, Bernadette Hyland <
> bhyland@3roundstones.com> wrote:
>
>> Hi Ai-jun,
>> Not sure that storing RDF triples in a relational database is novel, at
>> least not in 2016. And 300M isn’t a big number in the world of graph
>> databases. For example, we’re working with a linked data repository,
>> PubChem with 99B triples, and linking it to a subset of environmental
>> linked open data. Point is, graph databases are a useful tool for specific
>> jobs, just like RDBMSs are great for other jobs.
>>
>> More importantly, getting triples out in a speedy manner, using a
>> standard query language, and building a nice UI, is the part many people in
>> the linked data community have spent 10+ years getting right.
>>
>> Just my 2 cents.
>>
>> Cheers,
>>
>> Bernadette Hyland
>> CEO, 3 Round Stones, Inc.
>>
>>
>>
>> On Nov 2, 2016, at 04:11, Li, Ai-jun <Ai-jun.Li@morganstanley.com> wrote:
>>
>>
>> I came across a very old request for comments on storing RDF data in a
>> relational database (http://infolab.stanford.edu/~melnik/rdf/db.html). I
>> was unable to find any newer discussion on this. We implemented a very
>> innovative way of storing linked graph data in Sybase many years ago, and
>> the system is still being used today. The system stores the equivalent
>> of over 300 million triples and can scale to many more. We’d be happy to
>> share our approach if this is something the community is still interested
>> in (we will need to get the firm’s approval, obviously).
>>
>> Thanks,
>> Ai-jun Li
>>
>> Morgan Stanley | Enterprise Infrastructure
>> 1 New York Plaza, 16th Floor | New York, NY  10004
>> Phone: +1 646 536-0765
>> Ai-jun.Li@morganstanley.com
>>
>>
>
>
Received on Thursday, 3 November 2016 04:00:01 UTC