Re: Slinging 'plus size' RDF (was Re: Storing RDF in a relational database) from Kingsley Idehen on 2016-11-03 (public-lod@w3.org from November 2016)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Thu, 3 Nov 2016 10:45:02 -0400
To: Simon Spero <sesuncedu@gmail.com>
Cc: semantic-web@w3.org, Linked Data Community <public-lod@w3.org>
Message-ID: <634d1b0b-586e-1e35-1e87-59cbbc07bd45@openlinksw.com>
On 11/2/16 11:59 PM, Simon Spero wrote:
>
> 1. Virtuoso is an SQL database designed to support SPARQL & RDF. If
> the underlying dataset has a schema that is mostly regular, using tables
> can be a big performance win over the straight triple store. Or it can
> be much worse.  Also cold vs. warm caches require care when
> benchmarking (this applies to just about every RDF store).
>
> 2. PubChem is a quite lovely dataset to work with when you only want
> some of it (especially for non bio :) 
>
> Simon
>

Simon,

+1 to that :)

Fundamentally, Virtuoso is a demonstration of what's possible with both
SPARQL and SQL using a single high-performance RDBMS. It also leverages
an understanding of RDF-Language with regard to critical issues such as
data security and privacy, using attribute-based access controls.

I guess it's time to produce a few posts about how you can extend SQL
(one standard) using SPARQL (another standard) for powerful
data access and integration, without compromising security, privacy,
etc.

[1] http://kidehen.blogspot.com/2015/07/conceptual-data-virtualization-across.html
[2] https://www.linkedin.com/pulse/dbpedia-201604-edition-kingsley-uyi-idehen
[3] https://www.linkedin.com/pulse/reasoning-inference-using-british-royal-family-part-idehen
-- covers Custom (rather than in-built) Reasoning & Inference using
SPARQL as a Rules Language (note: this is part of the soon-to-be-released
8.0 Edition of Virtuoso).

Kingsley
>
> On Nov 2, 2016 11:23 PM, "Bernadette Hyland" <bhyland@3roundstones.com
> <mailto:bhyland@3roundstones.com>> wrote:
>
>     Hi Andrew,
>     I share this with the caveats that every app has unique
>     requirements, and second, we all have a tendency to use
>     technologies with which we’re familiar. 
>
>     In our case, we focus on linked data modeling and app development
>     using Callimachus Enterprise.[1] Our team has OpenRDF Sesame
>     chops, so we often use that store. Callimachus (OSS or
>     Enterprise) is fanatical about RDF/SPARQL 1.1 compliance, and
>     that is really the important part IMHO.
>
>     Back to your question about slinging larger RDF bulk data -
>     Recently, we needed to work with a data download (PubChem RDF
>     weighs in at a hefty 99B triples), with a download size of about
>     40GB. The PubChem data stewards recommend that the database needs
>     64GB RAM and 500GB disk.[2]
>
>     We thought we might blow the gaskets on OpenRDF Sesame, so we
>     opted for OpenLink Software's Virtuoso.[3] We installed Virtuoso
>     on an AWS large instance to manipulate the 99B triples down to the
>     more manageable dataset of around 6B triples (chemical synonyms
>     and descriptors), that we needed. Worked very well. 
>
>     That said, if we need to scale up or our client has a preference
>     for a specific triple store, we develop the UI layer using the
>     Callimachus Web application server, which speaks to any SPARQL 1.1
>     compliant triple store on the server, e.g., MarkLogic, Ontotext
>     GraphDB, others.
>
>     We’ll typically prototype with OpenRDF Sesame, because we know it
>     well, and then scale as required. FWIW, it took one developer < 1
>     day to integrate with MarkLogic and GraphDB, because both of
>     these database vendors are good about SPARQL 1.1 compliance.
>
>     Note: We have no commercial relationship with any graph database
>     company; in fact, we're database agnostic.
>
>     In summary, we use Callimachus Enterprise to create applications
>     using HTML5/CSS3, building named queries using SPARQL 1.1. For
>     some apps, we’ll split up the data onto multiple OpenRDF Sesame
>     instances, as required. If the customer wants to use / pay for a
>     license to another SPARQL 1.1 compliant persistent store, we’re
>     all over it.
>
>     Bottom line: If you go with vendors that make good on RDF/SPARQL
>     1.1 standards compliance, you can sling some pretty hefty RDF and
>     build nice UIs on top quickly.
>
>     Anyone doing Linked Data beyond the prototyping phase is using
>     some combination of OSS + commercially licensed products for the
>     Web server/UI and persistent store layers. 
>
>     Hope that helps.
>
>     Cheers,
>
>     Bernadette Hyland
>     bhyland@3roundstones.com <mailto:bhyland@3roundstones.com> ||
>      Skype BernHyland  
>
>     [1] http://callimachusproject.org/
>
>     [2] https://pubchem.ncbi.nlm.nih.gov/rdf/#table2
>
>     [3] https://virtuoso.openlinksw.com/dataspace/doc/(NULL)/wiki/Main/
>
>
>>     On Nov 3, 2016, at 00:38, Andrew Woods <awoods@duraspace.org
>>     <mailto:awoods@duraspace.org>> wrote:
>>
>>     Hello Bernadette,
>>     Would you be willing to share the name of the triplestore
>>     implementation you are using to store 99B triples?
>>     Thanks,
>>     Andrew Woods
>>
>>     On Wed, Nov 2, 2016 at 10:24 AM, Bernadette Hyland
>>     <bhyland@3roundstones.com <mailto:bhyland@3roundstones.com>> wrote:
>>
>>         Hi Ai-jun,
>>         Not sure that storing RDF triples in a relational database is
>>         novel, at least not in 2016. And 300M isn’t a big number in
>>         the world of graph databases. For example, we’re working with
>>         a linked data repository, PubChem with 99B triples, and
>>         linking it to a subset of environmental linked open data.
>>         Point is, graph databases are a useful tool for specific
>>         jobs, just as RDBMSs are great for other jobs.
>>
>>         More importantly, getting triples out in a speedy manner,
>>         using a standard query language, and building a nice UI, is
>>         the part many people in the linked data community have spent
>>         10+ years getting right.
>>
>>         Just my 2 cents.
>>
>>         Cheers,
>>
>>         Bernadette Hyland
>>         CEO, 3 Round Stones, Inc.
>>
>>
>>
>>>         On Nov 2, 2016, at 04:11, Li, Ai-jun
>>>         <Ai-jun.Li@morganstanley.com
>>>         <mailto:Ai-jun.Li@morganstanley.com>> wrote:
>>>
>>>
>>>         I came across a very old request for comments on storing
>>>         RDF data in a relational database
>>>         (http://infolab.stanford.edu/~melnik/rdf/db.html
>>>         <http://infolab.stanford.edu/%7Emelnik/rdf/db.html>). I was
>>>         unable to find any newer discussion on this. We had
>>>         implemented a very innovative way of storing linked graph
>>>         data in Sybase many years ago and the system is still being
>>>         used today. The system is storing the equivalent of over 300
>>>         million triples and is scalable for much more. We’d be happy
>>>         to share our approach if this is something the community is
>>>         still interested in (will need to get the firm’s approval,
>>>         obviously).
>>>          
>>>         Thanks,
>>>         Ai-jun Li   
>>>         *Morgan Stanley | Enterprise Infrastructure   
>>>         *1 New York Plaza, 16th Floor | New York, NY  10004   
>>>         Phone: +1 646 536-0765 <tel:%2B1%20646%20536-0765>   
>>>         Ai-jun.Li@morganstanley.com
>>>         <mailto:Ai-jun.Li@morganstanley.com>   
>>>            
>>>
>>>
>>
>>
>


-- 
Regards,

Kingsley Idehen       
Founder & CEO 
OpenLink Software   (Home Page: http://www.openlinksw.com)

Received on Thursday, 3 November 2016 14:45:42 UTC