Re: Storing RDF in a relational database from Martynas Jusevičius on 2016-11-02 (semantic-web@w3.org from November 2016)

From: Martynas Jusevičius <martynas@graphity.org>
Date: Wed, 2 Nov 2016 17:41:47 +0100
To: "Li, Ai-jun" <Ai-jun.Li@morganstanley.com>
Cc: Nathan Rixham <nathan@webr3.org>, Bernadette Hyland <bhyland@3roundstones.com>, "semantic-web@w3.org" <semantic-web@w3.org>
Message-ID: <CAE35Vmwm2V5XALn4WRX3u_a1oTP6Pu+mg9aUSvPq7V0PVt-o9g@mail.gmail.com>
Aijun,

another way to go about this would be to keep your relational structure and
then use OBDA (ontology-based data access) to provide a virtual RDF/SPARQL
layer. The most popular tool these days seems to be Ontop:
http://ontop.inf.unibz.it/

Performance may or may not be better :)
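
To make the OBDA idea a bit more concrete, here is a rough Python sketch of
what a virtual layer does: rows stay in the relational table and triples are
computed on demand. This is not Ontop's mapping syntax or API. The
personal_computer table, its columns and the example.org namespace are all
made up for illustration.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not from the thread


def demo_connection():
    """Create an in-memory table standing in for the existing relational data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE personal_computer ("
        "id INTEGER PRIMARY KEY, name TEXT, number_of_cpu INTEGER)"
    )
    conn.execute("INSERT INTO personal_computer VALUES (1, 'myPC', 4)")
    return conn


def virtual_triples(conn):
    """Yield (subject, predicate, object) tuples computed from rows on demand.

    Nothing is copied into a triple store; the mapping is applied at read time.
    """
    rows = conn.execute(
        "SELECT id, name, number_of_cpu FROM personal_computer ORDER BY id"
    )
    for ci_id, name, cpus in rows:
        subject = f"{BASE}ci/{ci_id}"
        yield (subject, f"{BASE}Is_A", f"{BASE}Personal_Computer")
        yield (subject, f"{BASE}name", name)
        yield (subject, f"{BASE}Number_Of_CPU", cpus)


if __name__ == "__main__":
    for triple in virtual_triples(demo_connection()):
        print(triple)
```

A real OBDA tool goes further and rewrites SPARQL queries into SQL instead of
enumerating rows, which is where the performance question comes in.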

On Wed, Nov 2, 2016 at 5:27 PM, Li, Ai-jun <Ai-jun.Li@morganstanley.com>
wrote:

> Hi All,
>
>
>
> Thanks for the interest. First, I must say that we didn't start by
> treating our data as graphs. Secondly, I got exposed to RDF very recently
> and still know very little about it. After being exposed to RDF, I realized
> that the way we are storing graph data handles provenance and reification
> very naturally. From what I understand, current RDF stores don't have a
> good way of handling these.
>
>
>
> It is easier to describe our approach by first analyzing and categorizing
> graph nodes and edges. If you look at any graph data in detail, you’ll find
> that not all nodes and edges are equal:
>
> · Some nodes are classifications of things, e.g. *Hardware* in
>   Personal Computer - Is_A - Hardware
> · Some nodes define types of things, e.g. *Personal Computer* in
>   myPC - Is_A - Personal Computer
> · Some nodes define instances of a type, e.g. *myPC* is an
>   instance of Personal Computer
> · Some nodes are properties/attributes of instances of things,
>   e.g. *4* in myPC - Number_Of_CPU - 4
> · Some edges describe relationships, e.g. *Owns* in Aijun - Owns - myPC
> · Some edges describe properties, e.g. *Number_Of_CPU* in
>   myPC - Number_Of_CPU - 4
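
Inline note: written as plain (subject, predicate, object) tuples, the
categories above can be told apart mechanically. The sketch below uses only
the examples from the list. The classification rules are my own naive reading
(a classification is an Is_A target whose source is itself an Is_A target, a
literal is any non-string value), not something taken from Aijun's system.

```python
TRIPLES = [
    ("Personal Computer", "Is_A", "Hardware"),
    ("myPC", "Is_A", "Personal Computer"),
    ("myPC", "Number_Of_CPU", 4),
    ("Aijun", "Owns", "myPC"),
]


def is_literal(value):
    """Assumption for this sketch: nodes are strings, property values are not."""
    return not isinstance(value, str)


def categorize(triples):
    """Sort nodes and edges into the six categories from the list above."""
    is_a = [(s, o) for s, p, o in triples if p == "Is_A"]
    class_like = {o for _, o in is_a}  # anything that something "is a"
    # A classification sits above a type: its Is_A source is itself class-like.
    classifications = {o for s, o in is_a if s in class_like}
    types = class_like - classifications
    nodes = {s for s, _, _ in triples} | {
        o for _, _, o in triples if not is_literal(o)
    }
    return {
        "classifications": classifications,
        "types": types,
        "instances": nodes - class_like,
        "property_values": {o for _, _, o in triples if is_literal(o)},
        # Is_A is treated as typing, not as a relationship edge.
        "relationship_edges": {
            p for _, p, o in triples if p != "Is_A" and not is_literal(o)
        },
        "property_edges": {p for _, p, o in triples if is_literal(o)},
    }


if __name__ == "__main__":
    for category, members in categorize(TRIPLES).items():
        print(category, sorted(map(str, members)))
```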
>
> Internally, we use the ITIL term Configuration Item (CI) to represent a
> thing. Corresponding to the above graph categorization, we have created the
> following main tables (not all tables are listed):
>
> *Table*: *Explanation*
>
> CI_Class_Category: Metadata, stores the categorization nodes
>
> CI_Classes: Metadata, stores the type nodes, has a foreign key to
> CI_Class_Category
>
> CI_Relationship_Templates: Metadata, stores the allowed kinds of
> relationship edges (i.e. define what kind of relationships are allowed, a
> relationship has a CI class on the left and another on the right, and a
> relationship name)
>
> CI_Properties: Metadata, stores the allowed/known property edges
>
> Data_Sources: Metadata, stores info about data provider (provenance)
>
> CI_Instances: *Main table*, stores instance nodes, contains: CI_Id,
> CI_Class_Id, Data_Source_Id, CI_Name, Create_Time, Last_Change_Time, etc.
>
> CI_Relationships: *Main table*, stores the relationship edges, contains:
> Relationship_Id, Left_CI_Id, Right_CI_Id, Relationship_Type_Id,
> Data_Source_Id, Create_Time, etc.
>
> CI_Property_Values: *Main table*, stores the property edges and values,
> contains: CI_Id, Property_Id, Property_Value, Data_Source_Id, timestamp
>
> CI_Relationship_Property_Values: *Main table*, stores properties that
> describe relationships (reification), contains: Relationship_Id,
> Rel_Property_Id, Property_Value, Data_Source_Id and timestamp
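
Inline note: a minimal sketch of how I read the provenance and reification
part of this layout, using SQLite from Python rather than Sybase. Table and
column names are the ones listed above. The column types, the keys, the Name
column on Data_Sources, and all ids and values in the sample data are
assumptions, and the metadata tables and timestamp columns are left out to
keep it short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Data_Sources (
    Data_Source_Id INTEGER PRIMARY KEY,
    Name TEXT                              -- assumed column
);
CREATE TABLE CI_Instances (
    CI_Id INTEGER PRIMARY KEY,
    CI_Class_Id INTEGER,
    Data_Source_Id INTEGER REFERENCES Data_Sources(Data_Source_Id),
    CI_Name TEXT
);
CREATE TABLE CI_Relationships (
    Relationship_Id INTEGER PRIMARY KEY,
    Left_CI_Id INTEGER REFERENCES CI_Instances(CI_Id),
    Right_CI_Id INTEGER REFERENCES CI_Instances(CI_Id),
    Relationship_Type_Id INTEGER,
    Data_Source_Id INTEGER REFERENCES Data_Sources(Data_Source_Id)
);
CREATE TABLE CI_Relationship_Property_Values (
    Relationship_Id INTEGER REFERENCES CI_Relationships(Relationship_Id),
    Rel_Property_Id INTEGER,
    Property_Value TEXT,
    Data_Source_Id INTEGER REFERENCES Data_Sources(Data_Source_Id)
);
"""


def build_demo():
    """Store 'Aijun - Owns - myPC' plus one property on that relationship."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Data_Sources VALUES (1, 'asset_inventory')")
    conn.executemany(
        "INSERT INTO CI_Instances VALUES (?, ?, ?, ?)",
        [(1, 10, 1, "Aijun"), (2, 20, 1, "myPC")],
    )
    # Relationship 100: type id 7 stands for 'Owns' (hypothetical id).
    conn.execute("INSERT INTO CI_Relationships VALUES (100, 1, 2, 7, 1)")
    # A statement about the relationship itself: property id 3 (hypothetical).
    conn.execute(
        "INSERT INTO CI_Relationship_Property_Values "
        "VALUES (100, 3, '2015-06-01', 1)"
    )
    return conn


def describe_relationship(conn, relationship_id):
    """Return the edge, its data source, and any properties attached to it."""
    return conn.execute(
        """
        SELECT l.CI_Name, r.Relationship_Type_Id, rt.CI_Name,
               s.Name, pv.Rel_Property_Id, pv.Property_Value
        FROM CI_Relationships r
        JOIN CI_Instances l ON l.CI_Id = r.Left_CI_Id
        JOIN CI_Instances rt ON rt.CI_Id = r.Right_CI_Id
        JOIN Data_Sources s ON s.Data_Source_Id = r.Data_Source_Id
        LEFT JOIN CI_Relationship_Property_Values pv
               ON pv.Relationship_Id = r.Relationship_Id
        WHERE r.Relationship_Id = ?
        """,
        (relationship_id,),
    ).fetchall()


if __name__ == "__main__":
    print(describe_relationship(build_demo(), 100))
```

Because every edge row has its own Relationship_Id and Data_Source_Id, a
statement about the edge is just another row keyed on that id, which seems to
be the "natural" reification Aijun is describing.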
>
>
>
> Today, we have over 200 CI classes (aka RDF types) and this number will
> keep on increasing. The number of database tables and their schemas,
> however, have stayed the same since day one.
>
>
>
> Data consumption is a challenge. If I want to get a report of all the
> Personal Computers together with their properties, for example, I would
> need to join the CI_Instances and CI_Property_Values tables many times (or
> do many sub-queries). Our solution is to de-normalize the data to a reporting
> database. The reporting database contains a materialized table for each CI
> class. The schemas for the tables in the reporting database are created
> automatically based on CI class metadata. Data population and changes are
> triggered by changes in the normalized database.
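
Inline note: the de-normalization step could look roughly like the pivot
below, again sketched with SQLite from Python. The columns of CI_Properties
(Property_Id, CI_Class_Id, Property_Name), the class id 20, the Memory_GB
property and the reporting table name are assumptions for illustration. It
rebuilds the whole table on each call, whereas the system described above
applies changes incrementally via triggers.

```python
import sqlite3


def build_demo():
    """Small normalized store with two Personal Computer instances."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE CI_Properties (
            Property_Id INTEGER PRIMARY KEY, CI_Class_Id INTEGER,
            Property_Name TEXT);
        CREATE TABLE CI_Instances (
            CI_Id INTEGER PRIMARY KEY, CI_Class_Id INTEGER, CI_Name TEXT);
        CREATE TABLE CI_Property_Values (
            CI_Id INTEGER, Property_Id INTEGER, Property_Value TEXT);
        INSERT INTO CI_Properties VALUES
            (1, 20, 'Number_Of_CPU'), (2, 20, 'Memory_GB');
        INSERT INTO CI_Instances VALUES (1, 20, 'myPC'), (2, 20, 'otherPC');
        INSERT INTO CI_Property_Values VALUES
            (1, 1, '4'), (1, 2, '16'), (2, 1, '8');
        """
    )
    return conn


def materialize_class(conn, class_id, table_name):
    """Create one wide reporting table for a CI class from its metadata.

    Each known property becomes a column via conditional aggregation, so the
    report needs a single join instead of one join per property.
    """
    props = conn.execute(
        "SELECT Property_Id, Property_Name FROM CI_Properties "
        "WHERE CI_Class_Id = ? ORDER BY Property_Id",
        (class_id,),
    ).fetchall()
    select_list = ["i.CI_Id AS CI_Id", "i.CI_Name AS CI_Name"]
    for prop_id, name in props:
        # Ids are forced to int before being inlined into the SQL text.
        select_list.append(
            f"MAX(CASE WHEN v.Property_Id = {int(prop_id)} "
            f'THEN v.Property_Value END) AS "{name}"'
        )
    conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
    conn.execute(
        f'CREATE TABLE "{table_name}" AS '
        f"SELECT {', '.join(select_list)} "
        "FROM CI_Instances i "
        "LEFT JOIN CI_Property_Values v ON v.CI_Id = i.CI_Id "
        f"WHERE i.CI_Class_Id = {int(class_id)} "
        "GROUP BY i.CI_Id, i.CI_Name"
    )


if __name__ == "__main__":
    conn = build_demo()
    materialize_class(conn, 20, "Report_Personal_Computer")
    for row in conn.execute("SELECT * FROM Report_Personal_Computer"):
        print(row)
```

An OBDA mapping would expose the same per-class view virtually instead of
materializing it, which is the trade-off I was hinting at above.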
>
>
>
> I realize that this may not be a very clear explanation of our approach.
> Hopefully you can get the general idea.
>
>
>
> Cheers,
>
> Aijun
>
>
>
>
>
>
>
> *From:* Nathan Rixham [mailto:nathan@webr3.org]
> *Sent:* Wednesday, November 02, 2016 11:24 AM
> *To:* Bernadette Hyland
> *Cc:* Li, Ai-jun (Enterprise Infrastructure); semantic-web@w3.org
> *Subject:* Re: Storing RDF in a relational database
>
>
>
> There are still many environments where custom / non pre-installed
> software is not available, environments where it would often be useful to
> have smaller graphs of 1m-1b triples. The vast majority of these provide
> RDBMS or Document-Object stores. Hence any proven approaches would be
> useful to have publicly available.
>
>
>
> On Wed, Nov 2, 2016 at 2:24 PM, Bernadette Hyland <
> bhyland@3roundstones.com> wrote:
>
> Hi Ai-jun,
>
> Not sure that storing RDF triples in a relational database is novel, at
> least not in 2016. And 300M isn’t a big number in the world of graph
> databases. For example, we’re working with a linked data repository,
> PubChem with 99B triples, and linking it to a subset of environmental
> linked open data. Point is, graph databases are a useful tool for specific
> jobs, just like RDBMS’s are great for other jobs.
>
>
>
> More importantly, getting triples out in a speedy manner, using a standard
> query language, and building a nice UI, is the part many people in the
> linked data community have spent 10+ years getting right.
>
>
>
> Just my 2 cents.
>
>
>
> Cheers,
>
>
>
> Bernadette Hyland
>
> CEO, 3 Round Stones, Inc.
>
>
>
>
>
>
>
> On Nov 2, 2016, at 04:11, Li, Ai-jun <Ai-jun.Li@morganstanley.com> wrote:
>
>
>
>
>
> I came across a very old request for comments for storing RDF data in
> relational database (http://infolab.stanford.edu/~melnik/rdf/db.html). I
> was unable to find any newer discussion on this. We had implemented a very
> innovative way of storing linked graph data in Sybase many years ago and
> the system is still being used today. The system is storing the equivalent
> of over 300 million triples and is scalable for much more. We’d be happy to
> share our approach if this is something the community is still interested
> in (will need to get the firm’s approval, obviously).
>
>
>
> Thanks,
>
> Ai-jun Li
>
> *Morgan Stanley | Enterprise Infrastructure    *1 New York Plaza, 16th
> Floor | New York, NY  10004
> Phone: +1 646 536-0765
> Ai-jun.Li@morganstanley.com
>
>
>
>
> ------------------------------
>
>
> NOTICE: Morgan Stanley is not acting as a municipal advisor and the
> opinions or views contained herein are not intended to be, and do not
> constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall
> Street Reform and Consumer Protection Act. If you have received this
> communication in error, please destroy all electronic and paper copies and
> notify the sender immediately. Mistransmission is not intended to waive
> confidentiality or privilege. Morgan Stanley reserves the right, to the
> extent permitted under applicable law, to monitor electronic
> communications. This message is subject to terms available at the following
> link: http://www.morganstanley.com/disclaimers  If you cannot access
> these links, please notify us by reply message and we will send the
> contents to you. By communicating with Morgan Stanley you consent to the
> foregoing and to the voice recording of conversations with personnel of
> Morgan Stanley.
>
>
Received on Wednesday, 2 November 2016 16:42:22 UTC