RE: Storing RDF in a relational database

Hi All,



Thanks for the interest. First, I must say that we didn't start by treating our data as graphs. Secondly, I got exposed to RDF very recently and still know very little about it. After being exposed to RDF, I realized that the way we are storing graph data handles provenance and reification very naturally. From what I understand, current RDF stores don't have a good way of handling these.



It is easier to describe our approach by first analyzing and categorizing graph nodes and edges. If you look at any graph data in detail, you’ll find that not all nodes and edges are equal:

- Some nodes are classifications of things, e.g. Hardware in Personal Computer - Is_A - Hardware
- Some nodes define types of things, e.g. Personal Computer in myPC - Is_A - Personal Computer
- Some nodes define instances of a type, e.g. myPC is an instance of Personal Computer
- Some nodes are properties/attributes of instances of things, e.g. 4 in myPC - Number_Of_CPU - 4
- Some edges describe relationships, e.g. Aijun - Owns - myPC
- Some edges describe properties, e.g. Number_Of_CPU in myPC - Number_Of_CPU - 4

Internally, we use the ITIL term Configuration Item (CI) to represent a thing. Corresponding to the above graph categorization, we have created the following main tables (not all tables are listed):
Table | Explanation
------|------------
CI_Class_Category | Metadata, stores the categorization nodes
CI_Classes | Metadata, stores the type nodes, has a foreign key to CI_Class_Category
CI_Relationship_Templates | Metadata, stores the allowed kinds of relationship edges (i.e. defines what kinds of relationships are allowed; a relationship has a CI class on the left, another on the right, and a relationship name)
CI_Properties | Metadata, stores the allowed/known property edges
Data_Sources | Metadata, stores info about the data provider (provenance)
CI_Instances | Main table, stores instance nodes, contains: CI_Id, CI_Class_Id, Data_Source_Id, CI_Name, Create_Time, Last_Change_Time, etc.
CI_Relationships | Main table, stores the relationship edges, contains: Relationship_Id, Left_CI_Id, Right_CI_Id, Relationship_Type_Id, Data_Source_Id, Create_Time, etc.
CI_Property_Values | Main table, stores the property edges and values, contains: CI_Id, Property_Id, Property_Value, Data_Source_Id, timestamp
CI_Relationship_Property_Values | Main table, stores properties that describe relationships (reification), contains: Relationship_Id, Rel_Property_Id, Property_Value, Data_Source_Id and timestamp
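
To make the table list more concrete, here is a rough sketch of how it might look as a schema. It is only an illustration: SQLite (via Python's sqlite3) stands in for Sybase, and every column type, key, and any column not named above (Category_Name, Class_Name, Property_Name, Source_Name, the left/right class columns) is an assumption. The 'People' category, the 'inventory_feed' source and the relationship property value are made-up sample data.

```python
import sqlite3

# Assumed schema: table names and the listed columns come from the message;
# column types, keys and the remaining columns are guesses for illustration.
SCHEMA = """
CREATE TABLE CI_Class_Category (
    Category_Id INTEGER PRIMARY KEY, Category_Name TEXT NOT NULL);
CREATE TABLE CI_Classes (
    CI_Class_Id INTEGER PRIMARY KEY,
    Category_Id INTEGER NOT NULL REFERENCES CI_Class_Category(Category_Id),
    Class_Name TEXT NOT NULL);
CREATE TABLE CI_Relationship_Templates (
    Relationship_Type_Id INTEGER PRIMARY KEY,
    Left_CI_Class_Id INTEGER NOT NULL REFERENCES CI_Classes(CI_Class_Id),
    Right_CI_Class_Id INTEGER NOT NULL REFERENCES CI_Classes(CI_Class_Id),
    Relationship_Name TEXT NOT NULL);
CREATE TABLE CI_Properties (
    Property_Id INTEGER PRIMARY KEY, Property_Name TEXT NOT NULL);
CREATE TABLE Data_Sources (
    Data_Source_Id INTEGER PRIMARY KEY, Source_Name TEXT NOT NULL);
CREATE TABLE CI_Instances (
    CI_Id INTEGER PRIMARY KEY,
    CI_Class_Id INTEGER NOT NULL REFERENCES CI_Classes(CI_Class_Id),
    Data_Source_Id INTEGER NOT NULL REFERENCES Data_Sources(Data_Source_Id),
    CI_Name TEXT NOT NULL,
    Create_Time TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TABLE CI_Relationships (
    Relationship_Id INTEGER PRIMARY KEY,
    Left_CI_Id INTEGER NOT NULL REFERENCES CI_Instances(CI_Id),
    Right_CI_Id INTEGER NOT NULL REFERENCES CI_Instances(CI_Id),
    Relationship_Type_Id INTEGER NOT NULL
        REFERENCES CI_Relationship_Templates(Relationship_Type_Id),
    Data_Source_Id INTEGER NOT NULL REFERENCES Data_Sources(Data_Source_Id),
    Create_Time TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TABLE CI_Property_Values (
    CI_Id INTEGER NOT NULL REFERENCES CI_Instances(CI_Id),
    Property_Id INTEGER NOT NULL REFERENCES CI_Properties(Property_Id),
    Property_Value TEXT,
    Data_Source_Id INTEGER NOT NULL REFERENCES Data_Sources(Data_Source_Id),
    Timestamp TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TABLE CI_Relationship_Property_Values (
    Relationship_Id INTEGER NOT NULL REFERENCES CI_Relationships(Relationship_Id),
    Rel_Property_Id INTEGER NOT NULL,  -- its metadata table is not listed above
    Property_Value TEXT,
    Data_Source_Id INTEGER NOT NULL REFERENCES Data_Sources(Data_Source_Id),
    Timestamp TEXT DEFAULT CURRENT_TIMESTAMP);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Metadata: categories, classes, one allowed relationship, one known property.
conn.executemany("INSERT INTO CI_Class_Category VALUES (?, ?)",
                 [(1, "Hardware"), (2, "People")])
conn.executemany("INSERT INTO CI_Classes VALUES (?, ?, ?)",
                 [(1, 1, "Personal Computer"), (2, 2, "Person")])
conn.execute("INSERT INTO CI_Relationship_Templates VALUES (1, 2, 1, 'Owns')")
conn.execute("INSERT INTO CI_Properties VALUES (1, 'Number_Of_CPU')")
conn.execute("INSERT INTO Data_Sources VALUES (1, 'inventory_feed')")

# Data: myPC - Number_Of_CPU - 4, and Aijun - Owns - myPC.
conn.executemany(
    "INSERT INTO CI_Instances (CI_Id, CI_Class_Id, Data_Source_Id, CI_Name) "
    "VALUES (?, ?, ?, ?)", [(1, 1, 1, "myPC"), (2, 2, 1, "Aijun")])
conn.execute(
    "INSERT INTO CI_Property_Values (CI_Id, Property_Id, Property_Value, "
    "Data_Source_Id) VALUES (1, 1, '4', 1)")
conn.execute(
    "INSERT INTO CI_Relationships (Relationship_Id, Left_CI_Id, Right_CI_Id, "
    "Relationship_Type_Id, Data_Source_Id) VALUES (1, 2, 1, 1, 1)")
# A statement about the relationship itself (reification), sample value only.
conn.execute(
    "INSERT INTO CI_Relationship_Property_Values (Relationship_Id, "
    "Rel_Property_Id, Property_Value, Data_Source_Id) VALUES (1, 1, '2015-06-01', 1)")

# Read a property edge back as subject, predicate, object plus provenance.
triples = conn.execute("""
    SELECT i.CI_Name, p.Property_Name, v.Property_Value, s.Source_Name
    FROM CI_Property_Values v
    JOIN CI_Instances i ON i.CI_Id = v.CI_Id
    JOIN CI_Properties p ON p.Property_Id = v.Property_Id
    JOIN Data_Sources s ON s.Data_Source_Id = v.Data_Source_Id
""").fetchall()

# Read the relationship edge together with the property attached to it.
reified = conn.execute("""
    SELECT l.CI_Name, t.Relationship_Name, r.CI_Name, rv.Property_Value
    FROM CI_Relationships rel
    JOIN CI_Relationship_Templates t
      ON t.Relationship_Type_Id = rel.Relationship_Type_Id
    JOIN CI_Instances l ON l.CI_Id = rel.Left_CI_Id
    JOIN CI_Instances r ON r.CI_Id = rel.Right_CI_Id
    JOIN CI_Relationship_Property_Values rv
      ON rv.Relationship_Id = rel.Relationship_Id
""").fetchall()
```

Because every data row carries a Data_Source_Id, and every relationship has its own Relationship_Id that other rows can point at, provenance and statements about statements need no extra modelling in this layout.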


Today, we have over 200 CI classes (aka RDF types) and this number will keep increasing. The number of database tables and their schemas, however, have stayed the same since day one.

Data consumption is a challenge. If I want to get a report of all the Personal Computers together with their properties, for example, I would need to join the CI_Instances and CI_Property_Values tables many times (or do many sub-queries). Our solution is to denormalize the data into a reporting database. The reporting database contains a materialized table for each CI class. The schemas for the tables in the reporting database are created automatically based on CI class metadata. Data population and changes are triggered by changes in the normalized database.
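
As a rough illustration of the denormalization step, the sketch below pivots property rows into one wide table per CI class. It is not our production code: it uses SQLite through Python's sqlite3, the helper names (quote_ident, build_report_table), the report table name, the Memory_GB property and the labPC instance are all hypothetical, and it derives the column list from the property values that exist for the class, which is an assumption about how the class metadata is used. It also rebuilds the table in full rather than reacting to individual changes.

```python
import sqlite3


def quote_ident(name):
    """Quote a table or column name so it can be placed in generated SQL."""
    return '"' + name.replace('"', '""') + '"'


def build_report_table(conn, class_id, table_name):
    """Rebuild one wide reporting table for a CI class (full refresh)."""
    # Assumption: the columns are the properties seen on instances of the class.
    props = conn.execute(
        "SELECT DISTINCT p.Property_Id, p.Property_Name "
        "FROM CI_Properties p "
        "JOIN CI_Property_Values v ON v.Property_Id = p.Property_Id "
        "JOIN CI_Instances i ON i.CI_Id = v.CI_Id "
        "WHERE i.CI_Class_Id = ? ORDER BY p.Property_Id",
        (class_id,),
    ).fetchall()

    table = quote_ident(table_name)
    col_defs = ["CI_Id INTEGER PRIMARY KEY", "CI_Name TEXT"]
    col_defs += [quote_ident(name) + " TEXT" for _, name in props]

    # One MAX(CASE ...) expression per property turns rows into columns.
    select_list = ["i.CI_Id", "i.CI_Name"]
    select_list += [
        "MAX(CASE WHEN v.Property_Id = %d THEN v.Property_Value END)" % int(pid)
        for pid, _ in props
    ]

    conn.execute("DROP TABLE IF EXISTS " + table)
    conn.execute("CREATE TABLE " + table + " (" + ", ".join(col_defs) + ")")
    conn.execute(
        "INSERT INTO " + table + " SELECT " + ", ".join(select_list) + " "
        "FROM CI_Instances i "
        "LEFT JOIN CI_Property_Values v ON v.CI_Id = i.CI_Id "
        "WHERE i.CI_Class_Id = ? GROUP BY i.CI_Id, i.CI_Name",
        (class_id,),
    )


# Minimal normalized tables with only the columns this sketch needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CI_Properties (Property_Id INTEGER PRIMARY KEY, Property_Name TEXT);
CREATE TABLE CI_Instances (CI_Id INTEGER PRIMARY KEY, CI_Class_Id INTEGER, CI_Name TEXT);
CREATE TABLE CI_Property_Values (CI_Id INTEGER, Property_Id INTEGER, Property_Value TEXT);
INSERT INTO CI_Properties VALUES (1, 'Number_Of_CPU'), (2, 'Memory_GB');
INSERT INTO CI_Instances VALUES (1, 1, 'myPC'), (2, 1, 'labPC');
INSERT INTO CI_Property_Values VALUES (1, 1, '4'), (1, 2, '16'), (2, 1, '8');
""")

build_report_table(conn, 1, "Report_Personal_Computer")
rows = conn.execute(
    "SELECT * FROM Report_Personal_Computer ORDER BY CI_Id").fetchall()
```

With a table like this, the Personal Computer report becomes a plain single-table SELECT; an instance that has no value for a property simply gets NULL in that column.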

I realize that this may not be a very clear explanation of our approach. Hopefully you can get the general idea.

Cheers,
Aijun



From: Nathan Rixham [mailto:nathan@webr3.org]
Sent: Wednesday, November 02, 2016 11:24 AM
To: Bernadette Hyland
Cc: Li, Ai-jun (Enterprise Infrastructure); semantic-web@w3.org
Subject: Re: Storing RDF in a relational database

There are still many environments where custom / non pre-installed software is not available, environments where it would often be useful to have smaller graphs of 1m-1b triples. The vast majority of these provide RDBMS or Document-Object stores. Hence any proven approaches would be useful to have publicly available.

On Wed, Nov 2, 2016 at 2:24 PM, Bernadette Hyland <bhyland@3roundstones.com> wrote:
Hi Ai-jun,
Not sure that storing RDF triples in a relational database is novel, at least not in 2016. And 300M isn’t a big number in the world of graph databases. For example, we’re working with a linked data repository, PubChem with 99B triples, and linking it to a subset of environmental linked open data. Point is, graph databases are a useful tool for specific jobs, just like RDBMS’s are great for other jobs.

More importantly, getting triples out in a speedy manner, using a standard query language, and building a nice UI, is the part many people in the linked data community have spent 10+ years getting right.

Just my 2 cents.

Cheers,

Bernadette Hyland
CEO, 3 Round Stones, Inc.



On Nov 2, 2016, at 04:11, Li, Ai-jun <Ai-jun.Li@morganstanley.com> wrote:


I came across a very old request for comments on storing RDF data in a relational database (http://infolab.stanford.edu/~melnik/rdf/db.html). I was unable to find any newer discussion on this. We implemented a very innovative way of storing linked graph data in Sybase many years ago and the system is still being used today. The system is storing the equivalent of over 300 million triples and can scale to much more. We’d be happy to share our approach if this is something the community is still interested in (will need to get the firm’s approval, obviously).

Thanks,
Ai-jun Li
Morgan Stanley | Enterprise Infrastructure
1 New York Plaza, 16th Floor | New York, NY  10004
Phone: +1 646 536-0765
Ai-jun.Li@morganstanley.com



________________________________

NOTICE: Morgan Stanley is not acting as a municipal advisor and the opinions or views contained herein are not intended to be, and do not constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall Street Reform and Consumer Protection Act. If you have received this communication in error, please destroy all electronic and paper copies and notify the sender immediately. Mistransmission is not intended to waive confidentiality or privilege. Morgan Stanley reserves the right, to the extent permitted under applicable law, to monitor electronic communications. This message is subject to terms available at the following link: http://www.morganstanley.com/disclaimers  If you cannot access these links, please notify us by reply message and we will send the contents to you. By communicating with Morgan Stanley you consent to the foregoing and to the voice recording of conversations with personnel of Morgan Stanley.





Received on Wednesday, 2 November 2016 16:27:41 UTC