Re: [External] : Future-proof modelling from Souripriya Das on 2023-01-23 (public-rdf-star-wg@w3.org from January 2023)

From: Souripriya Das <souripriya.das@oracle.com>
Date: Mon, 23 Jan 2023 13:46:51 +0000
To: Pierre-Antoine Champin <pierre-antoine@w3.org>
CC: RDF-star WG <public-rdf-star-wg@w3.org>
Message-ID: <SN4PR10MB562247B1BA1CEC62FD8A417BFACA9@SN4PR10MB5622.namprd10.prod.outlook.com>
Hi Pierre-Antoine,

Thank you for asking these questions. Let me try to answer the ones relevant to the first part of your email. Let's first see if we can agree that there is a problem with multi-edge handling that does not have a good solution today. Unless we can agree on that, there is no point in pulling our hair out analyzing the merits and demerits of the proposed (named-triples-based) solution.

In short:
If the user wants to change the schema (domains and ranges of properties), nobody can guarantee future-proofing. On the other hand, if new data is added that follows the existing definitions (domain and range) of properties, future-proofing should be preserved and use of named triples provides a way to do that.

Details:

Example 1:
The relational model is strong and powerful for structured data, but moving seamlessly from a single-valued attribute (or its equivalent in terms of foreign keys) to a multi-valued one is not its strong point. People try to handle such changes by defining views over the tables in the changed schema, but that approach has its own issues. Let us focus on RDF. It is supposed to be flexible, and it can easily handle moving from single-valued to multi-valued properties. Yet statements about statements, which are easy to express in the relational model, are hard to represent in RDF. Additionally, RDF's "no duplicate triples" constraint prevents multi-edges unless the data architect takes special measures. RDF-star comes along and makes statements about statements easy (by using << s p o >> to identify a triple and allowing this "identifier" to be used as the subject or object of other triples), but it does not handle multi-edges well (it needs an extra triple that uses the :occurrenceOf property). So there are shortcomings in the relational model, RDF, and RDF-star. RDFn inherits the flexibility of RDF and the ability to represent statements about statements from RDF-star, and then makes it easy to represent multi-edges. This shows up in its ability to future-proof applications based on RDFn data and queries, even in the face of unanticipated transitions of properties to multi-edge.
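
The "no duplicate triples" point and the :occurrenceOf workaround can be illustrated with a minimal Python sketch. It models a graph as a plain set of tuples and uses no real RDF library. The node names _:m1 and _:m2, the :startYear property, and the year values are hypothetical, chosen only for illustration.

```python
# Hypothetical model: an RDF graph as a Python set of (subject, predicate, object) tuples.
graph = set()
graph.add((":Taylor", ":married", ":Burton"))
graph.add((":Taylor", ":married", ":Burton"))  # a second marriage is silently absorbed
plain_count = len(graph)  # set semantics: still one triple

# RDF-star-style workaround: the triple itself is used as a term, and each
# occurrence is a separate node linked to it by an extra :occurrenceOf triple.
triple = (":Taylor", ":married", ":Burton")
star = {triple}
star.add(("_:m1", ":occurrenceOf", triple))
star.add(("_:m2", ":occurrenceOf", triple))
star.add(("_:m1", ":startYear", 1964))  # illustrative annotation values
star.add(("_:m2", ":startYear", 1975))

# Counting the marriages now means counting the occurrence nodes.
occurrences = {s for (s, p, o) in star if p == ":occurrenceOf" and o == triple}
```

In this sketch the multi-edge is only recoverable through the extra occurrence nodes, which is the additional modelling step the paragraph above refers to.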

Example 2:
In this example, moving from a string to a record (with fields for individual components) is triggered by the desire of the user (e.g., the data architect) to change the schema: the range of a property (:postalAddress) goes from the string datatype to a class, say :PostalAddress, which is the domain for properties such as :streetNumber, :streetName, :city, :state, and even :geocode. The property :city may have the class :City as its range. With this change in the schema, the pre-existing queries, which expect a string as the value of :postalAddress, are completely broken and have to be rewritten and tested before being made available for use. Until that happens, the old-schema versions of the data and queries have to stay in place so that users of the application suffer no disruption. (I hope no one imagines that one can arbitrarily remodel their data and still have the pre-existing queries guaranteed to remain valid.)
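
A small Python sketch of why such a range change breaks existing queries follows. The function city_of stands in for a pre-existing query, and the data (:alice, _:addr1, the address string) is invented for illustration. It assumes the old string had the form "street, city".

```python
def city_of(graph, person):
    """Stand-in for a pre-existing query: assumes :postalAddress is a 'street, city' string."""
    for s, p, o in graph:
        if s == person and p == ":postalAddress":
            return o.split(",")[-1].strip()
    return None

# Old schema: the range of :postalAddress is a string.
old_graph = {(":alice", ":postalAddress", "1 Main St, Springfield")}

# New schema: the range of :postalAddress is a node of class :PostalAddress.
new_graph = {
    (":alice", ":postalAddress", "_:addr1"),
    ("_:addr1", ":city", ":Springfield"),
}

old_answer = city_of(old_graph, ":alice")  # the city, as intended
new_answer = city_of(new_graph, ":alice")  # the node label, not a city
```

Here the old query does not even fail loudly after the remodelling; it returns the address node's label, so it has to be rewritten against the new schema.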

Contrast the above with the type of situation I was trying to illustrate in my slides. There, the new data that is coming in is using the same properties with the same respective domains and ranges as before, but its arrival has caused the occurrence counts to go from one to greater-than-one for some of the properties (e.g., suppose that we just found out that :Taylor :married :Burton a second time -- something that never happened to the :married property before this). This addition became a reality in the world being modeled -- the data architect has no control over this. Such changes can and should be handled as seamlessly as possible -- pre-existing queries should retain their validity despite those changes. This is where named triples -- with support for both implicit and explicit names -- would come in handy.
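
The contrast can be sketched in Python under a loudly stated assumption: the model below is not RDFn's actual syntax or semantics, only one possible reading of "implicit and explicit names". A graph is a set of (name, s, p, o) tuples, default_name is a hypothetical implicit name derived from the triple, and :marriage2 is a hypothetical explicit name.

```python
def default_name(s, p, o):
    # Hypothetical implicit name: a deterministic function of the triple itself.
    return f"<<{s} {p} {o}>>"

def add(graph, s, p, o, name=None):
    # Without an explicit name, the triple gets its implicit (default) name.
    graph.add((name or default_name(s, p, o), s, p, o))

def spouses(graph, person):
    # Stand-in for a pre-existing query: it never looks at triple names.
    return {o for (_name, s, p, o) in graph if s == person and p == ":married"}

graph = set()
add(graph, ":Taylor", ":married", ":Burton")
before = spouses(graph, ":Taylor")

# New data arrives: the same property, same domain and range, a second occurrence.
add(graph, ":Taylor", ":married", ":Burton", name=":marriage2")
after = spouses(graph, ":Taylor")

# Both edges are kept, distinguished only by their names.
edges = [q for q in graph if q[1:] == (":Taylor", ":married", ":Burton")]
```

In this sketch the name-agnostic query returns the same answer before and after the second marriage is recorded, while queries that care about individual occurrences can use the names.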

Thanks,
Souri.

________________________________
From: Pierre-Antoine Champin
Sent: Friday, January 20, 2023 6:27 PM
To: Souripriya Das
Cc: RDF-star WG
Subject: [External] : Future-proof modelling

Dear Souri,

I wanted to react to your presentation during the RDF-star call
yesterday, especially about the "future proof modelling" argument.

Consider the following two examples:

Example 1: consider a relational data model with a table Person and a
table Company. The table Person contains a column "worksFor", that is a
foreign key to Company. At some point, we need to represent the fact
that a given person works for two different companies at the same time.
Currently, this requires changing the model (replacing the column
Person.worksFor by a new table WorksFor with 2 foreign keys to Person
and Company).
Following your logic at the extreme, this would be an argument to extend
the relational model to allow multiple values in a column, so that this
use-case could be accommodated without changing the original model.
This would make the relational model much more complex, and would probably
not be worth it.
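
For concreteness, the remodelling in Example 1 might look like the following sketch, using Python's built-in sqlite3 module. The primary-key columns, the sample rows, and the company names are assumptions; only Person, Company, worksFor, and WorksFor come from the text.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Original model: Person.worksFor is a single-valued foreign key to Company.
con.executescript("""
CREATE TABLE Company (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Person (
    id INTEGER PRIMARY KEY,
    name TEXT,
    worksFor INTEGER REFERENCES Company(id)
);
INSERT INTO Company VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO Person VALUES (1, 'Ann', 1);
""")

# Remodelling: a WorksFor table with two foreign keys takes over from the column.
# (The old column is left in place here; dropping it is a further migration step.)
con.executescript("""
CREATE TABLE WorksFor (
    person INTEGER REFERENCES Person(id),
    company INTEGER REFERENCES Company(id),
    PRIMARY KEY (person, company)
);
INSERT INTO WorksFor SELECT id, worksFor FROM Person WHERE worksFor IS NOT NULL;
INSERT INTO WorksFor VALUES (1, 2);
""")

# Queries must now join through WorksFor instead of reading Person.worksFor.
employers = [row[0] for row in con.execute(
    "SELECT c.name FROM WorksFor w JOIN Company c ON c.id = w.company "
    "WHERE w.person = 1 ORDER BY c.name"
)]
```

Any query written against Person.worksFor has to be rewritten to use the join, which is the cost of the model change described above.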

Example 2: consider an RDF graph where a property :postalAddress has
domain :Person and range xsd:string. This is all very well, until
someone wants to describe addresses themselves (separate their different
"fields", link them to an entity of type city rather than a city name,
add geo-coordinates to an address...). This would require a change in
the model, where :postalAddress now points to an IRI or blank node, which
would carry the original string in its rdf:value property, but could carry
additional properties as well.
Following your logic at the extreme, someone could argue in favor of
allowing string literals in the subject position, so that they could add
properties to the "address string" without changing the original model.
This would be a very bad idea, because it would be conflating strings
with the addresses that they represent.
(NB: my point here is not to say that "literals as subjects" is a bad
idea per se, but that this would be a bad solution to this particular
problem)

My point here is that remodeling cannot always be avoided -- or that
avoiding it would overly complicate the model (example 1), or lead to
even worse modelling (example 2).

So yes, we should strive to make the user's life easier. But we must
keep in mind that this is a trade-off. The cure is sometimes worse than
the disease.

RDFn makes the inner model more complex (à la Example 1 above):

- it adds a "4th column" to every triple. IIUC, you seem to assume that
all implementations already deal with some form of triple identifier, all
we need is to expose it to the user. But I am not sure that all
implementations have such an internal identifier (I am actually pretty
sure that some don't).

- somehow, it turns graphs, that are currently sets of triples, into
multisets of triples. And multisets are tricky. What happens for example
when you merge two graphs containing an identical triple? Is it the same
triple? Two triples with different "default" identifiers? What happens
when you use SPARQL UPDATE to remove a triple? Do you remove only one of
them or all of them? Can of worms ahead...
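
The merge and removal questions can be made concrete with a Python sketch that models a set-based graph as a set and a multiset-based graph as a collections.Counter. This is an illustration of the ambiguity only; it does not describe how RDFn or any SPARQL implementation actually behaves.

```python
from collections import Counter

t = (":Taylor", ":married", ":Burton")

# Today: graphs are sets, so merging two graphs that share a triple is unambiguous.
g1, g2 = {t}, {t}
merged_set = g1 | g2  # one triple

# With multisets, at least two merge semantics are plausible.
m1, m2 = Counter([t]), Counter([t])
merged_sum = m1 + m2  # additive merge: the triple now occurs twice
merged_max = m1 | m2  # max-based merge: the triple still occurs once

# Removal is ambiguous in the same way.
after_remove_one = merged_sum - Counter([t])  # drop a single occurrence
after_remove_all = Counter({k: v for k, v in merged_sum.items() if k != t})
```

Each choice yields a different graph from the same inputs, so a multiset-based model would have to specify which one applies.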

   pa
Received on Monday, 23 January 2023 13:47:12 UTC