Re: Yes it can and should ;) Re: Can RDFstar be defined as only syntactic sugar on top of RDF (Re: weakness of embedded triples) from Holger Knublauch on 2020-10-29 (public-rdf-star@w3.org from October 2020)

From: Holger Knublauch <holger@topquadrant.com>
Date: Thu, 29 Oct 2020 13:11:36 +1000
To: public-rdf-star@w3.org
Message-ID: <7da9bde8-9222-112f-58e1-5258b2c61483@topquadrant.com>
On 10/28/2020 9:29 PM, Jerven Bolleman wrote:

> Hi All,
>
> Yes it can be defined as syntactic sugar, and IMO it should. 

I agree. And this is not merely a matter of allowing RDF* to be 
implemented by special IRIs as one implementation strategy among others, 
but it should become the only permitted implementation strategy. The 
reason is that operators such as isIRI() need to work consistently 
across implementations. isIRI() should return true for embedded triples. 
There should not be a new RDF node type because that is simply not 
needed. No application should break because it encounters RDF* graphs. 
Whether applications can make sense of embedded triples and interpret 
them as "reifications" (e.g. for display purposes) should be for them to 
decide incrementally. But by default these are just URIs to them.

Having said this, the spec could remain vague about how these long URIs 
are formed. This includes the question of how blank nodes are 
represented. I would simply map them to the serialization of the 
internal IDs. The APIs and SPARQL should provide a function to 
distinguish these special IRIs from "normal" ones, and then functions to 
extract subject, predicate and object, and vice versa. See 
http://datashapes.org/reification.html#tosh for an example implementation.

A minimum practice that the spec could prescribe is that these IRIs need 
to start with, say, "urn:triple:" but this is mainly to make sure that 
no other applications use those URIs by accident (which is really 
unlikely in practice but still...) and for human readers who stumble 
upon them in raw form.
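
As a rough illustration of the functions mentioned above (a test for the 
special IRIs, plus packing and unpacking subject, predicate and object), 
here is a minimal Python sketch. Only the "urn:triple:" prefix comes from 
this message. The percent-encoded, colon-separated layout and the names 
make_triple_iri, is_triple_iri and split_triple_iri are assumptions made up 
for the example; this is not the scheme used at the datashapes.org link, 
and the spec could leave all of it open.

```python
from typing import Tuple
from urllib.parse import quote, unquote

# Prefix floated in the message; everything after it is an assumed layout.
TRIPLE_PREFIX = "urn:triple:"


def make_triple_iri(subject: str, predicate: str, obj: str) -> str:
    """Pack three terms (given as N-Triples-style strings) into one IRI."""
    # safe="" percent-encodes ":" as well, so ":" can act as the separator.
    encoded = (quote(term, safe="") for term in (subject, predicate, obj))
    return TRIPLE_PREFIX + ":".join(encoded)


def is_triple_iri(iri: str) -> bool:
    """Distinguish the special IRIs from "normal" ones."""
    return iri.startswith(TRIPLE_PREFIX)


def split_triple_iri(iri: str) -> Tuple[str, str, str]:
    """Recover (subject, predicate, object) from a special IRI."""
    if not is_triple_iri(iri):
        raise ValueError(f"not a triple IRI: {iri!r}")
    parts = iri[len(TRIPLE_PREFIX):].split(":")
    if len(parts) != 3:
        raise ValueError(f"malformed triple IRI: {iri!r}")
    s, p, o = (unquote(part) for part in parts)
    return s, p, o
```

Because each term is encoded as a whole, a special IRI can itself appear 
as the subject or object of another special IRI, and a blank node would 
simply show up as whatever internal ID the store serializes it to.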

> Considering in the beginning that RDF* was defined in terms of RDF 
> reification. And it can be implemented that way e.g. as POC done for 
> rdflib.
>
> Which gets us back to the question of what we want to achieve with 
> RDFstar? I think most of us want to have reification without the 
> hassle of writing out a quad all the time. Plus having optimizations 
> possible at the storage level/query engine.
>
> A solution for the triples consisting only of IRIs and/or literals is 
> simply to generate an implicit IRI for them.
>
> For the triples with a blank node I think the simplest is to generate an
> implicit new blank node for them. (Which would be skolemized in some 
> form in any triple store anyway).

Not sure why this is necessary. Why not use IRIs? A typical scenario 
would be:

1. RDF* file gets loaded into graph store
2. Graph store selects its new internal IDs for blank nodes
3. References to bnodes from embedded triples use these same (new) IDs. 
All good.
4. RDF* file gets saved back.
5. The cycle repeats on another machine that loads this new file.

While such an RDF* graph is in memory (or database storage) the actual 
IDs don't matter to anyone, and shouldn't be relied on by any external 
graph. This is already the situation for blank nodes now - they are 
anonymous nodes within the current graph only.
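
The load cycle in steps 1 to 5 can be sketched roughly as follows. This is 
a toy model under stated assumptions, not how any particular graph store 
works: terms are plain strings, blank nodes are strings starting with 
"_:", an embedded triple is a nested tuple, and the GraphStore class and 
its "_:idN" labels are hypothetical.

```python
import itertools


class GraphStore:
    """Toy store that picks fresh internal blank node IDs on every load."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.triples = set()

    def load(self, triples):
        # Maps file-local bnode labels to store-internal IDs, for this load only.
        mapping = {}

        def relabel(term):
            if isinstance(term, tuple):
                # Embedded (or top-level) triple: relabel each position,
                # so embedded references get the same new IDs.
                return tuple(relabel(t) for t in term)
            if term.startswith("_:"):
                if term not in mapping:
                    mapping[term] = f"_:id{next(self._ids)}"
                return mapping[term]
            return term

        for triple in triples:
            self.triples.add(relabel(triple))


# A "file" in which _:b0 occurs both in an asserted and an embedded triple.
rdf_star_file = [
    ("_:b0", ":name", '"Bob"'),
    (("_:b0", ":age", '"23"'), ":certainty", '"0.9"'),
]

store = GraphStore()
store.load(rdf_star_file)
```

After the load, both occurrences of _:b0 carry the same store-chosen ID, 
and loading the same file again (here or on another machine) yields 
different IDs, which is harmless as long as nothing outside the graph 
relies on them.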

Likewise no external graph should rely on the specific syntax of these 
(possibly long) IRIs - they are only accessed and used via SPARQL* 
operators and corresponding functions.

>
> SPARQL and RDF syntax wise this would be simple. API wise this would
> also be easy in python rdflib. We basically have two new subclasses, 
> one of URIRef and one of BNode.
>
>
> from rdflib import BNode, URIRef
>
> class TripleUriRef(URIRef): ...
>
> class TripleBnode(BNode): ...
>
> in java rdf4j it would probably be something like
>
> interface Triple{
> }
> interface TripleIRI extends IRI, Triple {
> }
> interface TripleBNode extends BNode, Triple {
> }
>
> Semantics stay the same, we only get a new syntax for reification. 
> Question about PG|SA stays unanswered by doing this.
>
> For me as a user, what benefits would a different semantics for 
> RDFstar bring?
>
> Going the syntactic sugar route we don't need a specification for 
> RDFstar, just for TurtleStar, RDF/XMLstar and SPARQLstar etc.
> Which we would need anyway.
>
> The problem here is that the lack of a WG means we don't have a good way
> to determine consensus and actually record a decision.

A while ago there was a poll on times for a first meeting. Is this still 
the plan?

Holger


>
> Still, mapping RDFstar in terms of RDF reification leaves open the 
> issue of referential opacity. I think that this is a red herring: I 
> think the superman problem is an issue with data modelling, which 
> should not complicate our entire tech stack, i.e. the :superman 
> owl:sameAs :clarkKent is a faulty assertion and that fault leads to 
> the impossibility to correctly express what :louislane believes.
>
> Regards,
> Jerven
>
> PS. Regarding the PG or SA mode I am a fan of going for PG, given the 
> UniProt experience with RDF/XML rdf:ID, which is a PG syntax. rdf:ID 
> being a PG syntax is important for our internal code being needed to 
> generate and read our rdf/xml. Not having this kind of sugar in other 
> syntaxes is why we are still preferably shipping rdf/xml for UniProt.
>
> PPS. About 15% of UniProt triples are reification quads. Being able to 
> get these out of the quad table would be nice. Especially the 
> consequences for reducing the joins to use them and how badly these 
> quads fit into current indexes.
>
>
>
>
> On 10/28/20 9:57 AM, Pierre-Antoine Champin wrote:
>> Holger,
>>
>> (I did what should have been done a long time ago: rename this 
>> subthread to something more relevant)
>>
>> On 27/10/2020 23:31, Holger Knublauch wrote:
>>> On 10/28/2020 1:53 AM, Pierre-Antoine Champin wrote:
>>>> Holger,
>>>>
>>>> Now I'm confused. This thread (which should have been renamed a long
>>>> time ago) is, in my understanding, about Martynas' question raised 
>>>> here
>>>>
>>>> <https://www.w3.org/mid/CAE35Vmy3vbThwHnKjbhMQuwKkH0BhNoxr_Gp15Ri5LfOdedsSA@mail.gmail.com> 
>>>>
>>>>
>>>>> Does RDF* need new semantics at all?
>>>> While I believe the answer is "yes", I concede that answering "no" to
>>>> that question would be convenient, because it would mean that existing
>>>> implementations of RDF could handle RDF* at the syntactical level 
>>>> only,
>>>> i.e. parse Turtle* and store it as standard RDF triples.
>>>>
>>>> In your examples below, however, you propose to extend existing
>>>> implementations -- which defeats the purpose of fitting RDF* into
>>>> standard RDF semantics...
>>>
>>> The current RDF* draft requires introducing a 4th term type "RDF* 
>>> triples":
>>>
>>> > IRIs <https://www.w3.org/TR/rdf11-concepts/#dfn-iri>, literals 
>>> <https://www.w3.org/TR/rdf11-concepts/#dfn-literal>, blank nodes 
>>> <https://www.w3.org/TR/rdf11-concepts/#dfn-blank-node> and RDF* 
>>> triples <https://w3c.github.io/rdf-star/#dfn-triple> are collectively 
>>> known as RDF* terms.
>>>
>> Correct.
>>
>> In my understanding, introducing a new subclass of Node (TripleNode) 
>> was the implementation counterpart of this extension of the abstract 
>> syntax, but it seems that I was misunderstanding.
>>
>>> The approaches based on (long) URIs avoid this and therefore are 
>>> likely much less concerning w.r.t. existing implementations. A 
>>> syntactic mapping means that existing APIs can represent these 
>>> triples as normal URI nodes. From our own experience moving to this 
>>> design was not disruptive (although some users have raised concerns 
>>> about exposing the ugly long URIs in unexpected places such as 
>>> exporting them to plain Turtle).
>>>
>>> Having said this, *some* implementations may represent these triple 
>>> nodes differently, e.g. using the internal data structure I outlined 
>>> below. This avoids ever storing these long URIs, makes it arguably 
>>> easier to index and search over them, and probably keeps the issues 
>>> with changing bnode identifiers at bay. But this is an 
>>> implementation detail to me.
>>>
>> Agreed, but conversely, implementing the current draft using long 
>> IRIs can be considered an implementation detail...
>>
>> I think we also agree that the aim of the spec is to be as clear and 
>> simple as possible, and that implementations may use their own 
>> different internal models (for various reasons: backward 
>> compatibility, optimization...) as long as they behave according to 
>> the spec.
>>
>>> The additional semantics of interpreting these special URI nodes 
>>> differently would be local to the RDF* specs and would not require 
>>> adaptations to existing specs.
>>>
>> Again, agreed: this would be much less disruptive to the entire 
>> ecosystem, and much less work for us in writing this document ;-).
>>
>> But again I think that blank nodes in embedded triples make this 
>> approach very hard. Defining the correct behaviour of "long IRIs" when 
>> they actually represent embedded blank nodes, if even possible, would 
>> be extremely cumbersome (as opposed to the clear and simple way that 
>> the spec should aim for). That being said, any PR proving me wrong is 
>> welcome.
>>
>>> Holger
>>>
>
>
Received on Thursday, 29 October 2020 03:11:55 UTC