Re: Options for RDF Expression of Signal Data

[Moving a gdoc thread to email]

On 2/6/20 12:41 PM, Sandro Hawke wrote:
> Today's initial meeting of the Data Access Task Force was not a great 
> success, since no one else showed up. (I'm interested in hearing 
> (perhaps privately) why this was. Perhaps just too-short notice or 
> otherwise poorly announced. Ten people confirmed that time slot in 
> general.)
>
> On the upside, I took the time to write up the issue, and maybe this 
> is better done in writing anyway.
>
> I'm starting with what basic graph shape to use to represent the n-ary 
> relation inherent in the first signal.
>
> See Options for RDF Expression of “Date Website First Archived” 
> <https://docs.google.com/document/d/1f7hWNybjYcSFFRT48LcLdmH3KBBoYUeH-lKJypu6zQw/edit#heading=h.cn7f188h2acs>. 
> Comments in email or the doc welcome. I think this might just come 
> down to taste, but if there are any actual problems with any of the 
> options (beyond zero), it would be good to highlight those before 
> making a decision.
>
>      -- Sandro
>

On 2/6/20 2:26 AM, Pat Hayes wrote (in a Google Docs comment 
<https://docs.google.com/document/d/1f7hWNybjYcSFFRT48LcLdmH3KBBoYUeH-lKJypu6zQw/edit?disco=AAAAGMpCFhg>):

> OK, here's my 2c on this. 

:-)   <scared look>

> First, its not clear that this actually is an example of an N-ary 
> relation. It /could/ be described that way, but semantically it is 
> more like a single triple plus a comment about it (citing evidence for 
> it, in effect). 

Indeed, I was simplifying a bit. The references I cited (On Nary 
Relations <https://www.w3.org/2004/08/12-Yoshio/onNaryRelations.html> and 
Defining N-ary Relations on the Semantic Web 
<https://www.w3.org/TR/swbp-n-aryRelations/>) both use the term in their 
titles but then go on to cover many patterns, such as ones about 
evidence, which are not n-ary relations in the strict sense. Is there a 
better term for the broader problem?

> This matters because there will be cases which are genuinely N>2-ary 
> relations, and you might want to not get them confused with this case, 
> by keeping distinct encodings from day one.

Perhaps. Or it might be simpler to have a general n-ary-ish model that 
works across the board. Not sure where the "as simple as possible but 
no simpler" line falls here.

> Second, several of your alternatives don't make first base. Shape 4 
> fails because asserting the reification doesnt actually assert the 
> reified triple, it just says it exists. (See RDF semantics.) 

Ah, true, I wasn't seriously proposing that one, so I missed that bit. 
That's "easy" to fix by adding another arc, "a cred:Assertion". I've 
added a note to that effect in the doc, but not modified the example 
(yet). One could also make being-asserted a part of the semantics of 
cred:evidence, but I wouldn't do that.  I put quotes around "easy" 
because of my sense that truth predicates are dangerous.
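
To make the Shape 4 point concrete, here is a minimal Python sketch that treats a graph as a set of plain tuples. It is only an illustration under assumptions: the site IRI, the date value, the evidence IRI, and the expand_assertions helper are all made up, and the set-membership check merely stands in for real RDF entailment. cred:Assertion is the extra arc mentioned above, read here as an application-level convention.

```python
# Triples are plain (subject, predicate, object) tuples; every name and
# value below is illustrative, not taken from the actual doc.
CLAIM = ("ex:site", "cred:operationalAsEarlyAs", "2005-03-01")


def reify(node, triple):
    """Return the four standard reification triples describing `triple`."""
    s, p, o = triple
    return {
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    }


def expand_assertions(graph):
    """Assumed convention: a node typed cred:Assertion means that the
    triple it describes should be treated as asserted as well."""
    result = set(graph)
    for node, pred, obj in graph:
        if pred == "rdf:type" and obj == "cred:Assertion":
            parts = {p: o for s, p, o in graph if s == node}
            result.add(
                (parts["rdf:subject"], parts["rdf:predicate"], parts["rdf:object"])
            )
    return result


# Shape 4 as drawn: the reification plus a comment about it.
shape4 = reify("_:r", CLAIM) | {("_:r", "cred:evidence", "ex:snapshot")}

# The same graph with the extra "a cred:Assertion" arc.
shape4_fixed = shape4 | {("_:r", "rdf:type", "cred:Assertion")}

print(CLAIM in shape4)                           # reified only
print(CLAIM in expand_assertions(shape4))        # no marker, nothing added
print(CLAIM in expand_assertions(shape4_fixed))  # marker present
```

The three checks print False, False, True: the reification alone never contains the claim, and the claim only appears once a consumer applies the (assumed) cred:Assertion convention.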

We probably don't need to talk about this option more, unless someone is 
actually advocating for it.

> Shape 1 fails because what does _:a denote? What kind of thing is it? 
> I can't find any interpretation that makes sense. It seems to be both 
> a date and a 'credibility signal', whatever that is; but it is also 
> the subject of operationalAsEarlyAs. Can a date be operational at a date? 

[ Aside for the audience: Pat and I have been having this kind of debate 
for about 19 years now. As a newbie, I used to find the tone and 
strength of his arguments unsettling. Actually, I still do. ]

Sorry, it looks like my class naming turned out to be misleading. I was 
assuming the context for people seeing this model would be the planned 
new signals document.  Here's the current title and abstract of that 
document, to give that context:

    Community-Approved Credibility Signals


    Abstract: Credibility signals are observations, made by humans or
    machines, which are used in deciding how much to trust some
    information. This document specifies some types of these
    observations which seem particularly useful in online credibility
    assessments, especially when assisted by machine processing and a
    network of people and systems making related observations. It also
    includes some guidance on how credibility data (that is, data
    expressing these observations) can be exchanged online. The choice
    of which signals to include was made by the W3C Credible Web
    Community Group and is expected to be revised periodically in light
    of new information.

With that in mind, perhaps it's clearer that _:a denotes an 
observation, a "credibility signal". Perhaps a signal isn't exactly "an 
observation", but is more a record of the observation, or it's the 
information obtained by making an observation, but I doubt that level of 
semantic detail will be beneficial.

In this example, _:a represents an observation of the date a website is 
first archived. So perhaps that class name should be 
"cred:ObservationOfDateWebsiteFirstArchived", but in the context of 
hundreds of other such classes, all starting "ObservationOf..." I 
thought it best to drop that part. But maybe that's bad, since some 
people (like you in this case) will be coming in without that context 
and somehow thinking that when we call it a date it's a date.  :-)

> Shape 2 kind of makes sense except for the relation 
> credAbout:operationalAsEarlyAs which is misnamed, since (as you point 
> out) the _:b node can have other information attached to it so is not 
> particularly connected to anything to do with being as early as. What 
> you need is something like credAbout:avatar since _:b is playing the 
> role of a surrogate for the main subject here.

Yeah, I was just trying to follow the naming conventions of Wikidata. 
They view it as annotations on an arc, where they split the arc with 
this blank node in the middle, and have the inbound and outbound parts 
retain the name of the original arc, just in different namespaces.  At 
least, that's what I recall. I'm having trouble finding a simple and 
clear explanation of their approach. One reference is Reifying RDF: What 
Works Well With Wikidata? 
<http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf> but 
hopefully someone knows something more recent and/or simpler.
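
A rough Python sketch of the arc-splitting idea described above, under stated assumptions rather than Wikidata's actual vocabulary: split_arc is a hypothetical helper, the cred: namespace on the outbound half, the date, and the evidence IRI are invented for illustration, and which namespace goes on which half is a guess.

```python
def split_arc(subject, local_name, value, node, annotations=()):
    """Replace the single arc subject --local_name--> value with two arcs
    through `node`, reusing the local name in two namespaces."""
    # Inbound half: from the original subject to the new middle node.
    inbound = (subject, f"credAbout:{local_name}", node)
    # Outbound half: from the middle node to the original value.
    outbound = (node, f"cred:{local_name}", value)
    # Anything else said about the statement hangs off the middle node.
    extras = [(node, pred, obj) for pred, obj in annotations]
    return [inbound, outbound, *extras]


triples = split_arc(
    "ex:site",
    "operationalAsEarlyAs",
    "2005-03-01",
    node="_:b",
    annotations=[("cred:evidence", "ex:snapshot")],
)
for triple in triples:
    print(triple)
```

The middle node _:b is where any annotations attach, which is why it can end up carrying information unrelated to the arc's name.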

> But the one that works best is shape# 3. That captures the intended 
> meaning perfectly: it asserts the main statement directly, and adds a 
> comment ABOUT it. This correctly keeps data and metadata separated 
> without either of them distorting the other or requiring some kind of 
> hard-to-remember artifactual encoding. So it has my enthusiastic vote.

It was also my favorite, until I started actually writing software to 
use this stuff. In practice, I find that I want to use named graphs for 
several things, but using them for more than one thing at once is a 
problem. For example, say alice.example and bob.example each provide a 
data stream of these website-date observations. I want to use the graph 
name to keep track of what came from Alice and what came from Bob, but 
that conflicts with also using the graph name inside the data. How do I 
record that a quad came from Alice? I think there are some techniques, 
but they seem to get fairly complicated, and they sit close to a 
security layer (since I might trust triples from one source more than 
those from another), which adds risk. After a while, I found my 
preference shifting away from this shape.
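
The collision can be sketched in a few lines of Python. This is a toy model under assumptions, not anyone's actual implementation: quads are plain named tuples, the data values and the tag_by_graph_name helper are invented, and the five-field SourcedQuad at the end is just one possible workaround, not a standard RDF construct.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    s: str
    p: str
    o: str
    g: str  # the single graph-name slot


# Shape 3 style data from Alice: the claim lives in graph _:c, and a
# triple in the default graph comments on _:c. Values are illustrative.
DEFAULT = ""
alice_feed = [
    Quad("ex:site", "cred:operationalAsEarlyAs", "2005-03-01", "_:c"),
    Quad("_:c", "cred:evidence", "ex:snapshot", DEFAULT),
]


def tag_by_graph_name(quads, source):
    """Naive provenance: overwrite the graph slot with the source."""
    return [q._replace(g=source) for q in quads]


def graphs_with_content(quads):
    """Graph names that actually hold at least one quad."""
    return {q.g for q in quads}


tagged = tag_by_graph_name(alice_feed, "https://alice.example/")

# The comment still has _:c as its subject, but no graph is named _:c
# any more, so the link between data and metadata is lost.
print("_:c" in graphs_with_content(alice_feed))
print("_:c" in graphs_with_content(tagged))


class SourcedQuad(NamedTuple):
    """One possible workaround (an assumption, not a standard RDF
    construct): keep the source in a fifth field outside the quad."""
    quad: Quad
    source: str


kept = [SourcedQuad(q, "https://alice.example/") for q in alice_feed]
print("_:c" in {sq.quad.g for sq in kept})
```

This prints True, False, True. The workaround keeps the in-data graph name intact, but every consumer now has to understand the extra field, which is the kind of added complexity described above.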

There's also an issue which I'm surprised doesn't bother you: the 
semantics of RDF datasets. There are two aspects here: (1) the statement 
in the named graph isn't exactly asserted by the dataset; and (2) the 
graph name (_:c) does not actually denote the graph, it is merely paired 
with it. These are both issues that you and I talked about in the 2011 
RDF 1.1 Working Group, but as I recall they were never settled. I think 
the relevant docs are RDF 1.1 Concepts and Abstract Syntax 
<https://www.w3.org/TR/rdf11-concepts/> and RDF 1.1: On Semantics of RDF 
Datasets <https://www.w3.org/TR/2014/NOTE-rdf11-datasets-20140225/>.

That said, I don't think those semantic issues are necessarily 
real-world problems. I think in practice it's entirely possible to 
publish and interact with datasets with whatever semantics we actually 
want, with all the potential problems avoided by flagging the dataset in 
some metadata as having our chosen semantics. Maybe.

> <End of 2c rant>

Thank you so much for your thoughts on this. I'm curious if I've missed 
any of your points, or if I've swayed you at all.

       -- Sandro

Received on Friday, 7 February 2020 15:22:17 UTC