Re: queries, snowflakes and referential opacity (was Re: Three ideas) from thomas lörtsch on 2022-01-25 (public-rdf-star@w3.org from January 2022)

From: thomas lörtsch <tl@rat.io>
Date: Tue, 25 Jan 2022 15:08:45 +0100
To: Fabio Vitali <fabio.vitali@unibo.it>
Cc: "public-rdf-star@w3.org" <public-rdf-star@w3.org>
Message-Id: <806BF745-35EF-431F-A6DE-79EF5B522959@rat.io>
> Am 22.01.2022 um 23:01 schrieb Fabio Vitali <fabio.vitali@unibo.it>:
> 
> Dear Thomas, 
> 
>> Well, I tried to explain my view on this in the past already but I’ll try again. It all depends on your expectations. The :role property in your example is not necessarily tensed. I can’t remember any property in any Semantic Web vocabulary that is tensed. They don’t explicitly say "now". They say that the relation exists, period. In this view all of your examples above are not wrong (or false, as the logician would prefer to say), they are true. The relation of :role between :Trump and :USPresident undoubtedly exists. It is a relation that we can talk about and to which we can add further properties like e.g. a timespan. That’s a way of looking at things. This glass is always half full. Given my postulate that no description can and will ever be complete it is also a very healthy way to look at things I claim. It is also what one experiences on the web: you google for "president USA" and you sure will get a few different results. That is normal, and easy to grasp. Everybody understands this. If you ask Google "Who _is_ the president of the United States" (emphasis mine) you’ll probably get the right answer. In general every moderately well educated person will know that good results require good questions. Nothing new here either. So really, definitely no need for the kind of unasserted assertions that RDF-star embedded triples are. 
>> 
>> I asked before and I think you never answered how you think people should query your proposal. It would IMO very often require a two fold query, asking for the asserted and the unasserted assertion. That’s just laborious, tiresome, for really no good reason. 
>> 
>> Your whole idea IMO is quite overengineered, fixing (well, patching rather) a problem that doesn’t exist if you look at things the right (intuitive, I claim) way. Really, the kind of problem you’re trying to solve IMO only exists for people that got lost in semantic rat holes ;-)
> 
> Sorry for not answering soon, I probably missed the question. So briefly: 
> 
> Let's first consider opinions. Here we do not consider truth or falsehood, but agreements or disagreements. Thus we consider each statement to belong to one and only one category: 
> 
> * undisputed: no one has objected to this statement so far. Whether this means the statement is true we neither know nor care: simply, there has been no discussion on it. This is expressed as a plain RDF triple, e.g., [1]. 
> * disputed: two or more different and incompatible opinions exist about this statement. We do not take sides, we represent them all as conjectural, and none as asserted. This is expressed with quoted triples (or, in my proposal, with conjectures), e.g., [2] and [3].
> * settled: two or more different and incompatible opinions exist about this statement. Yet, maybe following the majority view of the scholars, we do take sides, and represent them all as conjectural, but at the same time one as asserted. I think this is different from simply re-expressing the one as a plain triple: a vase broken and glued back together is different from an intact vase: once a doubt has been expressed, there is no going back. Settled situations are expressed with quoted triples for the losing statements, and an annotated triple (or, in my proposal, a collapsed conjecture) for the winning one, e.g., [4] and [5]. 
> 
> For instance, suppose that this is a dataset containing all types of statements:  
> 
> # no-one ever objected to this
> [1]    :monaLisa      dc:creator :daVinci .   
> 
> # two opinions exist on this, and we take no side
> [2] << :salvatorMundi dc:creator :daVinci     >> :accordingTo :MartinKemp .    
> [3] << :salvatorMundi dc:creator :Boltraffio  >> :accordingTo :JacquesFranck .  
> 
> # this one used to be misattributed, but now scholars agree with Pietro Marani, so we assert this, too. 
> [4] << :annunciation  dc:creator :Ghirlandaio >> :accordingTo :earlyScholars .  
> [5a] << :annunciation  dc:creator :daVinci     >> :accordingTo :PietroMarani  .  
> [5b]    :annunciation  dc:creator :daVinci.                                      

So it seems that you can realize three categories - undisputed, disputed and settled - with two syntactic variants: unasserted and asserted. But what if you still want to annotate statement [1] with a source? Would it then be a disputed claim? What about all the other possible ways to annotate or qualify a statement? It seems to me that an explicit annotation declaring a certain statement as disputed, undisputed, or whatever else would provide a more extensible solution, more in line with how the semantic web works in general.
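
To illustrate what such an explicit annotation could look like, here is a rough sketch as a toy model in Python rather than in any RDF syntax. The :status property, its values and the :someCatalogue source are hypothetical names made up for this sketch; the point is only that the status is stated outright, so adding a source to statement [1] does not change its category.

```python
from collections import defaultdict

# Toy model, not an RDF library: a statement is a (subject, predicate, object)
# tuple, and its annotations are (property, value) pairs attached to it.
# The :status property and its values are hypothetical, invented for this sketch.
annotations = defaultdict(list)

def annotate(statement, prop, value):
    """Attach one (property, value) annotation to a statement."""
    annotations[statement].append((prop, value))

def statements_with(prop, value):
    """Return all statements carrying the given annotation, sorted."""
    return sorted(s for s, pairs in annotations.items() if (prop, value) in pairs)

mona = (":monaLisa", "dc:creator", ":daVinci")
kemp = (":salvatorMundi", "dc:creator", ":daVinci")
franck = (":salvatorMundi", "dc:creator", ":Boltraffio")

annotate(mona, ":status", ":undisputed")
annotate(mona, ":accordingTo", ":someCatalogue")  # hypothetical source; status unaffected
annotate(kemp, ":status", ":disputed")
annotate(kemp, ":accordingTo", ":MartinKemp")
annotate(franck, ":status", ":disputed")
annotate(franck, ":accordingTo", ":JacquesFranck")

undisputed = statements_with(":status", ":undisputed")
disputed = statements_with(":status", ":disputed")
```

Further qualifications (time spans, confidence, provenance) would be additional annotate() calls on the same statement, without touching its status.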


> Then there might be several types of queries. For instance: 
> 
> "Give me all the paintings we now attribute to Leonardo da Vinci": this means both undisputed and settled attributions, so actually only the plain triples: 
> 
> SELECT DISTINCT ?painting WHERE {
>  ?painting dc:creator :daVinci. 
> }
> 
> "Give me all the paintings somebody for any reason attributed to Leonardo da Vinci": this means both undisputed and disputed claims, so in this way:
> 
> SELECT DISTINCT ?painting WHERE {
>  { ?painting dc:creator :daVinci. } 
>  UNION 
>  { << ?painting dc:creator :daVinci >> :accordingTo ?anyOne . } 
> }

This is the query I was referring to. To me this represents quite a complication of matters. However, two things:

1) in this case your approach is not the culprit. A syntactic shortcut to facilitate querying for a triple in both asserted and unasserted form would seem justified if qualified relations were to become a standard modelling technique.

2) in other mails I understood that you would make much wider use of unasserted embedded statements, to model practically anything that is qualified in any way - that’s what I was arguing about with you, and in such a broadly defined scenario the need to query for a UNION of both representations IMO would indeed present quite a burden.
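
A rough sketch of what such a shortcut would have to do, again as a toy model in Python and not as a proposal for actual SPARQL syntax. The find() helper and its include_quoted flag are hypothetical; the data mirrors statements [1] to [5b] above.

```python
# Toy model: asserted triples live in a set; quoted (unasserted but annotated)
# triples are the keys of a dict holding their annotations.
ANY = None  # wildcard position in a pattern

asserted = {
    (":monaLisa", "dc:creator", ":daVinci"),
    (":annunciation", "dc:creator", ":daVinci"),
}
quoted = {
    (":salvatorMundi", "dc:creator", ":daVinci"): [(":accordingTo", ":MartinKemp")],
    (":salvatorMundi", "dc:creator", ":Boltraffio"): [(":accordingTo", ":JacquesFranck")],
    (":annunciation", "dc:creator", ":Ghirlandaio"): [(":accordingTo", ":earlyScholars")],
    (":annunciation", "dc:creator", ":daVinci"): [(":accordingTo", ":PietroMarani")],
}

def matches(triple, pattern):
    """True if every non-wildcard position of the pattern equals the triple's term."""
    return all(p is ANY or p == t for t, p in zip(triple, pattern))

def find(pattern, include_quoted=False):
    """Match asserted triples, and optionally quoted ones too (the UNION in one call)."""
    hits = {t for t in asserted if matches(t, pattern)}
    if include_quoted:
        hits |= {t for t in quoted if matches(t, pattern)}
    return sorted(hits)

pattern = (ANY, "dc:creator", ":daVinci")
now_attributed = [s for s, _, _ in find(pattern)]
ever_attributed = [s for s, _, _ in find(pattern, include_quoted=True)]
```

The second call corresponds to the UNION query; the burden I mean in 2) is that, without some such shortcut, every query over qualified data has to spell out both branches.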


> "Give me all disputed attributions of paintings that may (or may not) have been done by Leonardo da Vinci": this means both disputed and settled, so actually only quoted triples:
> 
> SELECT DISTINCT ?painting ?scholar WHERE {
>  << ?painting dc:creator :daVinci >> :accordingTo ?scholar .
> }
> 
> "Give me all paintings by Leonardo da Vinci that have never been attributed to anyone else": this means only undisputed, so in this way:
> 
> SELECT DISTINCT ?painting WHERE {
>  { ?painting dc:creator :daVinci. } 
>  MINUS 
>  { << ?painting dc:creator ?someoneElse >> :accordingTo ?anyOne . } 
> }
> 
> There are several other types of queries, but I think you can see the schema. 
> 
> ----
> 
> For temporal queries, the model is totally analogous. Suppose you have the following dataset: 
> 
> << :GeorgeWBush :role :USPresident >> :between [ :start "2001"^^xsd:gYear ; :end "2009"^^xsd:gYear ] . 
> << :BarackObama :role :USPresident >> :between [ :start "2009"^^xsd:gYear ; :end "2017"^^xsd:gYear ] . 
> << :DonaldTrump :role :USPresident >> :between [ :start "2017"^^xsd:gYear ; :end "2021"^^xsd:gYear ] . 
> << :JoeBiden    :role :USPresident >> :between [ :start "2021"^^xsd:gYear ; :end "2023"^^xsd:gYear ] . 
> 
> BTW, we may disagree about whether to specify the end of term for Joe Biden, there are advantages and disadvantages. Now I prefer homogeneity of representation, so I put it in. We may also disagree about whether the CURRENT president deserves an additional plain triple without temporal constraints, but I do not like it: it adds an expiration date to the dataset and does not really simplify the queries. 
> 
> Now you can query: 
> 
> "give me all US presidents at any time", in this way: 
> 
> SELECT DISTINCT ?person WHERE {
>  << ?person :role :USPresident >> :between ?anyTime . 
> }
> 
> 
> "give me the person who was US president in 2012", in this way: 
> 
> SELECT DISTINCT ?person WHERE {
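>  << ?person :role :USPresident >> :between [ :start ?start ; :end ?end ] .
>  # compare the years as integers: standard SPARQL defines no ordering on year-typed literals
>  FILTER (xsd:integer(STR(?start)) <= 2012 && xsd:integer(STR(?end)) >= 2012)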
> }
> 
> "give me the current US president", in this way: 
> 
> SELECT DISTINCT ?person WHERE {
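>  << ?person :role :USPresident >> :between [ :end ?end ] .
>  # YEAR(NOW()) is an integer, so the end year is cast to an integer before comparing
>  FILTER (xsd:integer(STR(?end)) >= YEAR(NOW()))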
> }
> 
> "give me the US president following George W. Bush", in this way: 
> 
> SELECT DISTINCT ?person WHERE {
>  << :GeorgeWBush :role :USPresident >> :between [ :end   ?date ] . 
>  << ?person      :role :USPresident >> :between [ :start ?date ] .
> }
> 
> ... and I could go on. 
> 
> I do not think this is overly complex, nor over-engineered. 
> 
> In fact, I totally think this approach is MUCH simpler and more flexible than using n-ary relations for opinions and/or temporal constraints. 

What I find over-engineered is not annotated statements per se but your approach of treating every annotated triple as unasserted by default. I have written to you about that before, at least twice, and I’m not going to make the whole argument again.

>> A star like scheme:
>> 
>> E1 type event
>>  participant X
>>  activity goingToTheMovies
>>  theatre Y
>>  showing "The Abyss"
>> 
>> 
>> The same as a snowflake:
>> 
>> X goingToTheMovies _:b
>> _:b theatre Y
>>   showing "The Abyss"
>> 
>> The snowflake still resembles the original idea of a triple, even more so if you replace ’showing’ by rdf:value. That’s what Pat prefers IIUC and it is definitely more true to the graph paradigm than the star scheme, but not necessarily easier to query I guess - that IMO very much depends.
> 
> Thank you. I was not aware this pattern had a name, and this is very interesting to me. I used to call it "the wikidata Statement approach", now it makes more sense. I think this approach improves somewhat on the star-scheme model, but not much, for two separate reasons. 
> 
> First, it still adds an additional entity to the dataset only for the purpose of creating something to attach properties to, and the difference is that we are saving one triple of the 4-5 of the star-scheme. Second, this approach messes with the original triple in important ways: for instance it makes the range of the original property fairly complicated, because it must now be a union of whatever was the original range, PLUS a new nondescript class (DIFFERENT FROM the original range) just so you can assign properties to it.   
> 
> In general, I am a newcomer in this part of the SW, but I am surprised at the abundant reliance on blank nodes for handling so many dark and unconfessable aspects of data representation, and this is no exception. Even though they do not have an IRI, they are still nodes, i.e. they represent entities that exist in the dataset, they can be counted, they affect and are affected by the overall ontology, etc., yet they seem to be used as duct tape is used in engineering, as the quick fix to keep any two random things together, good for every situation until we find something better. I am not sure I like it. 

They are a matter that is more complicated than it seems. Aidan Hogan’s "Everything you always wanted to know about blank nodes" is a good start if you’re looking for a thorough introduction. 
Just two things:
- they have counting semantics in SPARQL but not in RDF. In RDF they are existential in the first-order logic (FOL) sense, but RDF semantics doesn’t REQUIRE leaning.
- they are more than just duct tape, they are indeed a very elegant tool. They provide a means to add structure to graphs without adding much burden. We are very used to structures - lists, trees, tables - none of which a graph provides out of the box. Blank nodes help create them without much fuss and without distracting from the core issues we want to express.
- and if you think about the elegance with which blank nodes allow you to make composite statements although RDF is strictly monotonic and no statement is allowed to rule into the meaning of another statement - that wouldn’t be possible any other way I guess.
That’s three things actually.
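
To make the first point concrete, a small sketch as a toy model in Python. It assumes a graph of two triples that differ only in their blank node object. count_objects() mimics what a SPARQL COUNT over { :a :p ?o } would report, and lean_simple() is a deliberately restricted, hypothetical helper: it only handles blank nodes that occur exactly once in the graph, in object position. Leaning in general requires a search for graph homomorphisms.

```python
from collections import Counter

# Toy model: terms starting with "_:" are blank nodes.
graph = {
    (":a", ":p", "_:b1"),
    (":a", ":p", "_:b2"),
}

def is_blank(term):
    return term.startswith("_:")

def count_objects(g, s, p):
    """One solution per matching triple, as SPARQL would count bindings of ?o."""
    return sum(1 for (s2, p2, _) in g if (s2, p2) == (s, p))

def lean_simple(g):
    """Drop triples whose single-use blank object can be mapped onto another object.

    Restricted sketch: only considers blank nodes occurring exactly once in the
    whole graph, and only in object position.
    """
    occurrences = Counter(t for triple in g for t in triple if is_blank(t))
    kept = set(g)
    for s, p, o in sorted(g):
        if is_blank(o) and occurrences[o] == 1:
            # Redundant if some other remaining triple shares subject and predicate.
            if any((s2, p2) == (s, p) and o2 != o for s2, p2, o2 in kept):
                kept.discard((s, p, o))
    return kept

sparql_count = count_objects(graph, ":a", ":p")
lean_size = len(lean_simple(graph))
```

Both graphs say the same thing under RDF semantics ("there is something that :a is related to via :p"), yet a counting query sees two solutions in the first and one in the second.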

> Speaking of which, I wonder: a node cannot be the range of a data property... can we still use the snowflake model in these cases? e.g.: 
> 
> Movie director Lana Wachowski was given the name "Larry" and used it until some time around 2008, when she started using "Lana" instead. 
> 
> :LanaWachowski foaf:givenname _:name1 .
> _:name1  rdf:value "Larry" ;
>          :between [ :end "2008"^^xsd:gYear ] .
> 
> :LanaWachowski foaf:givenname _:name2 .
> _:name2  rdf:value "Lana" ;
>          :between [ :start "2008"^^xsd:gYear ] .
> 
> I have used two blank nodes to hold the string values of the two foaf:givenname triples. 
> 
> *** IS THIS USE OF FOAF EVEN LEGAL?  ***

I have no idea. IIUC it wouldn’t be easy to define the range as being x OR y, and not x AND y. But that's another issue.

> BTW, please note that data properties would not create problems in either rdf-star or conjectures, since neither introduces new entities: 
> 
> << :LanaWachowski foaf:givenname "Larry" >> :between [ :end   "2008"^^xsd:gYear ] . 
> << :LanaWachowski foaf:givenname "Lana"  >> :between [ :start "2008"^^xsd:gYear ] . 
>   :LanaWachowski foaf:givenname "Lana". 
> 
>> [...] Then there’s your approach
>> 
>> << X goingToTheMovies "The Abyss" >> theatre Y
>> 
>> because you don’t care much about referential transparency (but trust me: you should!) and because you expect people to query for unasserted assertions too (and use precisely the right URIs because you don’t care about referential transparency). You’d also have to introduce some intermediate if you wanted to disambiguate multiple occurrences of X going to the movies.
> 
> More about referential transparency later. 
> 
> But of course the issue of multiple occurrences DOES matter to me, a lot. I think this is one of the ways conjectures are better than rdf-star, but who am I to judge...
> 
> Anyway, I have the feeling that the need to distinguish between different occurrences of the same triple will not be as frequent as it is feared. In many cases we will simply quote the rdf-star triple containing the triple. It means a slightly different thing, but anyway... 
> 
>>>>> These triples are now all ABSOLUTELY TRUE and correct. Not the internal triples, of course: they are still neither simply true nor simply false, and we could not assert them truly in any form. No: the outer triples are simply true. 
>>>> 
>>>> Notwithstanding that I do not endorse this proposal: are you aware that << :JoeBiden :role :President >> doesn’t refer to the person known by the name Joe Biden, a concept named role and a concept of presidency, but instead to the person known by the name Joe Biden AND referred to by the URI :JoeBiden (NOT e.g. that same person but referred to by the URI :JBiden or :JosephBiden etc), and analogously to the concept of presidency AS referred to by the URI :President (NOT the URI :PRESIDENT, or even wikipedia:President, dbpedia:President, LOC:President etc etc). And so on. Is that what you want? Do you find that useful in the general case? I’m asking because you propose to employ this way of modelling _in general_ and I wonder if the referentially transparent semantics of RDF standard reification weren’t in general much more appropriate. IMO you would have to go the route of the still informal :occurrenceOf property to arrive at the entities whose relation you actually would like to annotate, conditionalize, whatever. I have the feeling that many proponents of RDF-star take it for what they would like it to mean, not for what it really is.
>>> 
>>> I never entered the discussion about referential transparency. I do not want to.  Personally, I do not see much of a point in referential opacity, and I would live much more happily with referential transparency. Yet, I know that, when using the same IRI inside and outside of quoted triples, they both refer to the same entity. This is enough for me: there is a way out (duh: use the same IRI!), and if I understand this correctly referential opacity only makes it more difficult to use owl:sameAs. Well, I can live without it.
>> 
>> IMO it rather has the potential to break the semantic web if used as carelessly as you suggest (and, in your defense, as advertised). That may be slightly exaggerated but it’s definitely a bigger problem than just a few missed owl:sameAs statements, of which we have too many anyway. For someone so invested in correct semantics as you I find this lack of interest in a very foundational issue rather bewildering. 
>> 
>> To give you a somewhat drastic example: if you use quoted/embedded triples as defined per the proposed semantics you SAY that your annotation only holds for those specific URIs. How should I know if you really mean it or if you were just too lazy to create the :occurrenceOf intermediary? So are you speaking about a :President but not about a wikipedia:President? Even if I don’t know the exact answer I have to assume that you know what you are doing - especially as you are so crazy scrupulous about the conditions under which your statement is true - and I have to say: I will discard your statement. I don’t know if I know what you mean. I’m talking about a president as defined by Wikipedia, DBpedia, the Library of Congress and whoever else has a reasonably common sense understanding of the concept of a president. You don’t seem to be among those as you insist on using that very special URI and no other to refer to what you might call a president. As you don’t explain why or how your understanding of a president is different from those other conceptualizations from well known sources I have to shield myself against the possibility that you are some sectarian zealot with a very special view on reality that I don’t subscribe to. This concern is still valid even if you use the Wikipedia:president concept. I’m still suspicious: why only that, why not the others? There may be something strange about your worldview that I don’t subscribe to. Maybe Wikipedia's definition differs in a subtle but important way from other definitions that I’m not aware of. It seems important enough for you to use embedded triples instead of a referentially transparent representation. In short: I don’t know what you are talking about. You have left the common ground of the semantic web as we know it. So, off your triple goes. I hope that is drastic enough as an example. 
> 
> I am totally convinced that referential opacity in rdf-star pollutes the well. 
> 
> I am also convinced that the well is already dry and full of snakes. 
> 
> Once upon a time, many years ago, it was decided that different IRIs identify different entities. At the same time, people actively avoided creating a single, global repository where a single IRI for each common entity could be created, shared and re-used. Finally, they selected a higher layer of the semantic web, OWL, to handle the concept of sameness between entities, as if it were a weird ontological aspect of reality rather than a structural foundation of the representation model. 
> 
> Three bad decisions, IMHO. 

You are not "on the web", IMHO. This is a decentralized information system, which brings with it some burdens - but fewer burdens than opportunities, some argue.

> Wikidata IRIs are a good step in the right direction, but they arrived too late to prevent the problem. 
> 
> Now we have dozens of different prefixes, each defining their own IRIs for the same entities, which we now have to treat as different entities because they are using different IRIs. We have 12 different Napoleons, 12 different Leonardos da Vinci, 12 different Mona Lisas, and, higher up on the layer cake, we have to use a huge list of owl:sameAs triples to tell everybody that they should infer that these entities are really the same thing, whatever "same" means at the ontological level. 
> 
> Suppose now John publishes a new dataset where there is an owl:sameAs triple for only 10 of the 12 well-known IRIs about one of his entities. What now? Did John just forget to add two triples, or are there differences between these entities and John truly does not want to assert they are the same thing? That is not a big deal for concrete entities such as Napoleon, Leonardo da Vinci or Mona Lisa, where I can BET John just forgot about them, but what about abstract concepts such as US President, Democracy, or Geoid? Is John sending a subtle political message through the absence of these two owl:sameAs? 
> 
> Also: what happens when we do not use OWL at all? What can I say of the relations between :USPresident, your:USPresident and dbpedia:USPresident, etc. if I neither accept nor make use of OWL?
> 
> You see, quoting triples whose IRIs are not predictably stable is basically a second-level issue. Many more important ones exist outside of rdf-star. 
> 
> I find the issues arising from these complexities overwhelming and invincible. Therefore, I, personally, decided that I am not fighting this battle. 
> 
> Thus, when I am using an entity that is already defined on Wikidata, I use their IRI. For all entities that are NOT defined on Wikidata, I create my own IRI. 
> 
> I try to avoid the pointless proliferation of abstract entities (e.g., the Event corresponding to the beginning of the creation of the Mona Lisa painting as speculated by scholar John Smith), because they surely do not exist in Wikidata, there would be millions of them, and they would be of little use to anyone (which is why I do not like n-ary relations very much). 
> 
> I also do not like nor use blank nodes, which are simply a way, to me, to say "here is an entity that is sufficiently important for me to mention, so that I can refer to it, but not enough to bother to provide it with an IRI, so that you cannot." 
> 
> To summarize: in general, I find owl:sameAs, multiple IRIs for the same entities, and blank nodes to be snake pits, and I need to gear up appropriately and breathe deeply to calm myself before I start dealing with them. Compared to them, referential opacity in quoted triples is a minor nuisance (although not a pleasure cruise). 
> 
> I hope I was clear now. 

Yes you were and I see why you don’t regard referential opacity as a big problem. But I disagree.
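
For the record, the kind of mismatch I have in mind, as a toy model in Python. The alias table stands in for owl:sameAs in a much simplified, one-directional form, and :JBiden is a hypothetical second IRI for the same person. An opaque lookup only finds the annotation under the exact IRIs used inside the quoted triple; a transparent one finds it under any co-referring IRI.

```python
# Hypothetical alias -> canonical IRI table, a simplified stand-in for owl:sameAs.
same_as = {":JBiden": ":JoeBiden"}

def canon(term):
    """Map a term to its canonical IRI if an alias is known, else leave it alone."""
    return same_as.get(term, term)

def canon_triple(triple):
    return tuple(canon(term) for term in triple)

# Annotations keyed by the quoted triple exactly as it was written.
quoted_annotations = {
    (":JoeBiden", ":role", ":USPresident"): {":start": "2021"},
}

def lookup_opaque(triple):
    """Referentially opaque: the quoted triple must use exactly these IRIs."""
    return quoted_annotations.get(triple)

def lookup_transparent(triple):
    """Referentially transparent: co-referring IRIs are treated as interchangeable."""
    target = canon_triple(triple)
    for quoted, annotation in quoted_annotations.items():
        if canon_triple(quoted) == target:
            return annotation
    return None

query = (":JBiden", ":role", ":USPresident")
opaque_hit = lookup_opaque(query)
transparent_hit = lookup_transparent(query)
```

Under the opaque reading the query for :JBiden comes back empty even though the alias is known, which is exactly the situation a consumer of somebody else's data would be in.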

Thomas



> Ciao and thank you for your time
> 
> Fabio
> 
> 
> --
> 
> Fabio Vitali                            Tiger got to hunt, bird got to fly,
> Dept. of Computer Science        Man got to sit and wonder "Why, why, why?'
> Univ. of Bologna  ITALY               Tiger got to sleep, bird got to land,
> phone:  +39 051 2094872              Man got to tell himself he understand.
> e-mail: fabio@cs.unibo.it         Kurt Vonnegut (1922-2007), "Cat's cradle"
> http://vitali.web.cs.unibo.it/
> 
> 
> 
>
Received on Tuesday, 25 January 2022 14:09:09 UTC