Re: RDF lists/arrays and n-ary relations [was Re: OWL and RDF lists]

Hi David

On Sep 28, 2022, at 7:41 PM, David Booth <david@dbooth.org> wrote:

Hi Pat!

On 9/28/22 01:42, Patrick J. Hayes wrote:
>> On Sep 27, 2022, at 1:32 PM, David Booth <david@dbooth.org> wrote:
>> On 9/27/22 09:58, Pierre-Antoine Champin wrote:
>>> lists do not only give you order, they give you "closedness": the
>>> first/rest ladder captures the fact that the list contains all
>>> these elements *and only them* (and in this particular order).
>>
>> Small but important clarification: currently RDF lists do NOT give
>> "closedness",
>
> Sure they do, if you use the vocabulary properly.

But that was my point: RDF does not currently *enforce* that it
be used properly.  A future RDF should support arrays that are
impossible to malform.

That should be an extension on top of RDF, not a revised version of RDF. Imposing syntactic rigor doesn't change anything about RDF itself, either its syntax or its semantics. It would be an RDF app that rejects RDF graphs with 'strange' (including incomplete) collection triples. The rejection would be nonmonotonic, but the underlying RDF would not be. All of which is fine and quite consistent with the current RDF specs, so no new standard is required. Just implement the checks, document them, call it a semantic extension (refer to the specs) and run with it.
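
For concreteness, here is a rough sketch of such a check in Python with rdflib; the function name and the exact well-formedness rules are only illustrative, not anything normative:

from rdflib import Graph, RDF

def well_formed_collections(g: Graph) -> bool:
    # Illustrative rules only: every node used with the collection
    # vocabulary must have exactly one rdf:first and exactly one rdf:rest,
    # and its rest-chain must reach rdf:nil without revisiting a node.
    nodes = set(g.subjects(RDF.first, None)) | set(g.subjects(RDF.rest, None))
    for n in nodes:
        if len(list(g.objects(n, RDF.first))) != 1:
            return False
        if len(list(g.objects(n, RDF.rest))) != 1:
            return False
        seen, node = set(), n
        while node != RDF.nil:
            if node in seen:
                return False          # cycle in the rest-chain
            seen.add(node)
            node = g.value(node, RDF.rest)
            if node is None:
                return False          # chain never reaches rdf:nil
    return True

An application enforcing this policy would simply refuse graphs for which the check returns False; that is the application-level, nonmonotonic rejection described above, not a change to RDF itself.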


> :thislist rdf:first :A .
> :thislist rdf:rest :x1 .
> :x1 rdf:first :B .
> :x1 rdf:rest :x2 .
> :x2 rdf:first :C .
> :x2 rdf:rest rdf:nil .
>
> tells you that :thislist is exactly ( :A :B :C ), with three
> items. You can't express this kind of 'closedness' with the
> container vocabulary because it provides no way to say 'and
> no more'.
>
> Now, of course RDF does not impose 'proper' usage of the collection
> vocabulary, because it imposes hardly any syntactic restrictions
> of how any vocabulary is used. But you can define a semantic
> extension of RDF which does impose this, and indeed the specs
> mention this possibility explicitly.

Right, that was my point.

But if the current specs tell you explicitly that doing this kind of checking is legal, why do you feel that we need a new version of the specs to do it?

  More below . . .

>> but "closedness" is definitely what we want (for the vast majority
>> of use cases).  That is precisely one of the reasons I and others
>> feel such a need for (closed) arrays.
>>
>>> . . .
>>>> Hugh Glaser wrote:
>>>> I also worry that if I assert exactly the
>>>> same knowledge twice, a paper could end up with two authorLists,
>>>> certainly if bonds got involved.
>>> Indeed...  That's actually something that "lists as first class
>>> citizens" could help solve -- that is, if they were defined in
>>> such a way that two lists with exactly the same elements are
>>> in fact one and the same object.
>
> Whoo, wait a minute. Do you really want that kind of extensionality
> condition? That's not true in LISP, for example, and I am pretty
> sure it's not true in any programming language that uses linked
> lists as a data structure.

YES, we want that kind of extensionality!  But we want it for
*arrays* -- not linked lists.

OK, if you insist. But these arrays are not anything like arrays or vectors in most programming languages. I can have two arrays, in Algol 60 and just about every language since, which have the same elements but are not identical. For example, if one of them is changed or deleted, the other is unchanged.

So it might be a good idea to not call these things 'arrays', maybe? How about 'sequences'?

(Why do you want this extensionality for 'arrays' but not for lists? Just asking.)

 The fact that RDF only provided a
linked list is a historical artifact that only muddies the waters.

The point is that a future RDF should offer arrays that have array
semantics -- not linked list semantics: each element should have
an implied consecutive integer index.  And that index should start
from 0, by the way -- not 1.

OK, but this is syntax, not semantics.

Software developers have to deal
with these things, and developers have learned over several decades
of experience that 0-based indexing is less programming work than
1-based indexing.

An array *could* be implemented as a linked list, but the
implementation is completely immaterial -- and an unfortunate
distraction -- because it should be hidden from the user.

I agree, let us talk at a conceptual level ignoring implementation decisions.

>> Yes, that is exactly what's needed, and it is readily attainable
>> if we eliminate *explicit* blank nodes.  By explicit blank
>> nodes, I mean blank nodes that are written like _:b42 in Turtle.
>> Implicit blank nodes, written with square brackets like "[ ... ]",
>> do not cause problems because they are guaranteed to be acyclic.
>
> Wait, wait, the non sequiturs are making me dizzy. First,
> extensionality (ie the condition same elements => same list)
> has nothing to do with blank nodes.

Unfortunately it has a *lot* to do with blank nodes, because
"sameness" is not easy to determine in the presence of
unrestricted blank nodes.

No, it is trivial, for both 'arrays' and for linked lists using the first/rest vocabulary.

 It is the graph isomorphism
problem

They are both trivial subcases of that general problem.

, as Jeremy Carroll pointed out two decades ago:
https://www.hpl.hp.com/techreports/2001/HPL-2001-293.pdf

https://en.wikipedia.org/wiki/Graph_isomorphism_problem

That's why, after 20+ years, we still do not have a standard for
RDF graph canonicalization, though a W3C group is now working on it.

I know. I only narrowly escaped being drafted onto that WG :-).

But if blank nodes are restricted to being acyclic -- i.e., no blank
node cycles -- then it is trivially easy to compare two graphs for
"sameness".

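For example, a rough sketch of how that comparison might go, in Python with rdflib; the recursive labelling scheme here is only illustrative, and assumes that blank nodes never appear in cycles:

import hashlib
from rdflib import Graph, BNode

def label(g: Graph, node, seen=()):
    # Illustrative scheme: IRIs and literals label themselves; an acyclic
    # blank node is labelled by a hash computed recursively from its
    # outgoing triples.
    if not isinstance(node, BNode):
        return node.n3()
    if node in seen:
        raise ValueError("unexpected blank node cycle")
    parts = sorted((p.n3(), label(g, o, seen + (node,)))
                   for p, o in g.predicate_objects(node))
    return hashlib.sha256(repr(parts).encode()).hexdigest()

def canonical_form(g: Graph):
    return sorted((label(g, s), p.n3(), label(g, o)) for s, p, o in g)

Two graphs without blank node cycles would then count as the "same" exactly when canonical_form returns equal results for both.
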
Aidan Hogan did a wonderfully comprehensive analysis of blank nodes
in his paper "Everything You've Always Wanted to Know About Blank
Nodes": https://aidanhogan.com/docs/blank_nodes_jws.pdf


> Second, even implicit blank nodes are still blank nodes, so how
> can their explicitness be important?

Because eliminating explicit blank nodes is an easy way to guarantee
that the graph has no blank node cycles.  And the vast majority
of blank node usage can be done with implicit blank nodes.

> Third, how did being acyclic enter into the discussion suddenly?

I'm referring to blank node cycles.  As explained above, they are
the source of the problem.

>> This means that it would be easy for tools to generate a
>> consistent internal identifier for them, based recursively on
>> their constituent elements.
>
> And those would be blank node identifiers, right?

No.  They should be some kind of internal identifiers
that the tool can use to easily determine "sameness": if two objects
have the same internal identifier, then they denote the same object.

BnodeIDs do exactly this if by 'object' you mean the bnode itself, which I think is what you need when talking about data structures as part of the syntax.

Those internal identifiers should not be exposed to the user.

As a matter of user interface design and syntactic sweetness, OK, but surely they play exactly the same role as a bnode inside the actual RDF, so why not call them bnodes? If they are something else, then you have introduced a new class of identifiers into RDF triples and the semantics have to be re-written, etc., all of which fuss seems to me to be unnecessary.


> So what you are suggesting here, if I follow you, is a scheme
> for letting systems generate their own bnodeIDs for the innards
> of list structures. Fine, but this is not a modification to RDF.

Close, but not bnode IDs, but internal identifiers; and not list
structures, but arrays.  If those arrays happen to be implemented
as list structures, that's fine but immaterial, because the
implementation should not be exposed to the user.

Arrays, OK, Containers, in fact, right? Though starting from 0 instead of 1, and the graph is required to be sensible in the same way, eg no gaps, no double-entries and the largest-numbered item in the graph is the last item. (A slight scent of nonmonotonicity there, but acceptable.)
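
To make that 'sensible' condition concrete, a small sketch in Python with rdflib; the 0-based membership properties (rdf:_0, rdf:_1, ...) are of course hypothetical here, since the standard container vocabulary starts at rdf:_1:

from rdflib import Graph, Namespace

RDFNS = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")

def well_formed_container(g: Graph, c) -> bool:
    # Hypothetical 0-based convention: collect the numeric indices used on
    # container c and require them to be exactly 0, 1, ..., k, so there are
    # no gaps, no double entries, and the largest-numbered item is the last.
    indices = []
    for p, _ in g.predicate_objects(c):
        if str(p).startswith(str(RDFNS) + "_"):
            indices.append(int(str(p)[len(str(RDFNS)) + 1:]))
    return sorted(indices) == list(range(len(indices)))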

>> (This is closely related to RDF canonicalization, which becomes
>> trivially easy without explicit blank nodes.)  Eliminating
>> explicit blank nodes would mean that we'd lose the convenience
>> of not having to allocate an IRI.  I think there are ways to
>> address that loss in other ways, but that's another topic.
>>
>>> But that would depart from their current interpretation, and
>>> not necessarily fit all use-cases,
>>
>> Agreed, but it would fit the most common use cases.  It doesn't
>> have to fit *all* use cases.
>>
>>> so this is not something to decide lightly. This is the kind of
>>> semantic rabbit hole that Pat Hayes was warning about earlier
>>> in this thread (if I got his point correctly).
>>
>> I hope Pat will correct me if I'm wrong, but my read of
>> the discussion so far is that the semantics would not be a
>> big problem: both arrays and composite objects can have very
>> straight-forward -- and very similar -- semantics.  And it's clear
>> to me at least that although the rdf:aboutEach functionality
>> could be useful in some cases, it is not what we need as the
>> basic array functionality.  The basic functionality that we need
>> is for an assertion about an array to *only* be about that array
>> -- not about every element in that array:
>>
>> ("apples" "bananas" "peaches") ex:length 3 .
>
> If indeed that is all you want, Pat agrees this would be trivially
> straightforward. Pat is however very suspicious that this is
> in fact not all that people want, and they will be writing
> things like
>
> :PatHayes :fatherOf (:SimonHayes :RobinHayes)
>
> before the ink is dry on the specification document.

That's okay!  There would be nothing wrong with writing that,
*provided* that you define the :fatherOf property to mean that
:PatHayes is the father of every person in that list.  It may then
be nonsensical to assert the following:

 :PatHayes :fatherOf :SimonHayes .

But even that could be perfectly fine to write if you instead
define the :fatherOf property to have *conditional* meaning: if
the object of the assertion is a person, then it asserts fathership
about that person.  But if the object of the assertion is an array,
then it asserts fathership about every person listed in that array.

That last is a cute idea. We could have a name for such properties, call them distributive, giving inference patterns like

:P rdf:type :DistributiveProperty .
:A :P (… :Bn …) .
=>
:A :P :Bn .

And there are of course many other possibilities, eg

:P rdf:type :InitialProperty .
:A :P (:B0 …) .
=>
:A :P :B0 .

ie just the first item.  And things like ordering, where

(:B0 … :Bn) rdf:orderedBy :P .
=>
:B0 :P :B1 .
:B1 :P :B2 .
…
:Bn-1 :P :Bn .

eg (2 13 47 128 763 1246) rdf:orderedBy :lessThan .
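
A rough sketch of the distributive pattern as a forward rule, in Python with rdflib; the EX namespace and the DistributiveProperty class IRI are placeholders for illustration only:

from rdflib import Graph, Namespace, RDF
from rdflib.collection import Collection

EX = Namespace("http://example.org/")   # placeholder namespace

def expand_distributive(g: Graph) -> None:
    # For every property typed as a (hypothetical) DistributiveProperty,
    # copy each assertion  :A :P ( ... :Bn ... )  into the assertions
    # :A :P :Bn .
    for p in list(g.subjects(RDF.type, EX.DistributiveProperty)):
        for a, lst in list(g.subject_objects(p)):
            if (lst, RDF.first, None) in g:      # the object is a collection
                for b in Collection(g, lst):
                    g.add((a, p, b))

The 'initial' and 'orderedBy' patterns could be implemented in the same style.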


>> And one other comment . . .
>>
>> On 9/27/22 09:43, Pierre-Antoine Champin wrote:
>>> If we can design other efficient design patterns for conveying
>>> order and "closedness" (such as the one proposed above), I
>>> believe that the need for representing lists would not be as
>>> pressing as suggested in this thread.
>>
>> Possibly.  But software developers have been using arrays for
>> 60+ years, and they *expect* them.  So as a practical matter,
>> I think the straight-forward solution is to add proper support
>> for arrays, perhaps in a new higher-level RDF 2.0 syntax, to
>> avoid breaking any existing RDF or tools.
>>
>> As Manu Sporny put it, by not having proper array support in RDF,
>> we're currently "giving developers a big fat middle finger in
>> that area".
>
> Both you and Manu miss the central point here. RDF is not intended
> to be a notation for software developers to create new structures
> with. It is not a developer toolkit.

That's exactly the weakness that we need to address.

I disagree. It is not a weakness, and to think it is, is a misunderstanding. Chalk and cheese, etc..

> It was designed and intended to be an information exchange
> notation. As soon as you give it to developers and say, in effect,
> go ahead and play with this and build things with it, then all
> of its utility as an information exchange notation is lost,
> because whatever meaning one developer intends to express using
> the structures she develops will be opaque to any other developer
> and any user in a different development environment.

They will only be opaque if the predicates that use them are not
documented.  If they are properly documented, then the meaning will
be clear.  This is true already, of *all* of RDF.  Adding features
like arrays and composite objects does not fundamentally change
the need to define the meaning of the predicates used.

In some large sense, I agree. But extending the logical syntax of any language with a semantics - which is what is being proposed here - requires that the semantic framework be extended to cover these cases. You have suggested how to do that, I agree, and it is exactly the 'minimalist' approach I would suggest myself. But then it needs to be very firmly documented, using all kinds of strict normative language, that this is indeed the semantics everyone should be working with, so for example if anyone wants to express that

:PatHayes :fatherOf (:SimonHayes :RobinHayes) implies :PatHayes :fatherOf :SimonHayes .

then they need to find a way to make this explicit (eg see above) as the semantics of RDF 'arrays' does not itself support this automatically.


The example that you gave above would be meaningless to others if
the :fatherOf relation were not defined.

True, but in a different sense of 'meaning'. Inference machinery does not need to be able to read human documentation, only more RDF. And the inference patterns for 'arrays' and 'lists' and any other new syntax need to be made explicit enough to support RDF inference engines.

 And as I pointed out, it
can just as well be defined to make an assertion about one person
as it can to make an assertion about every person in a list.

> There is a kind of universal understanding about what an RDF triple
> 'means': it says that a relation holds between two things, and
> it is therefore an assertion about a world consisting of these
> things with relations holding between them. This consensus of
> intended meaning is perhaps a bit shaky at the edges and under
> strain in some places, but still is kind of universally understood
> and accepted. But there is no such consensus AT ALL about what a
> list is supposed to mean, or what an array is supposed to mean,
> when used as part of an assertion about this common 'world' of
> things bound together by relations. And because there is no such
> consensus, as soon as these structures are given to developers to
> build with, what they build will have, at best, an idiosyncratic
> meaning private to the community in which the developer is
> working. At which point, the entire purpose of RDF is lost.

Wait, let's not throw the baby out with the bath.  The meaning
will only be private if it is undocumented.  But as I pointed out,
that is already true of *all* of RDF.   You have no idea what
<http://dbooth.org/fribjam> means unless I document its meaning.

But I don't need to know that, if I am an RDF inference engine. All I need are the basic rules for processing RDF, as supported by the semantics. But when the RDF syntax gets extended in new ways, I need that same basic guidance for those new extensions. And all that part of the 'meaning' should be provided in the normative specifications.

The purpose of RDF is *not* lost just because we need to define the
meaning of the predicates that use composite objects and arrays, just
as we need to define the meaning of everything else we use in RDF.

To further clarify, an array such as the following (reusing the
Turtle syntax):

 (:SimonHayes :RobinHayes )

should have no intrinsic meaning beyond the fact that it is an
array of two elements denoted by :SimonHayes and :RobinHayes, in
that order.

Fine. As long as we make this (what I called the minimalist semantics, above) absolutely clear, and require all these developers who are clamoring for better RDF datastructures to stick to this, and not accidentally introduce further assumptions about meanings into the apps and software they develop, then everything will be hunky-dory. Forgive me if I am a little cynical about developers having the restraint to accept this kind of discipline.

 This allows it to be the exact *same* array --
extensionality -- in the following two assertions:

 :myStore :customers ( :SimonHayes :RobinHayes ) .
 :PatHayes :fatherOf ( :SimonHayes :RobinHayes ) .

However, the :customers and :fatherOf predicates can (informally
speaking) impart *different* meaning to those arrays by the way that
they are defined: the :customers predicate can indicate that every
member of that list is a customer of :myStore, and the :fatherOf
predicate can indicate that every member has a father :PatHayes.

This is no different than using the number 42 in different contexts:

 :boston :temperatureFahrenheit 42 .
 :marblesInMyPocket :count 42 .

Formally speaking, the meaning of 42 is the same in both of
those assertions, but it is *used* in different ways by different
predicates, so informally/colloquially we might say that it has
different meaning in different contexts.  The exact same thing
should be true of an array or a composite object.

As I say, fine. If you can get the world to agree.

> Which is why any new syntactic extension to RDF should be given
> a semantics as part of its normative definition, and this should
> be one that is likely to survive the pressures of how developers
> are wanting to use the new structure,

I agree that they need a well-defined meaning.  But the meaning
defined by the RDF standards can be minimal, because additional
meaning can be defined by RDF authors.

> For example, if it is clear that some folk really want to use
> arrays to express n-ary relations, while others really want to use
> them to abbreviate conjunctions but with a closed-world assumption,
> then these two groups should be given distinct extensions to RDF
> syntax which will not be confused with each other, and each given
> a nice crisp semantics and tutorial examples, etc..

For arrays and composite objects, I don't think we need different
syntaxes for all of the many ways that they may be used, because
as demonstrated above, additional meaning can be imparted by the
predicates that use them.

How would you do that for the two cases I sketch above? One thinks that  :A :P (:B0 … :Bn) .
really means (changing notation here) :P(:A :B0 … :Bn), ie that :P is an (n+2)-ary relation. The other thinks that it means the graph
:A :P :B0 .
:A :P :B1 .
…
:A :P :Bn .
ie a conjunction of n+1 binary assertions, together with a kind of CWA about there not being any more of them.

Neither of these can be expressed in current RDF, of course, so each would require a more drastic kind of extension to the RDF semantics. And both of them have been explicitly suggested as likely intended meanings in posts in this and other threads (eg see the RDF-star email archives). Don't you think there might be just a faint possibility that some developers might seize upon the opportunity to start using these new structures in these ways?

 But if commonly used patterns emerge
then it *might* be worth adding syntax to support a few of them.

My concern is that communities will form using a single RDF syntax in these and other ways, all mutually incompatible, and then any attempt to sort out the resulting mess will turn into a turf war about what RDF 'arrays' REALLY mean.

Pat


Thanks,
David Booth

Received on Friday, 30 September 2022 07:27:41 UTC