On lists, Lists and (...) - [was] defining the semantics of lists from thomas lörtsch on 2020-06-21 (semantic-web@w3.org from June 2020)

From: thomas lörtsch <tl@rat.io>
Date: Sun, 21 Jun 2020 20:36:05 +0200
To: semantic-web <semantic-web@w3.org>
Message-Id: <242A9103-30D6-4690-B8C8-37FB52C34E8F@rat.io>

Hi,

in May I asked about list semantics as I was trying to understand why rdf:Containers are considered ripe for deprecation and everybody seems to use rdf:Lists instead, or more specifically, what exactly it is that makes people say that "rdf:Lists have more semantics". I also wanted to find a way to put rdf:Containers on equal footing with rdf:Lists for reasons summarized below.
I learned a lot from the ensuing answers and discussions and would like to share how I now see the range of problems and possible solutions w.r.t. lists in RDF. This is predominantly about rdf:Containers and Collections aka rdf:Lists but as with many problems in RDF there are aspects that involve Named Graphs, native DataTypes as nodes, very basic questions of identification and the general design of a Knowledge Representation language that is both simple+approachable and powerful+versatile.

Pat in his first reply to my initial mail warned against pushing the limits of RDF too far as it is not a programming language. OTOH some argue that RDF needs a proper datatype for lists as it is such a ubiquitous and extensible (list > tree > table) datastructure. Then there’s the everlasting attractiveness of Lisp-style code as data and data as code. Most developers try to keep complexity down by minimizing the number of paradigms and languages they use to get some work done. That of course often leads to overly complex languages. Still: a data-oriented language like RDF that offers only half-hearted support for the most basic data structure around doesn’t fit that bill very well. Lists will probably become even more important with more uptake in Linked Data based applications. So maybe better pave the cow paths.
The result of this discussion is important for the question whether lists should become a proper, native datatype in RDF or whether establishing semantic extensions for more rigid triple-based lists through Named Graphs is the way to go instead. Both aspects will be discussed below. The answer to this debate however only tangentially touches on my initial question about rdf:Containers and Collections.

The semantics

The specs are pretty clear about the lack of any formal standing of the semantics of rdf:Containers and rdf:Lists alike. Their semantics is basically just handwaving. What differentiates them most is that rdf:Lists provide an - again: informal - notion of closed-ness as the end of the list is clearly marked by rdf:nil.
Restricting the semantics of any kind of lists in RDF to informal handwaving is necessary to uphold the Open World Assumption of RDF. Each statement has a self-contained meaning and other statements are only allowed to monotonically add to that, never to retract existing statements. Contradicting assertions are unavoidable and occur all the time on the Semantic Web. Constructs from the RDF vocabulary however can’t be allowed to interfere with each other. Lists, which are necessarily composed from multiple statements, cannot be constrained through other statements. Consequently RDF can’t give any guarantees about their well-formedness. All it can do is describe them.
To be precise the semantics of rdf:Lists is indeed a little stronger than just handwaving. The list-ness of rdf:Lists is reflected in its first-rest-ladders - as cumbersome as they may be - and the definite rdf:nil element that closes them means that adding an element to a list requires removing the previous closing statement. This is more than rdf:Containers can ever provide, even with a vocabulary extension as proposed below.
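To make the point about rdf:nil concrete, here is a minimal sketch in Python that models triples as plain tuples. The helper names build_ladder and append_item and the blank node labels are made up for illustration and belong to no RDF library. It only shows that appending to a first-rest-ladder cannot be done by adding triples alone.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"

def build_ladder(items):
    """Return (head, triples) encoding items as an rdf:first/rdf:rest ladder."""
    if not items:
        return NIL, set()                      # the empty list is rdf:nil itself
    triples = set()
    for i, item in enumerate(items):
        cell = f"_:c{i}"                       # hypothetical blank node labels
        nxt = f"_:c{i + 1}" if i + 1 < len(items) else NIL
        triples.add((cell, FIRST, item))
        triples.add((cell, REST, nxt))
    return "_:c0", triples

def append_item(triples, item):
    """Append to a non-empty ladder; the closing triple has to be retracted."""
    closing = [t for t in triples if t[1] == REST and t[2] == NIL]
    assert len(closing) == 1, "expected exactly one closing cell"
    last_cell = closing[0][0]
    new_cell = f"_:c{sum(1 for t in triples if t[1] == FIRST)}"
    updated = set(triples)
    updated.remove(closing[0])                 # the non-monotonic step
    updated.add((last_cell, REST, new_cell))
    updated.add((new_cell, FIRST, item))
    updated.add((new_cell, REST, NIL))
    return updated
```

The result of append_item is not a superset of its input, which is exactly the informal closed-ness described above.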
Semantic extensions to RDF like OWL DL may impose further restrictions and OWL DL does indeed specify constraints for the rdf:List vocabulary that rule out malformed lists like e.g. those having multiple heads, tails or branches. However for other reasons it reserves the use of rdf:List to encode T-Box axioms only.
Lists not implemented by triples but inside a node as a proper datatype are another option, discussed further below.
Okay, I hope I got the semantics right this time. A remark: the RDF 1.0 Primer [0] and Semantics [1] from 2004 provide very gentle introductions to the semantics of RDF. I should have read them before bothering this list.

Containers versus Collections

The rdf:Container and rdf:List vocabularies have different strengths and weaknesses. The rdf:List syntax is based on Lisp-style first-rest-ladders and is very verbose. It's hard to write and read, even harder to query, has a high triple count, performs worse than rdf:Containers [2] and can’t be used in OWL DL outside of OWL axiomatic statements. Its big advantage is that the widely preferred syntax for RDF, Turtle, provides very nice syntactic sugar for rdf:Lists. The rdf:Container vocabulary on the other hand has a syntax that is quite easy to write and read, it has some preconfigured types that reasonably well capture common use cases for lists, it is easier to query in SPARQL and performs better in common triple stores. The syntax however, as okay as it might be, loses hands down against the syntactic sugar of rdf:Lists in Turtle. And unlike rdf:Lists, which are always finite, rdf:Containers can’t be limited to a certain length.
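As a rough illustration of the Container side, the following Python sketch again models triples as tuples. The helper name build_container and the node label are assumptions for the example only. It shows the lower triple count and that a Container can always be extended by adding statements, which is why it cannot be closed.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def build_container(node, items, kind="Seq"):
    """Describe items as members of an rdf:Seq, rdf:Bag or rdf:Alt."""
    triples = {(node, RDF + "type", RDF + kind)}
    for i, item in enumerate(items, start=1):
        triples.add((node, RDF + f"_{i}", item))   # rdf:_1, rdf:_2, ...
    return triples

seq = build_container("_:s", ["a", "b", "c"])
# One type triple plus one triple per member; a first-rest-ladder
# needs two triples (rdf:first and rdf:rest) per member.

# Adding a member is purely additive: nothing has to be retracted.
extended = seq | {("_:s", RDF + "_4", "d")}
```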
While the claim is often that rdf:Lists should be preferred over rdf:Containers because they "have more semantics" I strongly suspect that the most important reason for their popularity is the syntactic sugar that Turtle and now also JSON-LD provide. Some went back to using rdf:Containers because the rdf:List support in SPARQL is so cumbersome and incomplete. It has to be said though that while querying rdf:Lists in SPARQL is not pleasant, support for rdf:Containers is not that much better and could definitely use some love as well (*).
The argument of "more semantics" however is really self-contradictory: the semantics of rdf:List, informal as they are, always limit and close a list, which is the exact opposite of what RDF is designed for. I’m not arguing against the use case per se, obviously, but making closed-ness the default, even the exclusive semantics for lists is going fully against the essential idea of RDF.
So if one wants to close a list for some good reason (**) one can’t use rdf:Containers. Going with the swarm means closing each and every list, no matter if that makes sense or not. If one wants to use OWL DL reasoning rdf:Lists are off limits. If easy writing is required, rdf:Lists rule supreme. When querying comes into the picture (ergo: actually putting to use what we wrote) rdf:Containers win again, although not by a very wide margin. But in the end, all things considered, nobody wins. No solution has it all, and for no good reason. Put harshly: it’s a mess.

Now what?

SPARQL 1.2 is expected to provide better support for rdf:List through easier query syntax and better support for returning ordered lists. This could solve the most glaring gap in tool support for rdf:Lists. However the intimidating base syntax of first-rest-ladders will inevitably still surface every now and then.
Performance of rdf:Lists might get better when implementations learn to represent rdf:Lists internally in a more reasonable way.
Another question is if the Semantic Web is really well advised to standardise on a list style that through its closed-ness runs against the fundamental design of its semantics. RDF semantics are not the most popular topic but maybe people would have an easier time following them if certain kinks and hard edges were smoothed away.
In OWL DL, the flagship of reasoning powers on the Semantic Web, rdf:Lists are reserved for axiomatic statements in the T-Box. Closed-ness of lists in A-Box data can only be expressed through even more cumbersome design patterns which very seldom get used in the more pedestrian areas of the Semantic Web. That presents an at least unpleasant gap in interoperability. But OWL DL largely has its own user community which can handle complex design patterns alright.
So the jury is still out if anything has to be done at all. The situation surely could be improved, but is it worth the effort?

A little upgrade to the Container vocabulary

I entered the discussion with an idea to extend the rdf:Container vocabulary by a length attribute that would allow closing Container-based lists just like rdf:Lists. That way I thought we could keep all the good parts of rdf:Containers, add the extra semantics that rdf:Lists have, and put such lists into OWL DL reasoners without much ado. The only real issue would be how to provide syntactic sugar for them in Turtle as all the braces are already taken. After I better understood that the semantic issues go a bit deeper and after some clarifications and discussions I still think the approach has some merits, is actually rather modest and not too hard to implement. So here it is, again but improved:

rdf:limit
    rdfs:domain rdf:Limited ;
    rdfs:range rdfs:ContainerMembershipProperty .
rdf:Limited
    rdfs:subClassOf rdfs:Container .

The naming of the class rdf:Limited tries to evoke abstract classes in Java as the rdf:limit property shouldn’t be used on its own but only in conjunction with one of the proper rdf:Container subclasses rdf:Seq, rdf:Bag and rdf:Alt.

The RDF semantics of limited Containers are simple and they are just as informal as the rdf:Container vocabulary they are based on. Describing (!) an rdf:Seq, rdf:Bag or rdf:Alt Container as limited expresses the intent that the list is complete and that no other members need to be or should be added. As one would naturally expect from any rdf:Container the container membership properties should start with rdf:_1 and be incremented by 1. In other words there should be no gaps in the enumeration of the members as otherwise the rdf:limit property wouldn’t really describe the size of a list. Surplus members shouldn’t be considered actual members of the list.
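How an application might read such a description can be sketched as follows, in Python with triples as plain tuples. Note that rdf:limit is only the property proposed in this mail and not part of the RDF vocabulary. The function name check_limited and its return shape are likewise assumptions made for this example, and it is one possible closed-world reading, not a normative one.

```python
import re

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LIMIT = RDF + "limit"   # proposed in this mail, NOT part of the RDF vocabulary
_MEMBER = re.compile(re.escape(RDF) + r"_([1-9][0-9]*)$")

def index_of(prop):
    """Return n for the membership property rdf:_n, else None."""
    m = _MEMBER.match(prop)
    return int(m.group(1)) if m else None

def check_limited(triples, node):
    """Return (members, surplus, problems) for a container described as limited."""
    slots, limits = {}, []
    for s, p, o in triples:
        if s != node:
            continue
        if p == LIMIT:
            limits.append(index_of(o))     # the limit's value is itself rdf:_n
        elif index_of(p) is not None:
            slots.setdefault(index_of(p), []).append(o)
    if len(limits) != 1 or limits[0] is None:
        return [], [], ["exactly one rdf:limit pointing to some rdf:_n is required"]
    n, problems = limits[0], []
    for i in range(1, n + 1):
        if i not in slots:
            problems.append(f"gap: rdf:_{i} is missing")
        elif len(slots[i]) > 1:
            problems.append(f"rdf:_{i} has several values")
    # If a slot has several values, the smallest is picked to stay deterministic.
    members = [min(slots[i]) for i in range(1, n + 1) if i in slots]
    # Anything beyond the limit is reported but not treated as a member.
    surplus = [v for i in sorted(slots) if i > n for v in sorted(slots[i])]
    return members, surplus, problems
```

For a Seq with rdf:_1, rdf:_2 and rdf:_4 and a limit of rdf:_3 this reports two members, one surplus value and one gap.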
Constraint languages like SHACL and ShEx can note the intent stated by rdf:limit and take it from there by providing ways to enforce the limit together with the well-formedness conditions it relies on in the Closed World semantics of applications.
Likewise semantic extensions to RDF like OWL DL might introduce syntactic well-formedness restrictions: limited rdf:Containers would have to be segregated from rdf:Lists, they would have to be well-behaved (starting from _1, increment by 1, no doubles, limit is the last member) and grounded. Consequently they would entail missing members with blank nodes (as the limit suggests that they exist), treat multiple values of the same membership property as owl:sameAs each other and consider statements introducing surplus members outside the limit inconsistent (***). Preliminary discussions suggest that the slightly unconventional approach of using rdfs:ContainerMembershipProperties not as properties but values of the rdf:limit property will indeed work in OWL DL.

Support for rdf:Containers with syntactic sugar equal to that of rdf:Lists in Turtle would probably be the biggest issue in an effort to bring rdf:Containers on par with rdf:Lists. JSON-LD technically could be extended rather easily [7] but in Turtle all braces are already taken. So either Turtle users would have to live with ContainerMembershipProperties or some more extravagant approaches would have to be considered like e.g. attributing bracketed lists like so: (…)@Seq - similar to how JSON-LD does it - but that is really just an illustration, not an actual proposal.

Named Graphs

Named Graphs have been mentioned as a means to define contexts in which other semantics than the default OWA could be enforced. Named Graphs as implemented in SPARQL and standardized in RDF 1.1 [8] however lack the solid denotational semantics that the original Named Graphs proposal by Carroll et al. 2005 [9] defined. Ultimately this needs to be fixed but in practice a simple workaround would suffice, e.g. a vocabulary extension that allows augmenting graph identifiers with denotational semantics. I’m not aware of anybody doing that today, which seems strange (and I’d be glad for references to the contrary!).
Also I’m sceptical whether Named Graphs are the right tool to define semantics for such a fundamental datastructure as lists. They might be if modelling lists is indeed only a secondary concern - the question is whether that is really a viable position. Extending Named Graphs in SPARQL and RDF to the semantics as proposed in the original paper will be the topic of another mail.

Native support for lists per a new datatype

Lists as first class citizens in RDF, represented by their own datatype, are a completely different approach that keeps the simple triple structure and hides structural complexity inside nodes of a specific type.
Datatyped lists would indeed provide some interesting properties: lists would be encapsulated in single nodes and consequently the triple as the basic bearer of truth in RDF would be uncompromised. A malformed list wouldn’t pose any semantic problem to the triple or graph in which it occurs. Datatyped lists could be designed to make explicit all the semantics that are already implicitly available in the different types of Containers and Collections: ordering, duplicates, alternatives, limits, maybe even some more. Implementations could easily optimize performance of such lists.
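Purely as an illustration of the encapsulation idea, and not a proposal for an actual datatype, a list held inside a single node could be modelled like this in Python. The class name ListLiteral and the ex: names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ListLiteral:
    """Hypothetical list datatype: the whole list is one immutable node."""
    items: Tuple[str, ...]

# One triple carries the complete ordered list in its object position.
triple = ("ex:paper", "ex:authors", ListLiteral(("ex:alice", "ex:bob")))
graph = {triple}

# Two list nodes are the same node exactly when their contents agree.
same = ListLiteral(("ex:alice", "ex:bob")) == triple[2]
```

The graph stays a set of ordinary triples, and whatever happens inside the node does not affect the triple structure around it.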
IMO they would however also need a triple-based representation to nail down their semantics in RDF in a backwards compatible way. N3 develops a mixed approach where the syntactic sugar known from Turtle is interpreted as a proper datatype with its own methods (built-ins in N3 speak) but also mirrored in the rdf:List vocabulary [3]. The same list can occur as bracketed syntactic sugar or a first-rest-ladder in the same RDF graph. In principle this approach could just as well be implemented with the rdf:Container vocabulary.
Some tricky semantic questions around identity arise as encapsulating a list in a node introduces a level of indirection. N3’s issue about the relation between an rdf:List represented as first-rest-ladder and the same list as a (…) datatype illustrates the problem. Earlier works suggest to interpret the first-rest-ladder as a kind of reification of the bracketed list but there remain open questions. Maybe the triplified version could be understood as owl:importing a datatyped list into the triple realm?
Proper identification always tends to have more facets than one expects and RDF has a record of oversimplifying that problem. OTOH issues about identification need to be resolved for good in more areas of RDF anyway - like denotation vs indication in URI semantics, Named Graph semantics in RDF 1.1, the upcoming RDF* semantics - and this might be a chance to tackle the problem in a principled way.
Other points need clarification too, e.g. attributions to a list node have to be disambiguated from attributions to individual list members (even if the attribution is made to all members). Pat provided a more detailed account of problems that will have to be solved [4] and David recorded them in the EasierRDF issue tracker [5].
Introducing more complex datatypes doesn’t need to be limited to lists but could also be used for reification (as RDF* proposes) and even n-ary relations etc. That would be quite a leap - however, as this represents a rather clean slate it’s probably better to start by thinking big.
Last but not least, introducing a native datatype in RDF requires quite some implementation effort throughout the whole Semantic Web tool chain. However RDF* is starting to prove that this can indeed happen if the need is understood and a rough consensus has been achieved.

How does it all fit together

A desirable outcome of the whole list/List issue IMO would be if in the long run:
- Containers became the go-to solution for semantically lightweight descriptions of list-like data structures. Their informal Seq, Bag and Alt semantics seem to be intuitive and well understood. The proposed extension by a 'limit' attribute would satisfy a common need by data publishers to express that some description is complete. The bare syntax of rdf:Containers is bearable all things considered, unlike rdf:Lists. Still, syntactic sugar in Turtle and JSON-LD and a little more support in SPARQL would be very helpful.
- Constraints formulated in SHACL or ShEx could take it from there and check and enforce such informally stated semantic intuitions and objectives in applications.
- Named Graphs may be used to denote environments where certain semantic restrictions apply and are enforced like e.g. lists being closed, default reasoning regime etc.
- Those requiring DL-safe constructs from the start or concentrating on DL environments anyway could alternatively resort to more involved design patterns for sequences.
- rdf:Lists were only used for OWL DL axiomatic statements.
- Lists as datatypes could complete this range of solutions. They would probably best serve the needs of application-oriented Linked Data-based data aggregation and publishing which doesn’t focus so much on modelling intricate knowledge structures and meta-modelling but on efficiency and high throughput.

Thomas

P.S.: Long emails take a lot of effort to write and then nobody reads them because they are so long. But do complex discussions have to be constrained to academic papers and workshops? Of course I’m speaking out of self interest but I’d really like it if this list were used more for discussions about fundamental questions, for position statements, and to discuss strategic directions for the Semantic Web.

(*) The paper by Daga et al. [2] discusses list support in RDF stores in quite some detail. It suggests that it should be relatively easy to improve rdf:Container support in SPARQL considerably. It also notes that only querying was investigated and that in an update heavy scenario the first-rest-ladders of rdf:List might provide an advantage.
(**) I should probably be really specific and word this as "if one wants to describe a list as being closed" but you know what I mean
(***) I hope I understood Pat right when he wrote [6]: "Well, we could just have a semantic rule that says that IEXT(I(rdf:_n)) cannot contain < x, y> when rdfx:last contains <x, I(rdf:_m)> for any n>m. That would make statements about entries after the last one be inconsistent (always false) and allow applications to detect it by examining the innards of the membership properties, but not express it in RDF triples. But they could for example issue a potential error warning or take other special actions. "

[0] https://www.w3.org/TR/2004/REC-rdf-primer-20040210/
[1] https://www.w3.org/TR/rdf-mt/
[2] Daga et al. 2019 "Modelling and querying lists in RDF. A pragmatic study"
http://CEUR-WS.org/Vol-2496/paper2.pdf
[3] https://lists.w3.org/Archives/Public/semantic-web/2020May/0065.html
[4] https://lists.w3.org/Archives/Public/semantic-web/2020May/0069.html
[5] https://github.com/w3c/EasierRDF/issues/74
[6] https://lists.w3.org/Archives/Public/semantic-web/2020May/0108.html
[7] https://lists.w3.org/Archives/Public/semantic-web/2020May/0112.html
[8] RDF 1.1 Note on Named Graphs, http://www.w3.org/TR/rdf11-datasets/
[9] Carroll et al. 2005 "Named Graphs"
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3199260

Received on Sunday, 21 June 2020 18:36:32 UTC