Re: Data Model Assumptions

From: Ivan Herman <ivan@w3.org>
Date: Tue, 18 Aug 2015 12:57:25 +0200
Cc: W3C Public Annotation List <public-annotation@w3.org>
Message-Id: <9A34C6C5-58C2-4AE9-8244-020F8E2B9574@w3.org>
To: Doug Schepers <schepers@w3.org>
Hi Doug,

Thanks. Not taking sides, just making some statements a bit more precise...

> On 18 Aug 2015, at 07:16 , Doug Schepers <schepers@w3.org> wrote:
> Hi, folks–
> During a conversation with Rob, Frederick, and Ivan, we realized that we have different conceptions about what the core of the "data model" is, which has led to some of the misunderstandings about what is possible and desirable.
> My idea of the Data Model has always rested on the notion of objects with properties, which is informed by my JavaScript background.
> The way I've been thinking about the data model is a set of objects with child objects and properties, where the properties are name-value pairs:
> * we have an Annotation object, with some properties like id, author, timestamp and other provenance properties, and a role/motivation;
> * the Annotation object also has one or more child objects of Body or Target type:
> ** the Body object has properties like id, type, format, language, and value/content
> ** the Target object has properties like id, type, source, and one or more Selector objects
> *** the Selector object has properties like id, type, value, and other type-specific properties
> Thus, it seems perfectly normal that we can add arbitrary properties, or even objects, to any of these objects, in order to add information about it, or to deliberately move properties from the parent object to one of the child objects to change where the level of specificity is defined.
> For example, as in the copy-edit use case, if there are two different types of Body, one for the replacement text and one for comments or explanation about a replacement, moving the role/motivation property from the Annotation object to the child Body objects seems reasonable.
> This relatively unstructured, self-contained object-property system was the full extent of my notion of the data model.
> The data model, of course, is separate from the serialization, which could be expressed as JSON, JSON-LD, HTML, Turtle, or whatever other format is desired.
> Others in the WG, especially those from the Open Annotation Community Group, seem to have an additional set of constraints on top of this object-property data model, as RDF or Linked Data. I can't claim to understand all the details, but it seems to consist of at least:
> * strong datatyping, with a URI-reference system to type definitions

Strong datatyping is an option, not a requirement. It is perfectly o.k., from an RDF point of view, to have plain text literals without any datatype attached to them (well, formally speaking, they are then xsd:string datatyped, but that is only the formalism).

I think the only point where there *is* some sort of strong datatyping is the strict differentiation between when a string of the form "http://example.org" is just a string and when it is, in fact, a representation of a URI that is to be dereferenced if the user/client/whatever so wishes.
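A small JSON-LD sketch of that differentiation (the `homepage`/`note` terms are invented for this example): whether a value is a plain string or a dereferenceable URI is decided by the term definition in the @context, not by the shape of the value itself.

```json
{
  "@context": {
    "homepage": { "@id": "http://example.org/vocab#homepage", "@type": "@id" },
    "note":     { "@id": "http://example.org/vocab#note" }
  },
  "homepage": "http://example.org",
  "note": "http://example.org"
}
```

Here `homepage` expands to a URI reference (something a client may dereference), while the identical-looking `note` value remains just a string literal.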

> * a subject–predicate–object triple "grammar" for the objects and properties
> * unusual, but apparently optional, "predicate" names (e.g. "hasBody")

That is not part of any RDF standard; it is just the habit of a particular community (often inherited from people who defined vocabularies, library catalogues, etc., well before the Web existed).

> * a requirement that each object (or subobject) be independently addressable on the Web

Although I have a disagreement with Rob on this, s/each/most/ in my view.

We have to differentiate between RDF as a standard and Linked Data in this respect.

- The RDF standard has the notion of blank nodes, which are perfectly ok node types that one can use as subjects or objects, and which do not have a Web-addressable identity. They have a peculiar behaviour in terms of semantics (therein lie many of the controversies around them), but we should not get into those here; we can safely ignore that.

- Linked Data is a set of, shall we say, best-practice guidelines and deployment strategies that have evolved over the years, but it does not have the formal standing that RDF has. As a consequence, there is no precise definition for it, nor is there a standard that governs the dos and don'ts. The Linked Data community (sorry, some members of that community) is very vocal in rejecting the usage of blank nodes altogether, or almost altogether. In many situations they are right, but taking that rejection as dogma is not universally accepted within that community either. (See also my note on integration below.)

> * a notation that expresses each name-value property pair as an "assertion", where each assertion has a global scope not confined to the annotation

Correct, but with the additional caveat that the global scope is really there only when the subject *is* a Web-dereferenceable resource. If the subject is a blank node, the assertion is not really global any more.

> * a peculiar behavior around lists (which I don't really understand)

If you have ever looked at the way lists are represented in Lisp, then you got it: lists in RDF are, essentially, Lisp lists, expressed through RDF triples.
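To illustrate with a small JSON-LD sketch (the `ex:` vocabulary is made up for this example): in the compact form a list looks like an ordinary JSON array, while underneath it expands to a Lisp-like chain of cons cells.

```json
{
  "@context": { "ex": "http://example.org/vocab#" },
  "@id": "http://example.org/thing1",
  "ex:items": { "@list": ["a", "b", "c"] }
}
```

Expanded to triples, this becomes a chain of blank nodes of the form `_:b0 rdf:first "a" ; rdf:rest _:b1 . _:b1 rdf:first "b" ; rdf:rest _:b2 . _:b2 rdf:first "c" ; rdf:rest rdf:nil .`, i.e., exactly the `(a . (b . (c . nil)))` structure of a Lisp list.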

> (Please correct me if I'm wrong.)

One essential addition at this point (this is important for the discussion): many of these features, or more exactly the additional complication they bring, can be hidden in the particular case of a JSON-LD serialization. The property names can be set to whatever we want; lists can be used almost exactly as lists in JSON; in most of our practical cases the differentiation between URIs as strings and URIs as identifiers can be forgotten (because the property that is used for that purpose will automatically determine which of the two views applies); blank nodes can be created easily on the fly; etc.
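As a hedged illustration of that hiding (the oa: URIs below are how the Open Annotation vocabulary spells `hasBody`/`hasTarget`; the values are made up), a JSON-LD @context can alias the predicate names to whatever a JSON developer expects:

```json
{
  "@context": {
    "body":   { "@id": "http://www.w3.org/ns/oa#hasBody" },
    "target": { "@id": "http://www.w3.org/ns/oa#hasTarget", "@type": "@id" }
  },
  "@id": "http://example.org/anno1",
  "body": { "@value": "This is a comment" },
  "target": "http://example.org/page1"
}
```

A JSON developer sees plain `body` and `target` keys; an RDF consumer still gets `oa:hasBody` and `oa:hasTarget` triples.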

Turtle is more restrictive in one sense (there is no aliasing possibility for property names), but that is a language for the RDF community anyway. RDFa does not have aliasing either, nor does microdata.

> The consequence of some combination of these additional constraints seems to impose a rigid syntactic/semantic object structure that makes it more difficult to express objects with flexible property specificity. This leads to an object structure with additional nesting and sets of properties that I don't personally find intuitive, and which I suspect other JavaScript developers won't either.
> Again, the example of the copy-edit use case, with roles/motivations on the body, seems to be difficult to express concisely or simply.
> That said, structuring the annotation objects this way seems to add some ability to parse the annotation through an "RDF reasoner" to help make derivative assertions about the annotation body and target, with other annotations or data. I am not totally clear on this, but I'm open to the idea that this has some important effects.

There again, opinions may differ, but the reasoning aspect is, in my view, the least interesting one. Actually, many of the RDF environments around do only minimal reasoning (if any) out of the box.

I think the main advantage of RDF (and Linked Data) is the ability to do data integration, and possibly to make queries (e.g., in SPARQL) over a virtually integrated set of data with very little extra difficulty. I.e., I can combine the resources in my annotations with information on DBPedia, with Google's knowledge graph (some of it is accessible as RDF, namely the information stored using schema.org), with bibliographic data that a particular library may make available in RDF, or with loads of additional datasets that are all interlinked in that massive Linked Data cloud. I am not saying it is easy to do, because the sheer size makes it difficult, and I am not saying all problems are solved, but that is certainly the ideal of Linked Data, with an active community working on it. And, of course, annotation data can be integrated with other annotation data :-)

Data integration is, again in my view, the most important selling point of RDF (and Linked Data), not reasoning.

This is where we can circle back to the blank node issue: what one should always consider, when deciding whether a resource should stay a blank node or whether it is to get a Web URI, is whether that particular resource has any reason to be integrated with other resources per se, or whether it is always bound to remain internal. If the latter, URIs become just an unnecessary chore.
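A minimal sketch of that choice in JSON-LD (the `ex:` terms are again made up): a nested object without an `@id` stays a blank node, addressable only within this annotation, whereas giving it an `@id` makes it a resource that can be integrated from outside.

```json
{
  "@context": { "ex": "http://example.org/vocab#" },
  "@id": "http://example.org/anno1",
  "ex:body":   { "ex:text": "purely internal body, no @id, hence a blank node" },
  "ex:target": { "@id": "http://example.org/page1" }
}
```

If the body could ever be referred to from elsewhere (say, by another annotation), that is the signal to promote it from a blank node to a resource with its own URI.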



> So, by all agreeing that we would start with the Open Annotation Data Model as a starting point, we seem to have been agreeing to different fundamental understandings of what that data model consists of:
> 1) a nested object-property data model; or
> 2) an RDF triple data model, with all the concomitant constraints.
> I hope I've characterized it fairly, and that we can use this shared understanding to better discuss what we want and need. If not, I welcome a more accurate description of these two data models.
> With that as the (rough) basis, I'd like to extrapolate a bit.
> One could reasonably argue that the standardized interchange format between annotation applications should be the simplest common set of features, perhaps with some low-cost extras that fit nicely and which enhance the format in a way that enables the minimum viable product for the most prevalent apps. The simple object-based data model I've described above is very much in line with that goal; it conveys the necessary information that would allow a large number of apps and services to model their data for lossless interchange, with a minimum of extra development work. Following a design principle like this creates a strong incentive towards, and avoids a disincentive against, adoption by vendors.
> By contrast, inheriting a set of additional requirements from Linked Data/RDF increases the complexity of the model, both in the number and type of properties and in the rigidity of the structure of the data. So, as a measure of the universality of appeal and ease of adoption, requiring Linked Data/RDF is an additional burden that should not be part of the simplest possible data model.
> However, I'm not going so far as that, for two reasons:
> * There are many existing vendors who do want the features that are available (only?) through Linked Data/RDF
> * It's possible that some of these features may add significant value above and beyond what the minimum viable data model would include, and thus be a more tempting implementation target.
> If this is what we as a WG believe, then we should clearly identify and communicate what value is added by the addition of these design constraints, in a concise, concrete, and compelling explanation. I don't believe it's enough to cite conformance to some document of architectural principles without describing precisely how these benefits convey at the level we're talking about.
> In addition, I think we should continue to strive to make the smallest possible impact on complexity of understanding (for Web developers) and implementation (for vendors). We've taken steps in that direction, and I'd like to see that continue.
> I feel like I'm probably in the minority on some of these views (within the WG, not necessarily in the wider developer community), so if anyone (inside the WG or outside of it) shares similar notions, I'd appreciate hearing from you.
> Regards–
> –Doug

Ivan Herman, W3C
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704

Received on Tuesday, 18 August 2015 10:57:35 UTC