- From: glenn mcdonald <glenn@furia.com>
- Date: Fri, 15 Apr 2011 09:47:24 -0400
- To: Kingsley Idehen <kidehen@openlinksw.com>
- Cc: "public-lod@w3.org" <public-lod@w3.org>
This reminds me to come back to the point about what I initially called Directionality, and Dave improved to Modeling Consistency.

Dave is right, I think, that in terms of data quality, it is consistency that matters, not directionality. That is, as long as we know that a president was involved in a presidency, it doesn't matter whether we know that because the president linked to the presidency, or the presidency linked to the president. In fact, in a relational database the president and the presidency and the link might even be in three separate tables. From a data-mathematical perspective, it doesn't matter. All of these are ways of expressing the same logical construct. We just want it to be done the same way for all presidents/presidencies/links.

But although directionality is immaterial for data *quality*, it matters quite a bit for the usability of the system in which the data reaches people. We know, for example, that in the real world presidents have presidencies, and vice versa. But think about what it takes to find out whether this information is represented in a given dataset:

- In a classic SQL-style relational database we probably have to just know the schema, as there's usually no exploratory way to find this kind of thing out. The RDBMS formalism doesn't usually represent the relationships between tables. You not only have to know it from external sources, but you have to restate it in each SQL join-query. This may be acceptable in a database with only a few tables, where the field-headings are kept consistent by convention, but it's extremely problematic when you're trying to combine formerly-separate datasets into large ones with multiple dimensions and purposes. If the LOD cloud were in relational tables, it would be awful. Arguably the main point of the cloud is to get the data out of relational tables (where most of it probably originates) into a graph where the connections are actually represented instead of implied.

- But even in RDF, directionality poses a significant discovery problem. In a minimal graph (let's say "minimal graph" means that each relationship is asserted in only one direction, so there's no relationship redundancy), you can't actually explore the data navigationally. You can't go to a single known point of interest, like a given president, and explore to find out everything the data holds and how it connects. You can explore the *outward* relationships from any given point, but to find out about the *inward* relationships you have to keep doing new queries over the entire dataset. The same basic issue applies to an XML representation of the data as a tree: you can squirrel your way down, but only in the direction the original modeler decided was "down". If you need a different direction, you have to hire a hypersquirrel.

- Of course, most RDF-presenting systems recognize this as a usability problem, and address it by turning the minimal graph into a redundant graph for UI purposes. Thus in a data-browser UI you usually see, for a given node, lists of both outward and inward relationships. This is better, but if this abstraction is done at the UI layer, you still lose it once you drop down into the SPARQL realm. This makes the SPARQL queries harder to write, because you can't write them the way you logically think about the question, you have to write them the way the data thinks about the question. And this skew from real logic to directional logic can make them *much* harder to understand or maintain, because the directionality obscures the purpose and reduces the self-documenting nature of the query.

All of this is *much* better, in usability terms, if the data is redundantly, bi-directionally connected all the way down to the level of abstraction at which you're working. Now you can explore to figure out what's there, and you can write your queries in the way that makes the most human sense.
The artificial skew between the logical structure and the representational structure has been removed.

This is perfectly possible in an RDF-based system, of course, if the software either generates or infers the missing inverses. We incur extra machine overhead to reduce the human cognitive burden. I contend this should be considered a nearly mandatory best practice for linked data, and that propagating inverses around the LOD cloud ought to be one of the things that makes the LOD cloud *a thing*, rather than just a collection of logical silos.
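
To make the "generate the missing inverses" idea concrete, here is a minimal sketch in plain Python rather than any particular RDF library. Everything named in it is an assumption for illustration only: the predicate names (hasPresidency, presidencyOf, startYear), the node names, and the helper functions are hypothetical, and triples are just (subject, predicate, object) tuples. It shows a minimal graph in which a presidency has no outward link back to its president, and the same graph after the declared inverses have been materialized.

```python
# Hypothetical inverse pairs; a real dataset would declare these in its
# vocabulary rather than in a Python dict.
INVERSES = {
    "hasPresidency": "presidencyOf",
    "presidencyOf": "hasPresidency",
}


def materialize_inverses(triples, inverses):
    """Return the input triples plus the inverse of every triple whose
    predicate has a declared inverse. Other triples pass through unchanged."""
    result = set(triples)
    for s, p, o in triples:
        if p in inverses:
            result.add((o, inverses[p], s))
    return result


def outward(triples, node):
    """Relationships that start at `node`, i.e. what you can reach by
    navigating from it. A real store would index by subject; the linear
    scan here only keeps the sketch short."""
    return sorted((p, o) for s, p, o in triples if s == node)


# A "minimal graph": each relationship asserted in one direction only.
minimal = {
    ("Lincoln", "hasPresidency", "LincolnPresidency"),
    ("LincolnPresidency", "startYear", "1861"),
}

# The redundant graph: same data, with declared inverses added.
redundant = materialize_inverses(minimal, INVERSES)

print(outward(minimal, "LincolnPresidency"))
print(outward(redundant, "LincolnPresidency"))
```

Starting from LincolnPresidency, the minimal graph only exposes the start year, while the redundant graph also exposes the link back to Lincoln, so the president is discoverable from either end. The cost is the extra stored triple per inverted relationship, which is the machine overhead traded for the reduced human burden.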
Received on Friday, 15 April 2011 13:48:11 UTC