Re: Take2: 15 Ways to Think About Data Quality (Just for a Start) from glenn mcdonald on 2011-04-15 (public-lod@w3.org from April 2011)

From: glenn mcdonald <glenn@furia.com>
Date: Fri, 15 Apr 2011 09:47:24 -0400
To: Kingsley Idehen <kidehen@openlinksw.com>
Cc: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <BANLkTi=YkJXiJT9BN4kCgokrtf37rOxuEw@mail.gmail.com>
This reminds me to come back to the point about what I initially
called Directionality, and Dave improved to Modeling Consistency.

Dave is right, I think, that in terms of data quality, it is
consistency that matters, not directionality. That is, as long as we
know that a president was involved in a presidency, it doesn't matter
whether we know that because the president linked to the presidency,
or the presidency linked to the president. In fact, in a relational
database the president and the presidency and the link might even be
in three separate tables. From a data-mathematical perspective, it
doesn't matter. All of these are ways of expressing the same logical
construct. We just want it to be done the same way for all
presidents/presidencies/links.

But although directionality is immaterial for data *quality*, it
matters quite a bit for the usability of the system in which the data
reaches people. We know, for example, that in the real world
presidents have presidencies, and vice versa. But think about what it
takes to find out whether this information is represented in a given
dataset:

- In a classic SQL-style relational database we probably have to just
know the schema, as there's usually no exploratory way to find this
kind of thing out. The RDBMS formalism doesn't usually represent the
relationships between tables. You not only have to know it from
external sources, but you have to restate it in each SQL join-query.
This may be acceptable in a database with only a few tables, where the
field-headings are kept consistent by convention, but it's extremely
problematic when you're trying to combine formerly-separate datasets
into large ones with multiple dimensions and purposes. If the LOD
cloud were in relational tables, it would be awful. Arguably the main
point of the cloud is to get the data out of relational tables (where
most of it probably originates) into a graph where the connections are
actually represented instead of implied.

- But even in RDF, directionality poses a significant discovery
problem. In a minimal graph (let's say "minimal graph" means that each
relationship is asserted in only one direction, so there's no
relationship redundancy), you can't actually explore the data
navigationally. You can't go to a single known point of interest, like
a given president, and explore to find out everything the data holds
and how it connects. You can explore the *outward* relationships from
any given point, but to find out about the *inward* relationships you
have to keep doing new queries over the entire dataset. The same basic
issue applies to an XML representation of the data as a tree: you can
squirrel your way down, but only in the direction the original modeler
decided was "down". If you need a different direction, you have to
hire a hypersquirrel.

- Of course, most RDF-presenting systems recognize this as a usability
problem, and address it by turning the minimal graph into a redundant
graph for UI purposes. Thus in a data-browser UI you usually see, for
a given node, lists of both outward and inward relationships. This is
better, but if this abstraction is done at the UI layer, you still
lose it once you drop down into the SPARQL realm. This makes the
SPARQL queries harder to write, because you can't write them the way
you logically think about the question; you have to write them the way
the data thinks about the question. And this skew from real logic to
directional logic can make them *much* harder to understand or
maintain, because the directionality obscures the purpose and reduces
the self-documenting nature of the query.
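
To make the outward/inward asymmetry in the bullets above concrete,
here is a minimal Python sketch. It assumes a "minimal graph" stored
the way a subject-oriented store would hold it: indexed by subject
only. The node names, the predicate names (hadPresidency, precededBy,
startYear) and the helper functions are all hypothetical, invented
for illustration; they are not from any real dataset or library.

```python
from collections import defaultdict

# Hypothetical minimal graph: every relationship is asserted in one
# direction only (president -> presidency), never the reverse.
TRIPLES = [
    ("washington", "hadPresidency", "presidency_1"),
    ("adams", "hadPresidency", "presidency_2"),
    ("presidency_1", "startYear", "1789"),
    ("presidency_2", "startYear", "1797"),
    ("presidency_2", "precededBy", "presidency_1"),
]


def build_subject_index(triples):
    """Index the triples by subject, as a node-centric view would."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
    return index


def outward(index, node):
    """Outward relationships: a single lookup at the node itself."""
    return list(index.get(node, []))


def inward(index, node):
    """Inward relationships: nothing is stored at the node, so we
    have to scan every edge in the whole dataset."""
    return [
        (s, p)
        for s, edges in index.items()
        for p, o in edges
        if o == node
    ]


index = build_subject_index(TRIPLES)
print(outward(index, "presidency_1"))
print(inward(index, "presidency_1"))
```

Standing at presidency_1, navigation alone shows only its start year.
Finding out that a president and a later presidency point at it takes
the full scan in inward(), which is the "new query over the entire
dataset" described above.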


All of this is *much* better, in usability terms, if the data is
redundantly, bi-directionally connected all the way down to the level
of abstraction at which you're working. Now you can explore to figure
out what's there, and you can write your queries in the way that makes
the most human sense. The artificial skew between the logical
structure and the representational structure has been removed. This is
perfectly possible in an RDF-based system, of course, if the software
either generates or infers the missing inverses. We incur extra
machine overhead to reduce the human cognitive burden. I contend this
should be considered a nearly-mandatory best-practice for linked data,
and that propagating inverses around the LOD cloud ought to be one of
the things that makes the LOD cloud *a thing*, rather than just a
collection of logical silos.
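
As a rough sketch of what "generating the missing inverses" could
look like, assuming the software is given a table of which predicate
is the inverse of which: the predicate names, the INVERSE_OF table
and the with_inverses() helper below are hypothetical, made up for
illustration, and a real system might instead infer the same triples
from owl:inverseOf declarations.

```python
# Hypothetical minimal graph, one direction per relationship.
TRIPLES = [
    ("washington", "hadPresidency", "presidency_1"),
    ("presidency_2", "precededBy", "presidency_1"),
    ("presidency_1", "startYear", "1789"),
]

# Hypothetical inverse table. Predicates that point at literal
# values (like startYear) have no inverse and are left alone.
INVERSE_OF = {
    "hadPresidency": "presidencyOf",
    "precededBy": "followedBy",
}


def with_inverses(triples, inverse_of):
    """Return the triples plus a materialized inverse for every
    relationship whose predicate appears in the inverse table."""
    # Make the table symmetric so either direction can be the one
    # that was originally asserted.
    both = dict(inverse_of)
    both.update({v: k for k, v in inverse_of.items()})

    result = set(triples)
    for s, p, o in triples:
        if p in both:
            result.add((o, both[p], s))
    return result


redundant = with_inverses(TRIPLES, INVERSE_OF)
for triple in sorted(redundant):
    print(triple)
```

The cost is the extra stored triples (the machine overhead); the gain
is that presidency_1 now carries its own presidencyOf and followedBy
links, so it can be explored and queried from either end.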
Received on Friday, 15 April 2011 13:48:11 UTC