Re: Take2: 15 Ways to Think About Data Quality (Just for a Start)

Hi Glenn,

Thanks a lot for your insightful thoughts. I think I can fully agree with
them. This topic reminds me a bit of a question I asked some time ago
on SemanticOverflow (now answers.semanticweb.com):

"When should I use explicit/anonymous defined inverse properties?" [1]

(btw, this question is still not marked as "answered" ;) )

Cheers,


Bob


[1] 
http://answers.semanticweb.com/questions/1126/when-should-i-use-explicitanonymous-defined-inverse-properties 


On 4/15/2011 3:47 PM, glenn mcdonald wrote:
> This reminds me to come back to the point about what I initially
> called Directionality, and Dave improved to Modeling Consistency.
>
> Dave is right, I think, that in terms of data quality, it is
> consistency that matters, not directionality. That is, as long as we
> know that a president was involved in a presidency, it doesn't matter
> whether we know that because the president linked to the presidency,
> or the presidency linked to the president. In fact, in a relational
> database the president and the presidency and the link might even be
> in three separate tables. From a data-mathematical perspective, it
> doesn't matter. All of these are ways of expressing the same logical
> construct. We just want it to be done the same way for all
> presidents/presidencies/links.
>
> But although directionality is immaterial for data *quality*, it
> matters quite a bit for the usability of the system in which the data
> reaches people. We know, for example, that in the real world
> presidents have presidencies, and vice versa. But think about what it
> takes to find out whether this information is represented in a given
> dataset:
>
> - In a classic SQL-style relational database we probably have to just
> know the schema, as there's usually no exploratory way to find this
> kind of thing out. The RDBMS formalism doesn't usually represent the
> relationships between tables. You not only have to know it from
> external sources, but you have to restate it in each SQL join-query.
> This may be acceptable in a database with only a few tables, where the
> field-headings are kept consistent by convention, but it's extremely
> problematic when you're trying to combine formerly-separate datasets
> into large ones with multiple dimensions and purposes. If the LOD
> cloud were in relational tables, it would be awful. Arguably the main
> point of the cloud is to get the data out of relational tables (where
> most of it probably originates) into a graph where the connections are
> actually represented instead of implied.
>
> - But even in RDF, directionality poses a significant discovery
> problem. In a minimal graph (let's say "minimal graph" means that each
> relationship is asserted in only one direction, so there's no
> relationship redundancy), you can't actually explore the data
> navigationally. You can't go to a single known point of interest, like
> a given president, and explore to find out everything the data holds
> and how it connects. You can explore the *outward* relationships from
> any given point, but to find out about the *inward* relationships you
> have to keep doing new queries over the entire dataset. The same basic
> issue applies to an XML representation of the data as a tree: you can
> squirrel your way down, but only in the direction the original modeler
> decided was "down". If you need a different direction, you have to
> hire a hypersquirrel.
>
> - Of course, most RDF-presenting systems recognize this as a usability
> problem, and address it by turning the minimal graph into a redundant
> graph for UI purposes. Thus in a data-browser UI you usually see, for
> a given node, lists of both outward and inward relationships. This is
> better, but if this abstraction is done at the UI layer, you still
> lose it once you drop down into the SPARQL realm. This makes the
> SPARQL queries harder to write, because you can't write them the way
> you logically think about the question, you have to write them the way
> the data thinks about the question. And this skew from real logic to
> directional logic can make them *much* harder to understand or
> maintain, because the directionality obscures the purpose and reduces
> the self-documenting nature of the query.
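
To make the outward/inward asymmetry in the points above concrete, here is a rough Python sketch. It assumes a "minimal graph" stored as a subject-keyed index, which is only one possible storage layout. The node and property names (Lincoln, hadPresidency, Presidency16, and so on) are invented for illustration and do not come from any real dataset or vocabulary.

```python
from collections import defaultdict

# Hypothetical minimal graph: each relationship is asserted in one direction only.
TRIPLES = [
    ("Lincoln", "hadPresidency", "Presidency16"),
    ("Presidency16", "startYear", "1861"),
    ("Presidency16", "precededBy", "Presidency15"),
]


def index_by_subject(triples):
    """Group (predicate, object) pairs under their subject, like a node's own record."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
    return index


def outward(index, node):
    """Outward relationships: a single lookup at the node itself."""
    return list(index.get(node, []))


def inward(index, node):
    """Inward relationships: nothing is stored at the node, so scan every subject."""
    return [(s, p) for s, edges in index.items() for p, o in edges if o == node]


index = index_by_subject(TRIPLES)
print(outward(index, "Presidency16"))  # local lookup
print(inward(index, "Presidency16"))   # scan over the whole dataset
```

The point of the sketch is only that `outward` touches one entry while `inward` has to walk the entire index, which mirrors the "new queries over the entire dataset" problem described above.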
>
>
> All of this is *much* better, in usability terms, if the data is
> redundantly, bi-directionally connected all the way down to the level
> of abstraction at which you're working. Now you can explore to figure
> out what's there, and you can write your queries in the way that makes
> the most human sense. The artificial skew between the logical
> structure and the representational structure has been removed. This is
> perfectly possible in an RDF-based system, of course, if the software
> either generates or infers the missing inverses. We incur extra
> machine overhead to reduce the human cognitive burden. I contend this
> should be considered a nearly-mandatory best-practice for linked data,
> and that propagating inverses around the LOD cloud ought to be one of
> the things that makes the LOD cloud *a thing*, rather than just a
> collection of logical silos.
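
As a minimal sketch of what "generating the missing inverses" could look like, assume the inverse pairs are known up front. The `INVERSE_OF` mapping and all property names below are hypothetical. A real system might instead read such pairs from owl:inverseOf declarations or leave the work to a reasoner.

```python
# Hypothetical inverse pairs; in practice these might come from owl:inverseOf statements.
INVERSE_OF = {
    "hadPresidency": "presidencyOf",
    "precededBy": "followedBy",
}

# Hypothetical minimal graph with each relationship asserted in one direction.
TRIPLES = [
    ("Lincoln", "hadPresidency", "Presidency16"),
    ("Presidency16", "startYear", "1861"),  # literal value, no inverse declared
    ("Presidency16", "precededBy", "Presidency15"),
]


def add_inverses(triples, inverse_of):
    """Return the triples plus the inverse of every triple whose property has one."""
    # Make the mapping symmetric so either direction can be the asserted one.
    both = dict(inverse_of)
    both.update({inv: prop for prop, inv in inverse_of.items()})

    result = set(triples)
    for s, p, o in triples:
        if p in both:
            result.add((o, both[p], s))
    return result


redundant = add_inverses(TRIPLES, INVERSE_OF)
for triple in sorted(redundant):
    print(triple)
```

After this step a query can start from either end of a relationship, at the cost of storing the two extra triples. That is the machine-overhead-for-human-convenience trade described above.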

Received on Friday, 15 April 2011 14:54:33 UTC