Re: Take2: 15 Ways to Think About Data Quality (Just for a Start) from Kingsley Idehen on 2011-04-20 (public-lod@w3.org from April 2011)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Wed, 20 Apr 2011 16:13:07 -0400
To: glenn mcdonald <glenn@furia.com>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <4DAF3E53.50409@openlinksw.com>
On 4/15/11 9:47 AM, glenn mcdonald wrote:
> This reminds me to come back to the point about what I initially
> called Directionality, and Dave improved to Modeling Consistency.
>
> Dave is right, I think, that in terms of data quality, it is
> consistency that matters, not directionality. That is, as long as we
> know that a president was involved in a presidency, it doesn't matter
> whether we know that because the president linked to the presidency,
> or the presidency linked to the president. In fact, in a relational
> database the president and the presidency and the link might even be
> in three separate tables. From a data-mathematical perspective, it
> doesn't matter. All of these are ways of expressing the same logical
> construct. We just want it to be done the same way for all
> presidents/presidencies/links.
>
> But although directionality is immaterial for data *quality*, it
> matters quite a bit for the usability of the system in which the data
> reaches people. We know, for example, that in the real world
> presidents have presidencies, and vice versa. But think about what it
> takes to find out whether this information is represented in a given
> dataset:
>
> - In a classic SQL-style relational database we probably have to just
> know the schema, as there's usually no exploratory way to find this
> kind of thing out. The RDBMS formalism doesn't usually represent the
> relationships between tables. You not only have to know it from
> external sources, but you have to restate it in each SQL join-query.
> This may be acceptable in a database with only a few tables, where the
> field-headings are kept consistent by convention, but it's extremely
> problematic when you're trying to combine formerly-separate datasets
> into large ones with multiple dimensions and purposes. If the LOD
> cloud were in relational tables, it would be awful. Arguably the main
> point of the cloud is to get the data out of relational tables (where
> most of it probably originates) into a graph where the connections are
> actually represented instead of implied.


Sorta. There is more to it with Linked Data, though. For instance, the
object IDs resolve to actual object representations via time-tested
de-reference (*) and address-of (&) style operator patterns, using HTTP
URI based Names and HTTP URI based Data Access Addresses (URLs),
respectively.

> - But even in RDF, directionality poses a significant discovery
> problem.

Yes, assuming a single document with RDF content.

> In a minimal graph (let's say "minimal graph" means that each
> relationship is asserted in only one direction, so there's no
> relationship redundancy), you can't actually explore the data
> navigationally. You can't go to a single known point of interest, like
> a given president, and explore to find out everything the data holds
> and how it connects.

Well, this is an aspect of most of the LOD cloud cache demonstrations I
put out. Given a Text Pattern, Entity Label, or URI, place me somewhere
so that I can disambiguate my way to what I seek by navigating across isA
and other relations that constitute the underlying Linked Data graph.

Thus, in our case it could be:

1. Pattern: "Obama"
2. Pattern: "Obama" in the Entity label
3. Actual known ID (URI) for a given Entity.
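
As a rough illustration of those three entry points, here is a minimal
Python sketch over a toy in-memory triple list. Everything in it (the
"ex:" names, the is_literal convention, the function names) is made up
for the example; it is not how the LOD cloud cache or any real store
implements lookup.

```python
# Toy triples (subject, predicate, object). All names are hypothetical;
# strings starting with "ex:" are treated as URIs, anything else as a literal.
TRIPLES = [
    ("ex:Barack_Obama", "rdf:type", "ex:President"),
    ("ex:Barack_Obama", "rdfs:label", "Barack Obama"),
    ("ex:Obama_Fukui", "rdf:type", "ex:City"),
    ("ex:Obama_Fukui", "rdfs:label", "Obama, Fukui"),
    ("ex:White_House", "rdf:type", "ex:Building"),
    ("ex:White_House", "rdfs:label", "White House"),
    ("ex:White_House", "ex:comment", "Residence of President Obama"),
]


def is_literal(value):
    """Assumed convention for this sketch: non-"ex:" strings are literals."""
    return not value.startswith("ex:")


def find_by_text(pattern):
    """Entry point 1: entities with the pattern in any literal value."""
    needle = pattern.lower()
    return sorted({s for s, _, o in TRIPLES
                   if is_literal(o) and needle in o.lower()})


def find_by_label(pattern):
    """Entry point 2: entities with the pattern in their label only."""
    needle = pattern.lower()
    return sorted({s for s, p, o in TRIPLES
                   if p == "rdfs:label" and needle in o.lower()})


def describe(uri):
    """Entry point 3: everything stated about a known entity URI."""
    return [(p, o) for s, p, o in TRIPLES if s == uri]


def types_of(uri):
    """The isA relations a user could disambiguate on."""
    return [o for p, o in describe(uri) if p == "rdf:type"]


if __name__ == "__main__":
    for candidate in find_by_label("Obama"):
        print(candidate, types_of(candidate))
```

The plain text search casts the widest net, the label search narrows
it, and the types of each candidate are what let the user pick the
entity they meant before moving on to describe() with its URI.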

> You can explore the *outward* relationships from
> any given point, but to find out about the *inward* relationships you
> have to keep doing new queries over the entire dataset.

Yes, and not only that, you need to be able to allow the user to page
through the data using scrollable cursoring techniques, an old DBMS
technique for handling voluminous result sets. Thus, you should be able
to go to specific pages or a specific position, and then bookmark said
position for future reference, etc.
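
To make the outward/inward asymmetry and the paging idea concrete, here
is a small Python sketch. It assumes a toy in-memory triple list with
made-up "ex:" names, and it uses list slicing as a stand-in for a real
server-side scrollable cursor, so treat it as an illustration only.

```python
# Toy triples (subject, predicate, object); all names are hypothetical.
TRIPLES = [
    ("ex:Obama_Presidency", "ex:president", "ex:Barack_Obama"),
    ("ex:Election_2008", "ex:winner", "ex:Barack_Obama"),
    ("ex:Dreams_from_My_Father", "ex:author", "ex:Barack_Obama"),
    ("ex:Nobel_Peace_Prize_2009", "ex:laureate", "ex:Barack_Obama"),
    ("ex:Senate_Seat_IL", "ex:formerHolder", "ex:Barack_Obama"),
    ("ex:Barack_Obama", "ex:birthPlace", "ex:Honolulu"),
]


def outward(node):
    """Relationships asserted on the node itself (node is the subject)."""
    return sorted((p, o) for s, p, o in TRIPLES if s == node)


def inward(node):
    """Relationships pointing at the node; needs a scan of the whole set."""
    return sorted((s, p) for s, p, o in TRIPLES if o == node)


def page(rows, position, size):
    """Return one page plus the position to bookmark for the next page.

    The bookmark is None when there is nothing left to read.
    """
    chunk = rows[position:position + size]
    next_position = position + size
    if next_position >= len(rows):
        next_position = None
    return chunk, next_position


if __name__ == "__main__":
    rows = inward("ex:Barack_Obama")
    position = 0
    while position is not None:
        chunk, position = page(rows, position, 2)
        print(chunk, "next:", position)
```

Starting from ex:Barack_Obama, outward() finds only the birth place,
while everything else is visible only through inward(). The position
returned by page() is the kind of value a user could bookmark and come
back to later.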

> The same basic
> issue applies to an XML representation of the data as a tree: you can
> squirrel your way down, but only in the direction the original modeler
> decided was "down". If you need a different direction, you have to
> hire a hypersquirrel.

Yes, but XML is a rooted graph. Thus, XML ingested into a graph store
results in a relational graph. The important thing is the Entity ID
handling post-ingestion.

> - Of course, most RDF-presenting systems recognize this as a usability
> problem, and address it by turning the minimal graph into a redundant
> graph for UI purposes.

Not necessarily redundant when persisted and indexed in a relational 
property graph model DBMS. As per comment above, it ultimately boils 
down to the semantics expressed in the resulting graph. XML data sources 
as foundation for Linked Data graphs is something that underlies our 
sponger middleware and various cartridges. The cartridge effort is where 
the modeling occurs based on schema study and eventual remapping.

> Thus in a data-browser UI you usually see, for
> a given node, lists of both outward and inward relationships. This is
> better, but if this abstraction is done at the UI layer, you still
> lose it once you drop down into the SPARQL realm.

The SPARQL realm should be about producing results for different
consumers. If you are constructing a view for a user where graph position
placement is one of the UX goals, then surfacing the Linked Data URIs in
the result set works fine. Again, it's one of the things I've been
demonstrating since our initial ODE browser and iSPARQL QBE, both of
which date back to 2007. What's newer is a set of interfaces that handle
cursor-based navigation over massive datasets stored in the Virtuoso
DBMS. In a nutshell, the browser won't explode.

>   This makes the
> SPARQL queries harder to write, because you can't write them the way
> you logically think about the question, you have to write them the way
> the data thinks about the question.

Depends on the writer :-)

It's also why we have a SPARQL link in place to show you what's being
generated when you start with text patterns in our faceted navigation UI.

> And this skew from real logic to
> directional logic can make them *much* harder to understand or
> maintain, because the directionality obscures the purpose and reduces
> the self-documenting nature of the query.

Yes.
>
> All of this is *much* better, in usability terms, if the data is
> redundantly, bi-directionally connected all the way down to the level
> of abstraction at which you're working. Now you can explore to figure
> out what's there, and you can write your queries in the way that makes
> the most human sense. The artificial skew between the logical
> structure and the representational structure has been removed. This is
> perfectly possible in an RDF-based system, of course, if the software
> either generates or infers the missing inverses.

Yes, and that's what we do. And it works at massive scale.
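
For readers who want to see the idea in miniature, here is a Python
sketch of generating the missing inverses. The predicate pair in
INVERSE_OF and all "ex:" names are assumptions made for the example,
and the code says nothing about how any particular store does this at
scale.

```python
# Hypothetical inverse pair: a presidency points at its president,
# and the generated inverse lets the president point back.
INVERSE_OF = {"ex:president": "ex:presidency"}

# A "minimal graph": each relationship is asserted in one direction only.
MINIMAL = [
    ("ex:Obama_Presidency", "ex:president", "ex:Barack_Obama"),
    ("ex:Obama_Presidency", "ex:startYear", "2009"),
]


def with_inverses(triples, inverse_of):
    """Return the triples plus a generated inverse for each known pair."""
    # Make the mapping symmetric so either direction can be completed.
    both = dict(inverse_of)
    both.update({v: k for k, v in inverse_of.items()})
    result = set(triples)
    for s, p, o in triples:
        if p in both:
            result.add((o, both[p], s))
    return result


def outward(triples, node):
    """Relationships that can be followed starting from the node."""
    return sorted((p, o) for s, p, o in triples if s == node)


if __name__ == "__main__":
    print(outward(MINIMAL, "ex:Barack_Obama"))
    print(outward(with_inverses(MINIMAL, INVERSE_OF), "ex:Barack_Obama"))
```

In the minimal graph nothing can be followed outward from
ex:Barack_Obama. Once the inverse is generated, the presidency is
reachable from the president, at the cost of one extra stored triple.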

> We incur extra
> machine overhead to reduce the human cognitive burden. I contend this
> should be considered a nearly-mandatory best-practice for linked data,
> and that propagating inverses around the LOD cloud ought to be one of
> the things that makes the LOD cloud *a thing*, rather than just a
> collection of logical silos.

Yes, and that's what we believe too, and have executed on that via the 
LOD cloud cache we maintain.

On a related note, re. data quality matters in general, some excerpts
from a 2009 post about data quality [1]:

“You don’t talk about data quality.”

No, wait—that’s The First Rule of Poor Quality Data.

The First Law of Data Quality:
“Data is either being used or waiting to be used—or wasting storage and 
support.”
Although understanding your data is essential to using it effectively 
and improving its quality, as Thomas Redman explains, “it is a waste of 
effort to improve the quality of data no one ever uses.”


In the context of Linked Data, addressing the essence of the above has
been our focal point from day one. The data has to be out there for
quality issues to surface, albeit subjectively.


Link:

1. http://www.dataroundtable.com/?p=1458

-- 

Regards,

Kingsley Idehen 
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Wednesday, 20 April 2011 20:13:30 UTC