Re: 15 Ways to Think About Data Quality (Just for a Start)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Tue, 12 Apr 2011 15:37:10 -0400
Message-ID: <4DA4A9E6.7090306@openlinksw.com>
To: glenn mcdonald <glenn@furia.com>
CC: "public-lod@w3.org" <public-lod@w3.org>
On 4/12/11 3:02 PM, glenn mcdonald wrote:
>>     So yes, I think you should feel a little embarrassed about
>>     broadcasting links to a demo in which the very first piece of
>>     data one sees is obviously wrong.
>     To you the first piece of that is an owl:sameAs assertion. That's
>     100% fine for you, but that isn't true for everyone else. It just
>     isn't.
> Why, is the page dynamically reconfigured for other people?

As per my latest post. It's just a point of view. You are now talking 
about UI aesthetics rather than data quality. The presentation layer is 
just that, a presentation layer. The Data layer is just that, a Data Layer.

> I'm not saying "first" in some mushy philosophical sense, I'm talking 
> about the first attribute that appears in the structured-data section 
> of the page, right under the headings "Attributes" and "Values".

Out of 21 Billion+ records, why should the page order by perceived 
quality of assertion in an owl:sameAs relation? Why? Because it might 
bug you? Is there an inherent semantic in Links that implies:

1. Thou must click
2. Thou must click and infer
3. Thou must infer?

Moreover, the issue with OpenCyc links to and from DBpedia (not 
performed by me or anyone at OpenLink Software)  is something that is 
going to be resolved when OpenCyc release a new linkset.

There's absolutely nothing wrong with a page that immediately brings to 
attention misuse or dangerous use of owl:sameAs. You (as a cognitively 
endowed being) see the page in one context, and that's fine. But others will 
also look at the same page and see things differently. This is the very 
basis of cognition. We are wired to see things differently. IMHO a 
clever feature inherited from our universe. Imagine if we could only 
observe the same limited dimensions of an observation subject?

The presentation in the page != a position on how I feel about data 
quality. It is just a presentation of data that's loosely coupled to 
its data sources. You can even take the source code of the page and 
tweak it for your specific needs if you like. That's what this is 
supposed to be about.

I could start to understand your viewpoint if my presentation, data 
sources, etc. were imposed on you. That simply isn't the case, and 
that's 100% antithetical to the concept of Linked Data that I am 
particularly excited about, i.e., the loose coupling of knowledge, 
information, and data that inherently facilitates free remixing and 
sharing of: data sources, queries, and presentation pages.

>>     You've got billions of entities in dbpedia, and the technology
>>     doesn't care which one you pick, so surely you could pick one
>>     where the errors aren't as prominent.
>     No, DBpedia doesn't have a billions of entities, that just one
>     dataset.
> What? Whatever: you've got plenty of other entities, so surely you 
> could pick one where the errors aren't as prominent. Here, for 
> example, is the next one I tried:
> http://lod.openlinksw.com/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FTori_Amos

Again, I pick examples like 'Michael Jackson' because, like 'New York', 
'Paris' etc., my focal point is/was: use of entity type and other 
attributes as a mechanism for disambiguating my quests for information 
about a specific entity, at massive scales. The aforementioned entity 
examples ultimately accentuate the challenge at hand.
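
To make the disambiguation point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the records, the ex: URIs, the "country" attribute, and the find_entities helper are made up, and none of it is DBpedia data or the actual query behind the describe page.

```python
# Hypothetical records; labels, types, and attributes are illustrative only.
ENTITIES = [
    {"uri": "ex:Paris_France", "label": "Paris", "type": "City", "country": "France"},
    {"uri": "ex:Paris_Texas", "label": "Paris", "type": "City", "country": "USA"},
    {"uri": "ex:Paris_Hilton", "label": "Paris", "type": "Person"},
    {"uri": "ex:New_York_City", "label": "New York", "type": "City", "country": "USA"},
    {"uri": "ex:New_York_State", "label": "New York", "type": "State", "country": "USA"},
]


def find_entities(label, entity_type=None, **attributes):
    """Return URIs whose label matches, narrowed by type and other attributes."""
    matches = []
    for entity in ENTITIES:
        if entity["label"] != label:
            continue
        # An entity type, when given, removes candidates of other types.
        if entity_type is not None and entity["type"] != entity_type:
            continue
        # Every extra attribute must be present on the record and equal.
        if any(entity.get(key) != value for key, value in attributes.items()):
            continue
        matches.append(entity["uri"])
    return matches


ambiguous = find_entities("Paris")
by_type = find_entities("Paris", "Person")
by_type_and_attribute = find_entities("Paris", "City", country="France")
```

A bare label matches several records; adding the type, and then one more attribute, narrows the candidates down to a single entity.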

I won't drop triples in the OpenCyc Named Graph simply because of a few 
questionable relations potentially upsetting a few observers. I am more 
interested in real demos, and that means bad or questionable data warts 
are part of the package. Exercises like this have triggered many a 
dataset fix in LOD land. You'd be quite surprised (bearing in mind your 
perception of my data quality values) how many dataset producers I've 
worked with re. data fixes across the ABox and TBox realms.

> There are some dubious bits to this, too (she only "composed" one 
> song?** a person is "subsequent work" of a song?***), but at least 
> this is a page about a person that appears to be about a single 
> person. Same technology, better "demo".

No, your demo of the same technology. That's a better characterization. 
Again, the inherent tone of your commentary continues to echo a 
contentious problem: you can always speak for yourself, just don't speak 
for me. We are individuals (in a ! owl:sameAs relation).

>     In due course you will understand my point.
> Understood your points the first hundred times you stated them. Any 
> time you'd like to take a turn understanding mine, feel free.

Open the door first i.e., stop telling me about myself.

We can have a conversation, we've had many in the past. All you have to 
do is open the door.

>     You characterization is 100% inaccurate.
> In the context of your insistence on the subjectivity of everything, I 
> assume this is intended as a joke. Funnier without the typo.
> **Completeness failure
> ***Modeling Correctness error

Yes, LOL re. typo too.

Here's an excuse (as you would perceive it): I have a wonky "R" key and I 
kinda type very fast cos I multitask 99.99% of the time. Pass my typos 
through a character analyzer to see what I mean re. prevalence of 
missing "R". My keypad actually accepts my hits but doesn't always 
invoke the production of an actual "R", funny but true!



Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Tuesday, 12 April 2011 19:37:34 UTC
