Re: Question on User Story S33: Normalizing data from Eric Prud'hommeaux on 2014-11-24 (public-data-shapes-wg@w3.org from November 2014)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Mon, 24 Nov 2014 07:56:08 -0500
To: Holger Knublauch <holger@topquadrant.com>
Cc: public-data-shapes-wg@w3.org
Message-ID: <20141124125606.GA14062@w3.org>

* Holger Knublauch <holger@topquadrant.com> [2014-11-21 09:38+1000]
> Hi Eric,
> 
> I have a question on the User Story S33 that you added recently:
> 
> https://www.w3.org/2014/data-shapes/wiki/User_Stories#S33:_Normalizing_data_patterns_for_simple_query
> 
> You describe the requirement to normalize data - I guess
> automatically to drop extra duplicate entries? Could you clarify how
> this would work in practice: is your assumption that if there are
> two identical blank nodes (like in your example) then the system
> could delete one of them? What about cases where the two blank nodes
> have slight differences - would this also be covered and how? Is
> this about automatically fixing constraint violations?

This wasn't about repairing the data, merely identifying a conformant
dataset over which SPARQL queries can be executed without exhaustive
error checking. The example I provided would be pretty trivial to
repair (I edited it to clarify that it's a simplification), but there
are lots of ways the data can be broken, and executing rules to
normalize that data requires serious babysitting and would generally
be decoupled from analysis. Medical record custodians are typically
risk-averse, and researchers are typically happy with representative
subsets of the data. The same validation can be used by the custodians
if they ever decide they'd like to clean up.
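
As a rough sketch of that distinction: validation only partitions the
data into conformant and nonconformant records and repairs nothing.
The record layout, the property names, and the "exactly one value"
rule below are assumptions made up for illustration; they are not the
S33 example or any agreed shape syntax.

```python
# Hypothetical records: each maps a property name to a list of values.
# The "shape" (exactly one 'code' and exactly one 'date' per record) is
# an assumption for illustration only.
REQUIRED_EXACTLY_ONE = ("code", "date")


def conforms(record):
    """Return True if every required property has exactly one value."""
    return all(len(record.get(prop, [])) == 1 for prop in REQUIRED_EXACTLY_ONE)


def partition(records):
    """Split records into (conformant, nonconformant) without changing any."""
    conformant = [r for r in records if conforms(r)]
    nonconformant = [r for r in records if not conforms(r)]
    return conformant, nonconformant


records = [
    {"id": ["obs1"], "code": ["bp"], "date": ["2014-11-01"]},
    {"id": ["obs2"], "code": ["bp", "bp"], "date": ["2014-11-02"]},  # duplicated value
    {"id": ["obs3"], "code": ["hr"]},  # missing date
]

# Researchers would query only `good`; custodians could review `bad`
# later if they decide to clean up.
good, bad = partition(records)
```

Queries then run over the conformant subset only, so they need no
defensive checks for duplicated or missing values.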


> Thanks for clarification
> Holger
> 
> 

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.

Received on Monday, 24 November 2014 12:56:15 UTC