RE: Question on User Story S33: Normalizing data from Irene Polikoff on 2014-11-25 (public-data-shapes-wg@w3.org from November 2014)

From: Irene Polikoff <irene@topquadrant.com>
Date: Mon, 24 Nov 2014 22:19:48 -0500
To: "'Eric Prud'hommeaux'" <eric@w3.org>, "'Holger Knublauch'" <holger@topquadrant.com>
Cc: <public-data-shapes-wg@w3.org>
Message-ID: <24d201d0085e$abb50660$031f1320$@topquadrant.com>

So, is this about validating that the assessment test with two results is OK
as long as both results have the same coding term and the same assessor? And
not about reshaping the data?

Or is it about actually changing the data to collapse the two results into a
single one along the lines of pre-processing before the validation as Karen
may be suggesting?

-----Original Message-----
From: Eric Prud'hommeaux [mailto:eric@w3.org] 
Sent: Monday, November 24, 2014 7:56 AM
To: Holger Knublauch
Cc: public-data-shapes-wg@w3.org
Subject: Re: Question on User Story S33: Normalizing data

* Holger Knublauch <holger@topquadrant.com> [2014-11-21 09:38+1000]
> Hi Eric,
> 
> I have a question on the User Story S33 that you added recently:
> 
> https://www.w3.org/2014/data-shapes/wiki/User_Stories#S33:_Normalizing_data_patterns_for_simple_query
> 
> You describe the requirement to normalize data - I guess automatically 
> to drop extra duplicate entries? Could you clarify how this would work 
> in practice: is your assumption that if there are two identical blank 
> nodes (like in your example) then the system could delete one of them? 
> What about cases where the two blank nodes have slight differences - 
> would this also be covered and how? Is this about automatically fixing 
> constraint violations?

This wasn't about repairing the data, merely identifying a conformant
dataset over which SPARQL queries can be executed without exhaustive error
checking. The example I provided would be pretty trivial to repair (I edited
it to clarify that it's a simplification), but there are lots of ways the
data can be broken and executing rules to normalize that data requires
serious babysitting, and would generally be decoupled from analysis. Medical
record custodians are typically risk-averse and researchers are typically
happy with representative subsets of the data. The same validation can be
used by the custodians, if they ever decide they'd like to clean up.
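
As a rough illustration of selecting a conformant subset rather than
repairing anything, here is a minimal Python sketch. The record layout
(plain dicts with "results", "coding_term" and "assessor" keys) and the
function names are assumptions made for illustration only; they are not
part of the user story, and a real system would more likely express the
check as a shape or constraint over the RDF data itself. The rule shown
(all results of a test agree on coding term and assessor) is just one
possible reading of the example.

```python
from typing import Dict, Iterable, List

# Hypothetical sample data: t1 has two matching results, t2 has a conflict.
tests = [
    {"id": "t1", "results": [
        {"coding_term": "C1", "assessor": "A"},
        {"coding_term": "C1", "assessor": "A"},
    ]},
    {"id": "t2", "results": [
        {"coding_term": "C1", "assessor": "A"},
        {"coding_term": "C2", "assessor": "A"},
    ]},
]


def is_conformant(test: Dict) -> bool:
    """True if the test has at least one result and all of its results
    share the same (coding term, assessor) pair."""
    pairs = {(r["coding_term"], r["assessor"]) for r in test["results"]}
    return len(pairs) == 1


def conformant_subset(all_tests: Iterable[Dict]) -> List[Dict]:
    """Keep only the conformant tests; the input is never modified."""
    return [t for t in all_tests if is_conformant(t)]


# Queries would then run over this subset without per-record error checks.
subset = conformant_subset(tests)
```

Non-conformant records are simply left out, so the custodians' data stays
untouched, and the same check could later be used to list what needs
cleaning up.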


> Thanks for clarification
> Holger
> 
> 

--
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout which
can only be seen by printing this message on high-clay paper.

Received on Tuesday, 25 November 2014 03:20:26 UTC