RE: Question on User Story S33: Normalizing data

* Irene Polikoff <irene@topquadrant.com> [2014-11-25 02:36-0500]
> Eric,
> 
> Are you then simply saying that sometimes there is bad data – maybe created because there are not enough controls in the application producing it, or maybe resulting from a merge of two sources, or whatever – and running validation is a way to identify such bad data?

sort of, though in my case, I'm identifying good data.


> If so, yes, totally agree. This is one of the key reasons to run constraint checking and we have many use cases and examples of this.

I'm happy to withdraw the use case. The story it tells is about how, with validation, one can confidently write simple analytical queries; without validation, the intuitive queries will consume invalid data, which will skew the results. Is there another story which covers that?
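
To make that concrete, here's a minimal sketch with made-up vocabulary (the real data uses clinical terminologies); the intuitive analytical query is just:

  PREFIX ex: <http://example.org/>
  SELECT ?patient (COUNT(?result) AS ?results)
  WHERE {
    ?patient ex:procedure ?proc .
    ?proc    ex:result    ?result .
  }
  GROUP BY ?patient

If a procedure carries duplicate result nodes, that count is silently inflated; once the input has been validated against a shape guaranteeing exactly one result per procedure, the query can stay this simple instead of being wrapped in defensive filters.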


> I am not sure why the story is called “normalizing data patterns” though, as normalizing implies that the data is changed as opposed to rejected or identified. I would call it “validating data” or maybe “validating data as part of ETL/ELT”.

Fair enough; "normalizing" implies action taken on the abnormal data. I've changed it to "Structural validation for queriability". Does that seem reasonable?


> Is there some significance in that the constraint is described as being for a patient? This actually seems to be about cardinality constraints for the procedure and procedure result.

Basing it in clinical data doesn't produce unique structural requirements, but it is an environment where there is plenty of invalid data and there are legal reasons not to change it. It's also, IMO, a huge market for RDF; my main reason for organizing the validation workshop 14 months ago was to make sure that RDF would be a viable technology for clinical data by the time we really got the attention of clinical informaticists.
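
For what it's worth, the cardinality check you describe can be approximated in plain SPARQL (again with made-up property names); this is roughly how one would flag the non-conformant procedures today:

  PREFIX ex: <http://example.org/>
  SELECT ?proc (COUNT(?result) AS ?results)
  WHERE {
    ?proc ex:result ?result .
  }
  GROUP BY ?proc
  HAVING (COUNT(?result) != 1)

A shapes language would state the same requirement (e.g. exactly one result per procedure) declaratively, so the conformant subset can be selected without hand-writing such checks.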


> Irene
> 
>  
> 
> From: ericw3c@gmail.com [mailto:ericw3c@gmail.com] On Behalf Of Eric Prud'hommeaux
> Sent: Tuesday, November 25, 2014 1:49 AM
> To: Irene Polikoff
> Cc: public-data-shapes-wg; Holger Knublauch
> Subject: RE: Question on User Story S33: Normalizing data
> 
>  
> 
> 
> On Nov 25, 2014 4:20 AM, "Irene Polikoff" <irene@topquadrant.com> wrote:
> >
> > So, is this about validating that the assessment test with two results is OK
> > as long as both results have the same coding term and the same assessor? And
> > not about reshaping the data?
> 
> In my use case, any data that did not pass the designated shape would be rejected. That includes the example broken data.
> 
> > Or is it about actually changing the data to collapse the two results into a
> > single one along the lines of pre-processing before the validation as Karen
> > may be suggesting?
> 
> Where mine is about selecting conformant data, Karen's is probably more easily accomplished by selecting non-conformant data, which could then be normalized via various heuristics and rules.
> 
> > -----Original Message-----
> > From: Eric Prud'hommeaux [mailto:eric@w3.org]
> > Sent: Monday, November 24, 2014 7:56 AM
> > To: Holger Knublauch
> > Cc: public-data-shapes-wg@w3.org
> > Subject: Re: Question on User Story S33: Normalizing data
> >
> > * Holger Knublauch <holger@topquadrant.com> [2014-11-21 09:38+1000]
> > > Hi Eric,
> > >
> > > I have a question on the User Story S33 that you added recently:
> > >
> > > https://www.w3.org/2014/data-shapes/wiki/User_Stories#S33:_Normalizing
> > > _data_patterns_for_simple_query
> > >
> > > You describe the requirement to normalize data - I guess automatically
> > > to drop extra duplicate entries? Could you clarify how this would work
> > > in practice: is your assumption that if there are two identical blank
> > > nodes (like in your example) then the system could delete one of them?
> > > What about cases where the two blank nodes have slight differences -
> > > would this also be covered and how? Is this about automatically fixing
> > > constraint violations?
> >
> > This wasn't about repairing the data, merely identifying a conformant
> > dataset over which SPARQL queries can be executed without exhaustive error
> > checking. The example I provided would be pretty trivial to repair (I edited
> > it to clarify that it's a simplification), but there are lots of ways the
> > data can be broken and executing rules to normalize that data requires
> > serious babysitting, and would generally be decoupled from analysis. Medical
> > record custodians are typically risk-averse and researchers are typically
> > happy with representative subsets of the data. The same validation can be
> > used by the custodians, if they ever decide they'd like to clean up.
> >
> >
> > > Thanks for clarification
> > > Holger
> > >
> > >
> >
> > --
> > -ericP
> >
> > office: +1.617.599.3509
> > mobile: +33.6.80.80.35.59
> >
> > (eric@w3.org)
> > Feel free to forward this message to any list for any purpose other than
> > email address distribution.
> >
> > There are subtle nuances encoded in font variation and clever layout which
> > can only be seen by printing this message on high-clay paper.
> >
> >
> 

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.

Received on Tuesday, 25 November 2014 08:20:01 UTC