RE: Question on User Story S33: Normalizing data

Eric,


Are you then simply saying that sometimes there is bad data (perhaps because there are not enough controls in the application producing it, or because it results from a merge of two sources, or whatever) and that running validation is a way to identify such bad data?


If so, yes, totally agree. This is one of the key reasons to run constraint checking and we have many use cases and examples of this.


I am not sure why the story is called “normalizing data patterns” though, as normalizing implies that the data is changed rather than rejected or identified. I would call it “validating data” or maybe “validating data as part of ETL/ELT”.


Is there any significance to the constraint being described as applying to a patient? This actually seems to be about cardinality constraints on the procedure and the procedure result.
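
Something along these lines (a minimal sketch only; ex:procedure and ex:result are placeholder properties, not the story's actual vocabulary) would flag every procedure that does not have exactly one result:

    PREFIX ex: <http://example.org/ns#>

    # Flag procedures that do not have exactly one result.
    SELECT ?procedure (COUNT(?result) AS ?resultCount)
    WHERE {
      ?patient   ex:procedure ?procedure .
      OPTIONAL { ?procedure ex:result ?result . }
    }
    GROUP BY ?procedure
    HAVING (COUNT(?result) != 1)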


Irene


From: ericw3c@gmail.com [mailto:ericw3c@gmail.com] On Behalf Of Eric Prud'hommeaux
Sent: Tuesday, November 25, 2014 1:49 AM
To: Irene Polikoff
Cc: public-data-shapes-wg; Holger Knublauch
Subject: RE: Question on User Story S33: Normalizing data



On Nov 25, 2014 4:20 AM, "Irene Polikoff" <irene@topquadrant.com> wrote:
>
> So, is this about validating that the assessment test with two results is OK
> as long as both results have the same coding term and the same assessor? And
> not about reshaping the data?

In my use case, any data that did not pass the designated shape would be rejected. That includes the broken data in the example.

> Or is it about actually changing the data to collapse the two results into a
> single one along the lines of pre-processing before the validation as Karen
> may be suggesting?

Where mine is about selecting conformant data, Karen's is probably more easily accomplished by selecting non-conformant data, which could then be normalized via various heuristics and rules.
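
A minimal sketch of that selection (ex:result, ex:codingTerm and ex:assessor below are placeholders, not the vocabulary actually used in S33): pick out procedures carrying two distinct results that agree on coding term and assessor, so that a follow-up rule can collapse each pair into one.

    PREFIX ex: <http://example.org/ns#>

    # Candidate duplicates: two distinct result nodes on the same
    # procedure that share a coding term and an assessor.
    SELECT DISTINCT ?procedure ?r1 ?r2
    WHERE {
      ?procedure ex:result ?r1 , ?r2 .
      FILTER (?r1 != ?r2)
      ?r1 ex:codingTerm ?term ; ex:assessor ?assessor .
      ?r2 ex:codingTerm ?term ; ex:assessor ?assessor .
    }

(Each pair comes back in both orders, which is fine for illustration; the conformant subset is simply whatever such a query does not touch.)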

> -----Original Message-----
> From: Eric Prud'hommeaux [mailto:eric@w3.org]
> Sent: Monday, November 24, 2014 7:56 AM
> To: Holger Knublauch
> Cc: public-data-shapes-wg@w3.org
> Subject: Re: Question on User Story S33: Normalizing data
>
> * Holger Knublauch <holger@topquadrant.com> [2014-11-21 09:38+1000]
> > Hi Eric,
> >
> > I have a question on the User Story S33 that you added recently:
> >
> > https://www.w3.org/2014/data-shapes/wiki/User_Stories#S33:_Normalizing
> > _data_patterns_for_simple_query
> >
> > You describe the requirement to normalize data - I guess automatically
> > to drop extra duplicate entries? Could you clarify how this would work
> > in practice: is your assumption that if there are two identical blank
> > nodes (like in your example) then the system could delete one of them?
> > What about cases where the two blank nodes have slight differences -
> > would this also be covered and how? Is this about automatically fixing
> > constraint violations?
>
> This wasn't about repairing the data, merely identifying a conformant
> dataset over which SPARQL queries can be executed without exhaustive error
> checking. The example I provided would be pretty trivial to repair (I edited
> it to clarify that it's a simplification), but there are lots of ways the
> data can be broken and executing rules to normalize that data requires
> serious babysitting, and would generally be decoupled from analysis. Medical
> record custodians are typically risk-averse and researchers are typically
> happy with representative subsets of the data. The same validation can be
> used by the custodians, if they ever decide they'd like to clean up.
>
>
> > Thanks for clarification
> > Holger
> >
> >
>
> --
> -ericP
>
> office: +1.617.599.3509
> mobile: +33.6.80.80.35.59
>
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.
>
> There are subtle nuances encoded in font variation and clever layout which
> can only be seen by printing this message on high-clay paper.
>
>

Received on Tuesday, 25 November 2014 07:37:20 UTC