RE: Question on User Story S33: Normalizing data

On Nov 25, 2014 4:20 AM, "Irene Polikoff" <irene@topquadrant.com> wrote:
>
> So, is this about validating that the assessment test with two results is OK
> as long as both results have the same coding term and the same assessor? And
> not about reshaping the data?

In my use case, any data that did not pass the designated shape would be
rejected. That includes the broken data in the example.
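
A minimal sketch of what I mean, using a hypothetical ex: vocabulary rather
than the actual S33 terms: a query that keeps only the assessments carrying
exactly one result, so downstream analysis never sees the broken entries.

    PREFIX ex: <http://example.org/ns#>

    SELECT ?assessment ?result
    WHERE {
      ?assessment a ex:AssessmentTest ;
                  ex:hasResult ?result .
      # drop any assessment that also has a second, distinct result node
      FILTER NOT EXISTS {
        ?assessment ex:hasResult ?other .
        FILTER (?other != ?result)
      }
    }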

> Or is it about actually changing the data to collapse the two results into a
> single one along the lines of pre-processing before the validation as Karen
> may be suggesting?

Where mine is about selecting conformant data, Karen's is probably more
easily accomplished by selecting non-conformant data, which could then be
normalized via various heuristics and rules.
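
A sketch of the complementary query, again over a hypothetical ex: vocabulary:
it pulls out the assessments with more than one result, which is exactly the
set a normalization rule would have to babysit.

    PREFIX ex: <http://example.org/ns#>

    SELECT DISTINCT ?assessment
    WHERE {
      ?assessment a ex:AssessmentTest ;
                  ex:hasResult ?r1 , ?r2 .
      # two distinct result nodes means this assessment needs normalizing
      FILTER (?r1 != ?r2)
    }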

> -----Original Message-----
> From: Eric Prud'hommeaux [mailto:eric@w3.org]
> Sent: Monday, November 24, 2014 7:56 AM
> To: Holger Knublauch
> Cc: public-data-shapes-wg@w3.org
> Subject: Re: Question on User Story S33: Normalizing data
>
> * Holger Knublauch <holger@topquadrant.com> [2014-11-21 09:38+1000]
> > Hi Eric,
> >
> > I have a question on the User Story S33 that you added recently:
> >
> > https://www.w3.org/2014/data-shapes/wiki/User_Stories#S33:_Normalizing_data_patterns_for_simple_query
> >
> > You describe the requirement to normalize data - I guess automatically
> > to drop extra duplicate entries? Could you clarify how this would work
> > in practice: is your assumption that if there are two identical blank
> > nodes (like in your example) then the system could delete one of them?
> > What about cases where the two blank nodes have slight differences -
> > would this also be covered and how? Is this about automatically fixing
> > constraint violations?
>
> This wasn't about repairing the data, merely identifying a conformant
> dataset over which SPARQL queries can be executed without exhaustive error
> checking. The example I provided would be pretty trivial to repair (I edited
> it to clarify that it's a simplification), but there are lots of ways the
> data can be broken and executing rules to normalize that data requires
> serious babysitting, and would generally be decoupled from analysis. Medical
> record custodians are typically risk-averse and researchers are typically
> happy with representative subsets of the data. The same validation can be
> used by the custodians, if they ever decide they'd like to clean up.
>
>
> > Thanks for clarification
> > Holger
> >
> >
>
> --
> -ericP
>
> office: +1.617.599.3509
> mobile: +33.6.80.80.35.59
>
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.
>
> There are subtle nuances encoded in font variation and clever layout which
> can only be seen by printing this message on high-clay paper.
>
>
