RE: AW: Thoughts on validation requirements

Hi Eric,

<snip> Why do we need to attach it to a type? Wouldn't that mean that every
reusable object would have to have a bunch of types attempting to predict
all of the ways that data might be used? For instance, would the admitting
physician need to have type arcs asserting that he/she was a
bethIsreal:SurgicalPhysician, bethIsreal:EDAdmittingPhysician,
BOSchildrens:Surgeon, mgh:ThoracicSurgeon, mgh:AdmittingPhysician?

I'd expect that the physician's record should only advertise the type arcs
that are part of some shared ontology:
  <Pat> a foaf:Person , clin:Physician .
If the type arcs are only notionally attached to the data for the purposes
of verification, then the argument that they need to be types is circular;
they're only there because some verification system thinks in terms of
types. </snip>

I expect that every organization would have its own data validation
constraints. So, those for Boston Children's will be different from those
for Beth Israel. Depending on the receiving application(s), they are also
likely to have different constraints even for the same classes or for
subclasses of the same canonical class. Managing these as an ontology or a
set of ontologies/RDF graphs has very important advantages. Data validation
rules are part of the overall enterprise data governance/master data
management space. In each enterprise, these can get quite complex, requiring
a way to manage the definitions of the constraints, query for them, see
which one is used for which system, how they connect and relate to each
other, etc.

Organizing principles and systems become very important, and classes in the
ontology models provide anchoring points. Without something like this, it
doesn't scale and, in the presence of any complexity and of evolving
requirements, is likely to become a mess of disconnected files that are very
difficult and expensive to maintain. Luckily, RDF provides us with a very
flexible foundation for such management tasks. Constraints intended to be
used with different systems can be contained within different named graphs,
allowing them to be used individually when needed and as a cohesive body of
definitions when needed.
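
As a rough sketch of what I mean (the ex: vocabulary and the namespace URIs
below are made up purely for illustration), constraint definitions for two
different systems could live in two named graphs, e.g. in TriG:

  @prefix ex:   <http://example.org/constraints#> .
  @prefix clin: <http://example.org/clin#> .

  # constraints used by Beth Israel's admissions system
  ex:bethIsraelAdmissionsConstraints {
    ex:PhysicianLicenseConstraint
        ex:appliesToClass clin:Physician ;
        ex:onProperty     clin:credential ;
        ex:minCount       1 .
  }

  # constraints used by Boston Children's surgical system
  ex:bosChildrensSurgicalConstraints {
    ex:SurgeonBoardCertConstraint
        ex:appliesToClass clin:Physician ;
        ex:onProperty     clin:boardCertification ;
        ex:minCount       1 .
  }

Each graph can be loaded on its own by the system it belongs to, or the
graphs can be queried together when someone needs the overall picture of
which constraint applies where.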

Within such an architecture, Beth Israel, for example, could use a canonical
class such as clin:Physician to attach constraints to, if that is how they
view, manage, define, and store their data, or they could create their own
Physician class, or multiple different classes, depending on the needs of
their different systems. They could hang validations that are common across
all systems (or multiple systems) higher up in the class hierarchy and keep
unique validations lower in the hierarchy, etc.
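
For example (the bi: namespace, the class names and the bi:constrainedBy
property are invented here just to illustrate the idea), Beth Israel might
arrange its hierarchy like this:

  @prefix bi:   <http://example.org/bethIsrael#> .
  @prefix clin: <http://example.org/clin#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

  # Beth Israel's own classes, hung under the canonical class
  bi:Physician            rdfs:subClassOf clin:Physician .
  bi:EDAdmittingPhysician rdfs:subClassOf bi:Physician .

  # a validation shared by all of their systems attaches higher up ...
  bi:Physician            bi:constrainedBy bi:LicenseRequired .
  # ... while a system-specific validation attaches lower down
  bi:EDAdmittingPhysician bi:constrainedBy bi:EDPrivilegesRequired .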

When the input data comes into the system, it is often not in RDF at all. It
may be in XML, JSON, spreadsheets or whatever. The first step is then to
auto-convert it to RDF. As part of this process, type triples get assigned
according to what the system that consumes this data, and needs to have it
validated, is expecting. Then, the validations appropriate to the types can
be performed. If the data comes in as RDF and the internal systems that
receive it don't understand clin:Physician, then the first step is still to
transform it into the terms they do understand and then apply the
validations appropriate to those.
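
In other words (again with invented names), after conversion the data that a
Beth Israel system actually validates might look something like:

  @prefix bi:   <http://example.org/bethIsrael#> .
  @prefix clin: <http://example.org/clin#> .

  # the type triple expected by the receiving system is asserted during
  # conversion, so the right validations can be selected for the node
  <http://example.org/staff/pat>
      a                bi:EDAdmittingPhysician ;
      clin:credential  [ clin:authority "AMA" ] .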

This is how the systems that we at TQ and our customers have been
implementing operate. Many of these required complex data validations. We
can report that this approach has proven in practice to be highly effective,
maintainable and scalable, both in general and, especially, compared to the
alternatives.

Regards,

Irene

-----Original Message-----
From: Eric Prud'hommeaux [mailto:eric@w3.org] 
Sent: Wednesday, July 30, 2014 10:57 AM
To: Peter F. Patel-Schneider
Cc: public-rdf-shapes@w3.org; Dimitris Kontokostas; Bosch, Thomas
Subject: Re: AW: Thoughts on validation requirements

* Peter F. Patel-Schneider <pfpschneider@gmail.com> [2014-07-29 08:01-0700]
> On 07/29/2014 03:43 AM, Eric Prud'hommeaux wrote:
> >* Peter F. Patel-Schneider <pfpschneider@gmail.com> [2014-07-28 
> >07:54-0700]
> >>On 07/28/2014 02:20 AM, Eric Prud'hommeaux wrote:
> >>>On Jul 28, 2014 12:08 AM, "Peter F. Patel-Schneider" 
> >>><pfpschneider@gmail.com>
> >>>wrote:
> 
> [...]
> 
> >>An RDF document, on the other hand, almost invariably contains 
> >>multiple somethings, very often not arranged in a tree, and 
> >>sometimes even without any connection between them.  In RDF it is 
> >>generally permissible to have any sort of information, whereas XML 
> >>information is generally required to fit into what is expected.
> >
> >I agree, but fear this is a sort of selection bias.
> 
> Well obviously there is a bias towards using RDF for multiple 
> somethings, because RDF is good at that and other formats are not.
> Because of this virtuous bias, there is the concomitant bias that 
> there is relatively less RDF that is used for single somethings.
> There is, of course, nothing wrong with this so far.
> 
> It may be that because RDF is good for multiple somethings, some 
> people think that it is not good for single somethings.  If so, this 
> would be somewhat unfortunate.

Agreed, and that's probably a point that will require constant reminders,
though the cases I'm referring to use multiple somethings, see "Linked Data
Basic Profile 1.0 - Use Cases and Requirements".
<http://www.w3.org/Submission/2012/SUBM-ldbpucr-20120326/#usecases>
Below, consider a HospitalTransferRecord from Clinic A to Clinic B.
This would incorporate a bunch of somethings like a target problem, vitals,
prescriptions, and a patient (well, more rigorously just a person
temporarily acting in the role of patient).
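
Roughly (the predicates below are invented for the sake of the example),
such a record would pull together something like:

  @prefix ex:   <http://clinicA.example/> .
  @prefix clin: <http://example.org/clin#> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .

  ex:transfer123
      a                  clin:HospitalTransferRecord ;
      clin:targetProblem ex:problem7 ;
      clin:vitals        ex:vitals20140730 ;
      clin:prescription  ex:rx42 , ex:rx43 ;
      clin:patient       ex:pat .   # a person acting in the role of patient

  ex:pat a foaf:Person .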


> However, this certainly doesn't mean that RDF validation should ignore 
> the common situation of multiple somethings, most or all with explicit 
> types.  Nor does it mean that RDF validation should be targeted 
> towards single untyped somethings.  To do either of these is to ignore 
> RDF's strengths.

I see the multiple somethings as a strong case for detaching the shape (the
way that a particular app is using these types) from the types themselves.
Even if Clinic A and Clinic B are in the same clinical network, they'll
capture different information about e.g. the admitting physician's
credentials. In OWL, one would probably capture these as anonymous
restrictions, e.g. ClinicB:AdmissionRecord:

  Class: ClinicB:AdmissionRecord
    SubClassOf: 
      clin:AdmissionRecord,
      clin:admitter only 
        ((clin:credential some (clin:authority only ({"AMA" , "GMC"})))
         and (clin:credential min 1 owl:Thing))


> So I remain very skeptical that ShEx is a viable start towards RDF 
> validation, as it appears to me to be targeted towards an uncommon use 
> of RDF and not easily extended to nicely cover the bulk of extant and 
> proposed RDF.
> 
> >Perhaps the
> >majority of LDP uses include a backend which is not a triple store 
> >(possibly SQL, possibly state stored in the position of a lightswitch 
> >on a wall). In these cases, the data one posts must be limited to the 
> >exact arrangement of somethings that the server expects or data will 
> >be (silently) dropped. I suspect that the majority of the business 
> >use cases on the horizon for RDF involve services that are not 
> >willing to store arbitrary triples.
> 
> Even if true this is at best an argument for validation that covers 
> all (local) triples.  It still doesn't get one from multiple 
> somethings to single somethings.  I'm also still skeptical that 
> covering all (local) triples is a good idea even here, as it would 
> prohibit, for example, extra information coming from a node belonging 
> to an unexpected (or maybe even expected) subtype.
> 
> >>Validation then should work differently in RDF than in XML.  My view 
> >>of RDF validation is determining whether the instances of a type 
> >>(not necessarily explicitly signalled by an rdf:type link) meet some 
> >>constraint, and that RDF validation generally involves multiple 
> >>types, often unrelated types.  I don't see how ShEx can do this, and 
> >>thus my questions as to how ShEx can do RDF validation.
> >
> >What if shapes were types? I think that would meet your definition.
> 
> Well, that's the method used in Stardog ICV, and in lots of work on 
> constraints over logical formalisms (including description logics).


I don't see ShEx as having a problem with multiple somethings. The ShExC
for the above ClinicB:AdmissionRecord could set licensing requirements on
the admitting physician and coding requirements on the principle complaint:

  ClinicB:AdmissionRecord {
    clin:admitter {
      clin:credential { clin:authority ("AMA" | "GMC") }+
    }
    clin:principleComplaint {
      hl7:coding { hl7:CD.CodingSystem ("SNOMED" | "LOINC") }
    }+
  }
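
Data along these lines (the URIs and the hl7: namespace are placeholders)
would then satisfy that shape:

  @prefix clin: <http://example.org/clin#> .
  @prefix hl7:  <http://example.org/hl7#> .

  <admission1>
      clin:admitter [
          clin:credential [ clin:authority "AMA" ]
      ] ;
      clin:principleComplaint [
          hl7:coding [ hl7:CD.CodingSystem "SNOMED" ]
      ] .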


> However, just making shapes be types doesn't immediately get one from 
> ShEx to something that can nicely handle multiple somethings in RDF.  
> One also needs machinery to require that each instance of a particular 
> type must match a particular constraint type.

Why do we need to attach it to a type? Wouldn't that mean that every
reusable object would have to have a bunch of types attempting to predict
all of the ways that data might be used? For instance, would the admitting
physician need to have type arcs asserting that he/she was a
bethIsreal:SurgicalPhysician, bethIsreal:EDAdmittingPhysician,
BOSchildrens:Surgeon, mgh:ThoracicSurgeon, mgh:AdmittingPhysician?

I'd expect that the physician's record should only advertise the type arcs
that are part of some shared ontology:
  <Pat> a foaf:Person , clin:Physician .
If the type arcs are only notionally attached to the data for the purposes
of verification, then the argument that they need to be types is circular;
they're only there because some verification system thinks in terms of
types.


> >There's some language (ShEx, Resource Shapes, Description Set 
> >Profiles or something else whose name I can't recall) to verify that 
> >a node in an instance graph matches a declared structure in a schema. 
> >Some mechanism like oslc:resourceShape associates a graph node with 
> >that structure. Does that fit your view?
> 
> Maybe.  I'm not sure how Resource Shapes 2.0 works, as the description 
> is very loose.  It does appear that typed shapes are what is intended 
> to be used for what I think of as the usual case of RDF validation - 
> requiring that instances of a class have a particular shape.  However, 
> some aspects of Resource Shapes 2.0 appear to be inimical to type 
> hierarchies.

It seems like predicates like oslc:resourceShape give us the duck typing
that we need to get practical interoperability out of our reusable
somethings.
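
E.g., a rough sketch (clinicB.example and the shape URI are placeholders):

  @prefix oslc:    <http://open-services.net/ns/core#> .
  @prefix clinicB: <http://clinicB.example/shapes#> .

  # the application (or its container) associates the node with the shape
  # it expects, without minting new rdf:type arcs for every consumer
  <admission1> oslc:resourceShape clinicB:AdmissionRecordShape .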


> peter

--
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout which
can only be seen by printing this message on high-clay paper.
