Re: Addressing ISSUE-47 (invalid and relative IRIs)

Richard,

I agree with David's points. We have similar concerns:
the data is too big and too dynamic to expect full static validation of 
an R2RML mapping against the data.
A run-time error is acceptable given these constraints.
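
To make the run-time error option concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration only: the names check_iri, map_rows and InvalidIRIError are made up, the character check is a simplified stand-in for full RFC 3987 validation, and none of this is taken from the R2RML draft. It only shows the behavior we are asking for: a row that yields an invalid or relative IRI stops processing with an error instead of being skipped.

```python
import re

# Simplified check (assumption): characters that may not appear unescaped
# in an IRI, plus a test that the value is absolute (starts with a scheme).
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|\\^`]')
_SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*:')


class InvalidIRIError(ValueError):
    """Hypothetical error raised when a row yields an unusable IRI."""


def check_iri(value):
    """Return value unchanged if it looks like an absolute IRI, else raise."""
    if _FORBIDDEN.search(value) or not _SCHEME.match(value):
        raise InvalidIRIError(f"invalid or relative IRI: {value!r}")
    return value


def map_rows(values):
    """Yield one IRI per input value; never drop a value silently."""
    for value in values:
        yield check_iri(value)
```

Because map_rows is a generator, the error surfaces at the offending row while the results are being consumed, so no up-front pass over the whole table is needed.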

Let us continue this discussion by email and, if time permits, on the 
telecon as well.

Thanks,
- Souri.

David McNeil wrote:
> Richard- It seems to me that we must have different usage scenarios in 
> mind. The real use cases I am dealing with have these properties:
>
> * the database being mapped is too large to check every row for 
> validity in batch mode
> * the data is constantly changing so even if we did check it in batch 
> mode, we would need to start again as soon as we were done
> * "the data" is not even available at mapping development time, so it 
> is not possible for the mapping to be tested with all of the data
> * no data can be dropped. It is better to throw an error than to 
> produce incomplete (i.e. wrong) results.
>
> From my perspective these requirements are very typical of a 
> transaction processing application built on a database. Say you have a 
> web app that allows users to make travel arrangements. How could you 
> ever know that the application produced correct results without 
> checking it on every possible set of data values that users could 
> enter? It makes little sense to talk about testing the application on 
> every possible input before you declare the application "valid" and 
> ready to deploy. Rather you rely on identifying the types of data 
> variations that can occur in the input data and creating tests for 
> examples of each of the types of variations that you can identify. For 
> example, does the mapping work if there is whitespace in a data 
> column? How about special characters, Unicode, etc.? If the developers 
> are thorough, experienced, and diligent in testing, the error rate for 
> users can be made quite low. For the use cases I am involved with, 
> creating the R2RML mappings is very similar to this kind of 
> application development with respect to "proving" the mapping valid 
> over the data.
>
> With respect to dropping data, for the use cases I am working with, it 
> is simply not an option for the R2RML processor to drop the data. 
> Think of something like these purely made up examples that illustrate 
> the reality of the apps I deal with:
>  
> * a query to determine if too many toxins are present in the drinking 
> water. Data cannot be silently dropped.
> * a query to determine if a jet liner part has been subject to too 
> much stress and needs to be replaced. Data cannot be silently dropped.
> * a query to determine if a person has any money left in their bank 
> account. Data cannot be silently dropped.
>
> I recognize that there are cases where all of the data could be 
> validated once at startup and then the mapping declared "valid". I 
> also recognize that there are cases where it would be acceptable to 
> drop rows of data that do not produce valid RDF. However, I believe 
> these must be treated as special cases and that R2RML needs to be 
> general enough to support the cases I described above. Furthermore I 
> don't think the default R2RML behavior should cater to these special 
> cases by dropping data by default.
>
> I hope that makes clearer the kind of R2RML usage patterns that 
> my organization is facing.
>
> Again, thanks for the ongoing discussion.
> -David

Received on Tuesday, 12 July 2011 14:08:19 UTC