Re: Addressing ISSUE-47 (invalid and relative IRIs) from Richard Cyganiak on 2011-07-12 (public-rdb2rdf-wg@w3.org from July 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Tue, 12 Jul 2011 22:06:09 +0100
To: David McNeil <dmcneil@revelytix.com>
Cc: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-Id: <18366A17-88C6-47C9-A118-4FBF87DED9D7@cyganiak.de>

David,

Well, then how about this updated proposal:

[[
A DATA ERROR is a condition of the data in the input database that would lead to the generation of an invalid RDF term, such as an invalid IRI or an ill-typed literal. 

When providing access to the output dataset, an R2RML processor MUST abort any operation that requires inspecting or returning an RDF term whose generation would give rise to a data error, and report an error to the agent invoking the operation. A conforming R2RML processor MAY however allow other operations that do not require inspecting or returning these RDF terms, and thus MAY provide partial access to an output dataset that contains data errors. Nevertheless, an R2RML processor SHOULD report data errors as early as possible.

The presence of data errors does not make an R2RML mapping non-conforming.

Informative note: Data errors cannot generally be detected by analyzing the table schema of the database, but only by scanning the data in the tables. For large and rapidly changing databases, this can be an expensive or even impossible operation. Therefore, R2RML processors are allowed to answer queries that do not “touch” a data error, and the behavior of such operations is well-defined. For the same reason, the conformance of R2RML mappings is defined regardless of the presence of data errors.

An R2RML DATA VALIDATOR is a system that takes as its input an R2RML mapping, a base IRI, and a SQL connection to an input database, and checks for the presence of data errors. When checking the input database, a data validator MUST report any DATA ERRORS that are raised in the process of generating the output dataset.

If the value generated from a term map with term type “IRI” is not a valid IRI, then a DATA ERROR is generated.

If the value generated from a term map with term type “Literal” is a typed literal whose datatype IRI is a supported datatype, and whose lexical form is not in the lexical space of the datatype, then a DATA ERROR is generated.
]]
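
To make the intended behavior concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not R2RML syntax or any processor's actual API: every name in it (DataError, make_iri, query_row, validate, the example.com IRIs) is hypothetical, and the IRI check is a rough approximation rather than the full RFC 3987 grammar. It shows an operation that touches only clean rows succeeding, an operation that touches a bad row aborting with an error, and a data validator that scans everything.

```python
import re

class DataError(Exception):
    """Raised when input data would produce an invalid RDF term."""

# Assumption: a rough stand-in for IRI validity (scheme plus no characters
# that are never allowed in an IRI); not the full RFC 3987 grammar.
_IRI_RE = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*:[^\s<>"{}|\\^`]*')
_XSD_INTEGER_RE = re.compile(r'[+-]?[0-9]+')
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

def make_iri(value):
    # Term type "IRI": the generated value must be a valid IRI.
    if not _IRI_RE.fullmatch(value):
        raise DataError(f"invalid IRI: {value!r}")
    return value

def make_integer_literal(lexical):
    # Term type "Literal": the lexical form must be in the lexical space
    # of the datatype (here xsd:integer).
    if not _XSD_INTEGER_RE.fullmatch(lexical):
        raise DataError(f"ill-typed xsd:integer literal: {lexical!r}")
    return (lexical, XSD_INTEGER)

# Hypothetical input table.
ROWS = [
    {"id": "1", "age": "42"},
    {"id": "2 b", "age": "17"},      # the space makes the subject IRI invalid
    {"id": "3", "age": "unknown"},   # not a valid xsd:integer lexical form
]

def triples_for(row):
    """Generate the triples for one row; terms are checked only when built."""
    subject = make_iri("http://example.com/person/" + row["id"])
    age = make_integer_literal(row["age"])
    return [(subject, "http://example.com/age", age)]

def query_row(index):
    """An operation that touches a single row; aborts on a data error."""
    return triples_for(ROWS[index])

def validate(rows):
    """Data validator: scans all rows and reports every row with a data error."""
    errors = []
    for i, row in enumerate(rows):
        try:
            triples_for(row)
        except DataError as exc:
            errors.append((i, str(exc)))
    return errors
```

In this sketch, query_row(0) returns a triple even though other rows contain data errors, query_row(1) raises DataError, and validate(ROWS) reports rows 1 and 2. A real processor would of course push most of this work into SQL; the point is only that errors surface when the offending term is actually needed.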

Best,
Richard



On 12 Jul 2011, at 14:41, David McNeil wrote:

> Richard- It seems to me that we must have different usage scenarios in mind. The real use cases I am dealing with have these properties:
> 
> * the database being mapped is too large to check every row for validity in batch mode
> * the data is constantly changing so even if we did check it in batch mode, we would need to start again as soon as we were done
> * "the data" is not even available at mapping development time, so it is not possible for the mapping to be tested with all of the data
> * no data can be dropped. It is better to throw an error than to produce incomplete (i.e. wrong) results.
> 
> From my perspective these requirements are very typical of a transaction processing application built on a database. Say you have a web app that allows users to make travel arrangements: how could you ever know that the application produced correct results without checking it on every possible set of data values that users could enter? It makes little sense to talk about testing the application on every possible input before you declare the application "valid" and ready to deploy. Rather, you rely on identifying the types of data variations that can occur in the input data and creating tests for examples of each of the types of variations that you can identify. For example, does the mapping work if there is whitespace in a data column? How about special characters, Unicode, etc.? If the developers are thorough, experienced, and diligent in testing, the error rate for users can be made quite low. For the use cases I am involved with, creating the R2RML mappings is very similar to this kind of application development with respect to "proving" the mapping valid over the data.
> 
> With respect to dropping data, for the use cases I am working with, it is completely not an option for the R2RML processor to drop the data. Think of something like these purely made up examples that illustrate the reality of the apps I deal with:
>  
> * a query to determine if too many toxins are present in the drinking water. Data cannot be silently dropped.
> * a query to determine if a jet liner part has been subject to too much stress and needs to be replaced. Data cannot be silently dropped.
> * a query to determine if a person has any money left in their bank account. Data cannot be silently dropped.
> 
> I recognize that there are cases where all of the data could be validated once at startup and then the mapping declared "valid". I also recognize that there are cases where it would be acceptable to drop rows of data that do not produce valid RDF. However, I believe these must be treated as special cases and that R2RML needs to be general enough to support the cases I described above. Furthermore I don't think the default R2RML behavior should cater to these special cases by dropping data by default.
> 
> I hope that makes clearer the kinds of R2RML usage patterns that my organization is facing.
> 
> Again, thanks for the ongoing discussion.
> -David
Received on Tuesday, 12 July 2011 21:06:39 UTC