Re: Addressing ISSUE-47 (invalid and relative IRIs) from David McNeil on 2011-07-12 (public-rdb2rdf-wg@w3.org from July 2011)

From: David McNeil <dmcneil@revelytix.com>
Date: Tue, 12 Jul 2011 08:41:30 -0500
To: Richard Cyganiak <richard@cyganiak.de>
Cc: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-ID: <CA+8VvdxxwDzA67fm3CkoYBSkZzD4e=Voojw1jife8GV+m471ew@mail.gmail.com>

Richard- It seems to me that we must have different usage scenarios in mind.
The real use cases I am dealing with have these properties:

* the database being mapped is too large to check every row for validity in
batch mode
* the data is constantly changing so even if we did check it in batch mode,
we would need to start again as soon as we were done
* "the data" is not even available at mapping development time, so it is not
possible for the mapping to be tested with all of the data
* no data can be dropped. It is better to throw an error than to produce
incomplete (i.e. wrong) results.

From my perspective, these requirements are very typical of a transaction
processing application built on a database. Say you have a web app that
allows users to make travel arrangements. How could you ever know that the
application produced correct results without checking it on every possible
set of data values that users could enter? It makes little sense to talk
about testing the application on every possible input before you declare
the application "valid" and ready to deploy. Rather you rely on identifying
the types of data variations that can occur in the input data and creating
tests for examples of each of the types of variations that you can identify.
For example, does the mapping work if there is whitespace in a data column?
How about special characters, Unicode, etc.? If the developers are thorough,
experienced, and diligent in testing, the error rate for users can be made
quite low. For the use cases I am involved with, creating the R2RML mappings
is very similar to this kind of application development with respect to
"proving" the mapping valid over the data.
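
As a rough illustration of testing by category of data variation, here is a
minimal Python sketch. Everything in it is an assumption made for
illustration: the helper names `iri_from_template` and `looks_valid`, the
`{ID}` placeholder, the example.com template, and the choice to percent-encode
column values are hypothetical and are not taken from any R2RML draft. The
validity check is deliberately crude and is no substitute for a real IRI
parser.

```python
from urllib.parse import quote

# Characters that must not appear literally in an IRI, plus the space.
FORBIDDEN = set('<>"{}|\\^` ')


def iri_from_template(template: str, value: str) -> str:
    """Hypothetical helper: put a percent-encoded column value into a template."""
    return template.replace("{ID}", quote(value, safe=""))


def looks_valid(iri: str) -> bool:
    """Rough check only: reject forbidden and control characters."""
    return not any(ch in FORBIDDEN or ord(ch) < 0x20 for ch in iri)


# One representative value per kind of variation, not every possible row.
SAMPLES = {
    "plain": "42",
    "whitespace": "New York",
    "special characters": "a/b?c#d<e>",
    "unicode": "Zürich",
}

results = {}
for kind, value in SAMPLES.items():
    iri = iri_from_template("http://example.com/city/{ID}", value)
    assert looks_valid(iri), f"mapping breaks on {kind}: {iri!r}"
    results[kind] = iri
```

Each entry in `SAMPLES` stands in for a whole class of inputs, which is the
same trade-off an application developer makes when the full data set cannot
be enumerated.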

With respect to dropping data, for the use cases I am working with, it is
simply not an option for the R2RML processor to drop the data. Think of
something like these purely made up examples that illustrate the reality of
the apps I deal with:

* a query to determine if too many toxins are present in the drinking water.
Data cannot be silently dropped.
* a query to determine if a jet liner part has been subject to too much
stress and needs to be replaced. Data cannot be silently dropped.
* a query to determine if a person has any money left in their bank account.
Data cannot be silently dropped.

I recognize that there are cases where all of the data could be validated
once at startup and then the mapping declared "valid". I also recognize that
there are cases where it would be acceptable to drop rows of data that do
not produce valid RDF. However, I believe these must be treated as special
cases and that R2RML needs to be general enough to support the cases I
described above. Furthermore, I don't think the default R2RML behavior should
cater to these special cases by dropping data by default.
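
The difference between the two defaults can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the function
`generate_subjects`, the `on_invalid` switch, the `InvalidIRIError` class, and
the `homepage` column are all hypothetical and do not come from the R2RML
drafts, and `is_usable_iri` is a crude stand-in for a real IRI check.

```python
FORBIDDEN = set('<>"{}|\\^` ')


class InvalidIRIError(ValueError):
    """Raised when a row would yield a string that is not a usable IRI."""


def is_usable_iri(value: str) -> bool:
    # Crude stand-in for a real check: needs a scheme separator and
    # must not contain forbidden or control characters.
    return ":" in value and not any(
        ch in FORBIDDEN or ord(ch) < 0x20 for ch in value
    )


def generate_subjects(rows, on_invalid="error"):
    """Turn the (hypothetical) homepage column of each row into a subject IRI."""
    subjects = []
    for row in rows:
        value = row["homepage"]
        if is_usable_iri(value):
            subjects.append(value)
        elif on_invalid == "drop":
            continue  # dropping is an explicit opt-in, never the default
        else:
            raise InvalidIRIError(f"row {row['id']}: {value!r} is not a usable IRI")
    return subjects


ROWS = [
    {"id": 1, "homepage": "http://example.com/a"},
    {"id": 2, "homepage": "not an iri"},
]

try:
    generate_subjects(ROWS)  # default: fail loudly on the bad row
except InvalidIRIError as err:
    print("error:", err)

print(generate_subjects(ROWS, on_invalid="drop"))  # opt-in: bad row is skipped
```

With the error default, a query over incomplete data fails visibly. With the
drop option, the same query succeeds and returns fewer results, with nothing
in the output to show that a row is missing.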

I hope that clarifies the kind of R2RML usage patterns that my
organization is facing.

Again, thanks for the ongoing discussion.
-David

Received on Tuesday, 12 July 2011 13:41:58 UTC