Re: Addressing ISSUE-47 (invalid and relative IRIs) from David McNeil on 2011-07-08 (public-rdb2rdf-wg@w3.org from July 2011)

From: David McNeil <dmcneil@revelytix.com>
Date: Fri, 8 Jul 2011 12:10:08 -0500
To: Richard Cyganiak <richard@cyganiak.de>
Cc: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-ID: <CA+8VvdxHNCrbm+CWAjAmi9UxNNNmqMPoaSMpUBaORdbPvHSThg@mail.gmail.com>
Richard-

On Fri, Jul 8, 2011 at 10:35 AM, Richard Cyganiak <richard@cyganiak.de> wrote:

> I don't quite agree with this comparison. Swallowing exceptions may leave
> the application in an undefined state. Not mapping invalid data is a
> well-defined behaviour. It's more like silently catching and ignoring an
> exception at runtime.
>

That is exactly what I meant by "swallowing exceptions".


> > Because of my background and the product I am working on I am more
> > concerned with the second use case. Driven by this I would say that for
> > ISSUE-47 and ISSUE-51 the R2RML implementation should simply generate these
> > triples and pass them downstream.
>
> This way, you're silently letting broken data into your system, which will
> likely blow up later somewhere in the pipeline.
>

Right, as a developer that is what I want. Then I will fix the blow ups and
make it all work or change the mapping to suppress the data (if that is what
is appropriate to do). As you point out below it won't necessarily be
"silent".


> Jena accepts invalid IRIs when reading RDF, but throws an exception when
> writing them. Sesame will throw exceptions for *some* invalid IRIs, not
> others. Jena accepts broken typed literals like "aaa"^^xsd:integer. Sesame
> throws exceptions for them unless you specifically configure it not to.
>
>

> The point is that invalid IRIs *are not* IRIs, and hence the generated
> triples would *not be* RDF triples, and the result would *not be* an RDF
> graph or RDF dataset, and I wouldn't want to specify that R2RML generates
> RDF datasets that violate the RDF and SPARQL specs.
>

It seems to me that R2RML defines a mapping layer and it is certainly
possible to give it "bad" inputs that produce "bad" outputs. For example,
users could use a column as an IRI, but that column might contain "bad"
data. I don't think this means that R2RML has failed. It seems to me that
what you are proposing would be specifying that R2RML generates RDF datasets
that contain _some_ of the relational data. That doesn't strike me as
correct. How would you explain it to users when they complain that some of
their rows were dropped from the output?

> We could perhaps define a separate notion of a "data error". Data errors
> don't cause a mapping to be invalid, and would still not generate any
> triples. But R2RML processors MAY offer an option to scan the database for
> data errors. Generating invalid IRIs or ill-typed literals would be data
> errors.
>

This strikes me as worth considering.
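
To make the "data error" idea concrete, here is a minimal sketch of what an optional scan might look like. The function name `find_data_errors` and the row format (a list of dicts) are assumptions for illustration only, and the check is a rough character-level test based on the characters the SPARQL IRIREF production excludes, not full IRI validation.

```python
import re

# Characters that may not appear in an IRI reference (per the SPARQL
# IRIREF production): control characters, space, and <>"{}|\^`
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|\\^`]')


def find_data_errors(rows, column):
    """Return the indexes of rows whose value would yield an invalid IRI.

    NULL (None) values are skipped: they generate no triple and are
    not treated as data errors in this sketch.
    """
    errors = []
    for index, row in enumerate(rows):
        value = row[column]
        if value is not None and _FORBIDDEN.search(str(value)):
            errors.append(index)
    return errors


rows = [
    {"uri": "http://example.com/a"},
    {"uri": "http://example.com/a b"},  # contains a space
    {"uri": None},
]
print(find_data_errors(rows, "uri"))
```

A scan like this reports the offending rows without changing the mapping result, so users could at least find out which rows were dropped and why.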



> > This thinking also causes me to reconsider silently suppressing rows with
> > null values in template expressions.
>
> This is NULL and silently suppressed:
>
>    [] rr:column "'asdf/'||id";
>
> Why do you expect this to be different?
>
>    [] rr:template "asdf/{id}";
>

I can't quite make sense of what you are asking here. In particular I am
having trouble parsing "'asdf/'||id" as a column name.

> In D2RQ, {Name} doesn't do %-encoding, but {Name|urlencode} does. This
> works fine, and I could support a solution along these lines.
>
> The only problem is that users often forget it, which often works fine on
> the first few examples they try, but fails on the occasional string that
> contains a funny character.
>
> I would still like to know the use case for having {Name} in an IRI
> template not do percent-encoding.
>

Have you seen from your D2RQ experience a use for this?


> (I note here that the {Name} syntax is inspired by URI Templates [1], and
> that it percent-encodes by default, but you can switch it partially off by
> doing {+Name}. I didn't really read the URI Templates spec in detail.)
>
> [1] http://tools.ietf.org/html/draft-gregorio-uritemplate-04
>
>
Perhaps we can leverage more of the URI Templates spec? I skimmed through
some of it but it was not obvious if there was a way to completely turn off
percent encoding.
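
For discussion, here is a minimal sketch of template expansion that percent-encodes by default and can be switched off per placeholder. The function `expand_template` is hypothetical, and the `+` marker here turns encoding off entirely, which is a simplification and not the URI Templates behaviour. A NULL column value suppresses the whole result, as in the current draft.

```python
import re
from urllib.parse import quote

# Matches {column} or {+column}
_PLACEHOLDER = re.compile(r"\{(\+?)([^{}]+)\}")


def expand_template(template, row):
    """Expand placeholders in an IRI template against one row.

    {col}  inserts the percent-encoded column value.
    {+col} inserts the raw column value (assumed syntax for this sketch).
    Returns None if any referenced column is NULL (None).
    """
    missing = False

    def substitute(match):
        nonlocal missing
        raw = match.group(1) == "+"
        value = row.get(match.group(2))
        if value is None:
            missing = True
            return ""
        text = str(value)
        # safe="" so that "/" is encoded as well
        return text if raw else quote(text, safe="")

    result = _PLACEHOLDER.sub(substitute, template)
    return None if missing else result


print(expand_template("asdf/{id}", {"id": "a b/c"}))
print(expand_template("asdf/{+id}", {"id": "a b/c"}))
print(expand_template("asdf/{id}", {"id": None}))
```

With encoding on by default, the "funny character" case is handled without the user having to remember anything; the raw form stays available for columns that already hold complete IRIs.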


> > On a related issue, I think that as we introduce %-encoding on the
> > mapping side we need to define how the inverse operation is performed in
> > inverse expressions.
>
> I think the reversing is an implementation detail that doesn't have to be
> defined in the spec. The spec just has to define the triples in the virtual
> output dataset, and has to define what a correct rr:inverseExpression is.
> How to use the inverseExpression is up to implementers. (That being said,
> there's nothing hard about reversing the percent-encoding, and it's
> deterministic.)
>

Hmm... I am not sure about this.  Taking the example from the R2RML spec, if
we wanted to add percent decoding to this:

 rr:inverseExpression "{deptno} = substr({deptId},length('Department')+1)"


Would the user do something like this (suppose the deptno column was percent
encoded):

 rr:inverseExpression "{deptno} = urldecode(substr({deptId},length('Department')+1))"
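
As a rough illustration of the round trip in question, the sketch below mirrors the spec example in Python, assuming deptId is built as 'Department' followed by the percent-encoded deptno. The helper names `make_dept_id` and `deptno_from_id` are hypothetical, and `unquote` stands in for whatever `urldecode` function the database would provide.

```python
from urllib.parse import quote, unquote

PREFIX = "Department"


def make_dept_id(deptno):
    """Forward mapping: prefix plus the percent-encoded deptno."""
    return PREFIX + quote(str(deptno), safe="")


def deptno_from_id(dept_id):
    """Inverse mapping: strip the prefix, then percent-decode.

    Corresponds to urldecode(substr(deptId, length('Department')+1)).
    """
    return unquote(dept_id[len(PREFIX):])


encoded = make_dept_id("R&D 10")
print(encoded)
print(deptno_from_id(encoded))
```

The decoding step is deterministic, but the sketch shows that someone has to apply it: either the user writes it into the inverse expression, or the processor adds it implicitly.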


-David
Received on Friday, 8 July 2011 17:10:36 UTC