Re: Addressing ISSUE-47 (invalid and relative IRIs)

Hi David,

On 8 Jul 2011, at 14:30, David McNeil wrote:
> I see two different perspectives on the mapping issue.
> 
> 1) a relatively casual user wants to expose a relational database as RDF and wants it to "just work". I can see in this mode that it could make sense to just silently ignore rows that might cause trouble (e.g. rows with null values, rows that produce IRIs with spaces, or rows that produce text values that claim to be numbers).
> 

> 2) a software developer building an application that includes mapping a relational database to RDF. In this mode I think it is very troublesome for rows to just silently disappear from the output. This is like software silently swallowing exceptions (typically a bad practice that makes debugging much more difficult).

I don't quite agree with this comparison. Swallowing exceptions is bad because it may leave the application in an undefined state. Not mapping invalid data is a well-defined behaviour, so it's not quite like silently catching and ignoring an exception at runtime.

> Because of my background and the product I am working on I am more concerned with the second use case. Driven by this I would say that for ISSUE-47 and ISSUE-51 the R2RML implementation should simply generate these triples and pass them downstream.

This way, you're silently letting broken data into your system, which will likely blow up later somewhere in the pipeline.

Jena accepts invalid IRIs when reading RDF, but throws an exception when writing them. Sesame will throw exceptions for *some* invalid IRIs, but not for others. Jena accepts broken typed literals like "aaa"^^xsd:integer. Sesame throws exceptions for them unless you specifically configure it not to.
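
To illustrate the Jena case, here is a minimal sketch against the Jena 2.x API (assuming the default settings, where eager literal validation is off):

    import com.hp.hpl.jena.datatypes.DatatypeFormatException;
    import com.hp.hpl.jena.datatypes.xsd.XSDDatatype;
    import com.hp.hpl.jena.rdf.model.*;

    public class IllTypedLiteralDemo {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            // Jena happily creates the ill-typed literal "aaa"^^xsd:integer ...
            Literal lit = m.createTypedLiteral("aaa", XSDDatatype.XSDinteger);
            try {
                // ... and the problem only surfaces when somebody asks for the value.
                Object value = lit.getValue();
                System.out.println(value);
            } catch (DatatypeFormatException e) {
                System.err.println("Ill-typed literal detected late: " + e.getMessage());
            }
        }
    }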

The point is that invalid IRIs *are not* IRIs, and hence the generated triples would *not be* RDF triples, and the result would *not be* an RDF graph or RDF dataset. I wouldn't want to specify that R2RML generates RDF datasets that violate the RDF and SPARQL specs.

As I said, the only other option I can think of is raising an error. The problem with raising an error is that it would likely not be seen at startup, because you don't want to scan 100 million rows for invalid data when the mapping is loaded. So it would only surface at query time, where the error isn't very useful -- it's likely to manifest as an exception seen by an unsuspecting user, not by the mapping author. At best it would end up in some log.

We could perhaps define a separate notion of a "data error". Data errors don't cause a mapping to be invalid, and would still not generate any triples. But R2RML processors MAY offer an option to scan the database for data errors. Generating invalid IRIs or ill-typed literals would be data errors.
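
For concreteness, such a scan might look roughly like this (a sketch only; the table EMP, the column EMPNO, and the assumption that EMPNO is mapped to xsd:integer literals are all made up for the example):

    import java.sql.*;

    public class DataErrorScan {
        public static void main(String[] args) throws Exception {
            // args[0] is the JDBC URL of the database to scan.
            Connection con = DriverManager.getConnection(args[0]);
            Statement stmt = con.createStatement();
            // EMPNO is assumed to be mapped to xsd:integer literals.
            ResultSet rs = stmt.executeQuery("SELECT EMPNO FROM EMP");
            while (rs.next()) {
                String value = rs.getString(1);
                if (value == null) continue;   // NULL produces no triple, but is not a data error
                try {
                    new java.math.BigInteger(value);
                } catch (NumberFormatException e) {
                    System.err.println("Data error: '" + value
                            + "' is not in the lexical space of xsd:integer");
                }
            }
            rs.close();
            stmt.close();
            con.close();
        }
    }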

> This thinking also causes me to reconsider silently suppressing rows with null values in template expressions.

If id is NULL, this is NULL (because 'asdf/'||NULL evaluates to NULL in SQL) and the row is silently suppressed:

    [] rr:column "'asdf/'||id";

Why do you expect this to behave differently?

    [] rr:template "asdf/{id}";

> What if we made the %-encoding optional in templates? So for example this would not perform %-encoding:
>     rr:template "{Name}"
> 
> But this would:
>     rr:template "{%Name}"

In D2RQ, {Name} doesn't do %-encoding, but {Name|urlencode} does. This works fine, and I could support a solution along these lines.
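
For clarity, here is roughly what I mean by the %-encoding step (a sketch only; which characters are left unencoded is exactly the open question, so the RFC 3986 "unreserved" set below is just an assumption for the example):

    import java.nio.charset.Charset;

    public class PercentEncode {
        static String percentEncode(String s) {
            StringBuilder sb = new StringBuilder();
            for (byte b : s.getBytes(Charset.forName("UTF-8"))) {
                int v = b & 0xFF;
                // Leave RFC 3986 "unreserved" characters alone (an assumption for this sketch).
                if ((v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                        || (v >= '0' && v <= '9') || v == '-' || v == '.'
                        || v == '_' || v == '~') {
                    sb.append((char) v);
                } else {
                    sb.append(String.format("%%%02X", v));
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(percentEncode("Hello World/é"));   // Hello%20World%2F%C3%A9
        }
    }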

The only problem is that users often forget it; the mapping then works fine on the first few examples they try, but fails on the occasional string that contains a funny character.

I would still like to know the use case for having {Name} in an IRI template not do percent-encoding.

(I note here that the {Name} syntax is inspired by URI Templates [1], and that it percent-encodes by default, but you can switch it partially off by doing {+Name}. I didn't really read the URI Templates spec in detail.)

[1] http://tools.ietf.org/html/draft-gregorio-uritemplate-04

> On the last telecon we discussed defining functions for the user to invoke to perform %-encoding but there was some concern about making it more difficult to parse the mapping. 

I share these concerns.

> On a related issue, I think that as we introduce %-encoding on the mapping side we need to define how the inverse operation is performed in inverse expressions.

I think the reversing is an implementation detail that doesn't have to be defined in the spec. The spec just has to define the triples in the virtual output dataset, and has to define what a correct rr:inverseExpression is. How to use the inverseExpression is up to implementers. (That being said, there's nothing hard about reversing the percent-encoding, and it's deterministic.)
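
To illustrate that last point, reversing the encoding is a few lines of code (a sketch only, not a required implementation strategy; it assumes that unencoded characters are plain ASCII, as they would be after the encoding step sketched above):

    public class PercentDecode {
        // Reverses %XX escapes; deterministic, as noted above.
        static String percentDecode(String s) throws java.io.UnsupportedEncodingException {
            java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '%' && i + 2 < s.length()) {
                    out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                    i += 2;
                } else {
                    out.write(c);   // assumed to be ASCII
                }
            }
            return out.toString("UTF-8");
        }

        public static void main(String[] args) throws Exception {
            System.out.println(percentDecode("Hello%20World%2F%C3%A9"));   // Hello World/é
        }
    }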

Best,
Richard

Received on Friday, 8 July 2011 15:36:05 UTC