Re: Addressing ISSUE-47 (invalid and relative IRIs) from Richard Cyganiak on 2011-07-08 (public-rdb2rdf-wg@w3.org from July 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Fri, 8 Jul 2011 19:43:48 +0100
To: David McNeil <dmcneil@revelytix.com>
Cc: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-Id: <66E947B7-C127-40B0-8360-D91933968F7B@cyganiak.de>
On 8 Jul 2011, at 18:10, David McNeil wrote:
>> This way, you're silently letting broken data into your system, which will likely blow up later somewhere in the pipeline.
> 
> Right, as a developer that is what I want. Then I will fix the blow ups and make it all work or change the mapping to suppress the data (if that is what is appropriate to do). As you point out below it won't necessarily be "silent".

It is reasonable to expect the R2RML processor to flag the error. It is not reasonable to expect the R2RML processor to silently produce broken data.

Here's what would happen.

Imagine an ETL-style R2RML processor that dumps to N-Triples. So you dump your database to N-Triples, which takes 24 hours. You ship the dump off to another part of the organization, who load it into their triple store, which takes 48 hours. Except it fails after 40 hours because the N-Triples file is invalid because it contains broken IRIs. You get an angry call. You fix your mapping, and re-run the export. It takes another 24 hours. You ship the new dump off, and they import it. Except it fails after 42 hours because of another broken IRI from another column... You see where I'm going with this. Let's not write a spec that expects implementations to output broken data.

(I can also see the downside of the scenario where you get an angry call after two weeks when it turns out, in an important demo to a customer, that 20% of the data is missing. So I take your point that the “silently drop it” approach has a problem; but I don't agree that this proposed solution is acceptable.)

> It seems to me that R2RML defines a mapping layer and it is certainly possible to give it "bad" inputs that produce "bad" outputs.

"bad" outputs are *bad*. “Trash in, trash out” is no excuse here. The contract of R2RML is that the output is an RDF dataset, and if there's an invalid IRI in it then it is not an RDF dataset and the contract is violated. Especially for large datasets this is bad.

A story from the early days of DBpedia. DBpedia is very much a “trash in, trash out” operation. You wouldn't believe the kind of rubbish that the parser has to deal with in some of the 5 million or so Wikipedia pages. Our early N-Triples dumps had all sorts of problems in them -- invalid IRIs, bad Unicode literals, and so on. After each new release we'd get complaints as people tried to load them into different stores, which choked on different things. Eventually, we wrote an N-Triples serializer that fixed up anything that didn't slavishly follow the specs. The complaints stopped. I remember one complaint afterwards which turned out to be a bug in the Virtuoso N-Triples parser rather than in our code. (They fixed it quickly.)

>> We could perhaps define a separate notion of a "data error". Data errors don't cause a mapping to be invalid, and would still not generate any triples. But R2RML processors MAY offer an option to scan the database for data errors. Generating invalid IRIs or ill-typed literals would be data errors.
>> 
> 
> This strikes me as worth considering.

So how about this:

[[
A DATA ERROR is a condition of the data in the input database that would lead to the generation of invalid RDF data. R2RML processors handle data errors by silently not creating the offending triples, and the presence of data errors does not make an R2RML mapping non-conforming. Data errors cannot generally be detected by analyzing the table schema of the database, but only by scanning the data in the tables. An R2RML DATA VALIDATOR is a system that checks an input database for the presence of data errors.

An R2RML DATA VALIDATOR is a system that takes as its input an R2RML mapping, a base IRI, and a SQL connection to an input database, and checks for the presence of DATA ERRORS in the input database. When checking the database, a data validator MUST report any DATA ERRORS that occur in the process of generating the output dataset.

If the value generated from a term map with term type “IRI” is not a valid IRI, then no RDF term is generated. This is a DATA ERROR.

If the value generated from a term map with term type “Literal” is a typed literal whose datatype IRI is a supported datatype, and whose lexical form is not in the lexical space of the datatype, then no RDF term is generated. This is a DATA ERROR.
]]

So the spec would define one more kind of artefact, the R2RML data validator, in addition to the ones already specified (R2RML processor, R2RML mapping document, R2RML mapping graph).
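
To make the proposal a bit more concrete, here is a rough Python sketch of what such a data validator might do. It is only an illustration under stated assumptions: the rows are plain dicts rather than a live SQL connection, the IRI check is a simplified approximation (a scheme plus no forbidden characters, not the full RFC 3987 grammar), values are assumed to be already resolved against the base IRI, and xsd:integer is the only datatype checked. The names `DataError`, `looks_like_valid_iri` and `validate` are made up for this example.

```python
import re
from dataclasses import dataclass

# Simplified checks (assumption: a real validator would use the full IRI grammar).
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|\\^`]')   # characters never allowed raw in an IRI
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")  # absolute IRIs start with a scheme
_XSD_INTEGER = re.compile(r"[+-]?[0-9]+")           # lexical space of xsd:integer


@dataclass
class DataError:
    row: int
    column: str
    value: str
    reason: str


def looks_like_valid_iri(value):
    """Approximate validity test: has a scheme and contains no forbidden characters."""
    return bool(_SCHEME.match(value)) and not _FORBIDDEN.search(value)


def validate(rows, iri_columns, integer_columns):
    """Report every data error instead of silently dropping the offending term."""
    errors = []
    for i, row in enumerate(rows):
        for col in iri_columns:
            v = row.get(col)
            if v is not None and not looks_like_valid_iri(v):
                errors.append(DataError(i, col, v, "invalid IRI"))
        for col in integer_columns:
            v = row.get(col)
            if v is not None and not _XSD_INTEGER.fullmatch(v):
                errors.append(DataError(i, col, v, "not in lexical space of xsd:integer"))
    return errors


# Hypothetical logical table rows.
rows = [
    {"uri": "http://example.com/emp/1", "age": "42"},
    {"uri": "http://example.com/emp/John Smith", "age": "42"},  # space makes the IRI invalid
    {"uri": "http://example.com/emp/3", "age": "forty"},        # ill-typed literal
]
errors = validate(rows, iri_columns=["uri"], integer_columns=["age"])
for e in errors:
    print(e)
```

A normal R2RML processor would skip the two offending terms; the validator is the optional tool that tells you they exist before you ship the dump.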

>>> This thinking also causes me to reconsider silently suppressing rows with null values in template expressions.
>>> 
>> 
>> 
>> This is NULL and silently suppressed:
>> 
>>    [] rr:column "'asdf/'||id";
>> 
>> Why do you expect this to be different?
>> 
>>    [] rr:template "asdf/{id}";
>> 
> 
> I can't quite make sense of what you are asking here. In particular I am having trouble parsing "'asdf/'||id" as a column name. 

Sorry, what I meant was to contrast these options:

    [] rr:logicalTable [ rr:sqlQuery "SELECT *, 'asdf/' || id AS uri FROM table" ];
       rr:subjectMap [ rr:column "uri" ].

    [] rr:logicalTable [ rr:tableName "table" ];
       rr:subjectMap [ rr:template "asdf/{id}" ].

In the absence of NULLs (and ignoring the question of percent-encoding), both are equivalent. And the former silently suppresses NULLs because concatenating a string with NULL is NULL. So I'd expect the second to do the same. (Any expression involving NULL is NULL, except magic ones like IS NULL or COALESCE.)
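
The NULL behaviour of the two options can be tried out with a small Python sketch. It uses an in-memory SQLite database as a stand-in for the input database; the table name `t` and the helper `expand_template` are assumptions made for this example, and percent-encoding is deliberately left out, as in the comparison above.

```python
import re
import sqlite3

# In-memory database standing in for the input database (hypothetical table "t").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [("1",), (None,), ("3",)])

# Option 1: concatenate in SQL. 'asdf/' || NULL evaluates to NULL.
rows = conn.execute("SELECT 'asdf/' || id AS uri FROM t ORDER BY rowid").fetchall()
sql_uris = [uri for (uri,) in rows]


def expand_template(template, row):
    """Expand {column} references; yield None if any referenced column is NULL."""
    out = []
    pos = 0
    for m in re.finditer(r"\{([^}]+)\}", template):
        value = row.get(m.group(1))
        if value is None:
            return None  # same outcome as NULL propagation in the SQL expression
        out.append(template[pos:m.start()])
        out.append(str(value))
        pos = m.end()
    out.append(template[pos:])
    return "".join(out)


# Option 2: expand the template over the same values.
template_uris = [expand_template("asdf/{id}", {"id": v}) for v in ("1", None, "3")]

print(sql_uris)
print(template_uris)
```

In both cases the middle row yields no value, so no subject and no triples would be generated for it.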

>> I would still like to know the use case for having {Name} in an IRI template not do percent-encoding.
>> 
> 
> Have you seen from your D2RQ experience a use for this? 

I can remember one. Mapping the WordPress database schema, where you have URI fragments such as "/2011/07/07/my-blogpost-title" in the DB, and I wanted to generate a URI from that. Percent-encoding would encode the slashes, which would yield an incorrect final URI. This certainly isn't the common case.

In R2RML this situation is easily solved by just using a SQL query to do the concatenation, rather than a template, which would do percent-encoding. So I find it's fine to always percent-encode in templates.
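
As a quick illustration of the difference, here is a small Python sketch. It uses `urllib.parse.quote` with `safe=""` as a stand-in for template percent-encoding (an assumption; the exact set of characters R2RML escapes is whatever the spec ends up saying), and `http://example.com/blog` is a hypothetical base.

```python
from urllib.parse import quote

base = "http://example.com/blog"          # hypothetical site prefix
path = "/2011/07/07/my-blogpost-title"    # value as stored in the database

# Template-style generation: everything outside the unreserved set is escaped,
# including the slashes, so the path structure is lost.
encoded = base + quote(path, safe="")

# SQL-concatenation style: the column value is used as is.
raw = base + path

print(encoded)  # http://example.com/blog%2F2011%2F07%2F07%2Fmy-blogpost-title
print(raw)      # http://example.com/blog/2011/07/07/my-blogpost-title
```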

>> Perhaps we can leverage more of the URI Templates spec?

Probably.

>> I skimmed through some of it but it was not obvious if there was a way to completely turn off percent encoding. 
>> 

Why would you want to turn off percent-encoding completely when generating IRIs?

> Taking the example from the R2RML spec, if we wanted to add percent decoding to this:
> 
>  rr:inverseExpression "{deptno} = substr({deptId},length('Department')+1)"
> 
> Would the user do something like this (suppose the deptno column was percent encoded):
> 
>  rr:inverseExpression "{deptno} = urldecode(substr({deptId},length('Department')+1))"

No, you don't have to change the inverseExpression at all.

We have two transformations when generating RDF using an rr:sqlQuery:

  Values in base table ==1==> values in logical table ==2==> RDF terms

rr:template and percent-encoding are part of step 2. Step 2 is designed so that it is always reversible given the information provided by the user.

rr:inverseExpression is only about reversing step 1.
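
The split between the two steps can be sketched in Python as follows. This is only an illustration: the IRI prefix `http://example.com/dept/` and the function names are assumptions, `urllib.parse.quote`/`unquote` stand in for the template's percent-encoding, and step 1 mirrors the 'Department' concatenation from the spec example.

```python
from urllib.parse import quote, unquote

TEMPLATE_PREFIX = "http://example.com/dept/"  # hypothetical template: prefix + {deptId}


def step1(deptno):
    """Base table value -> logical table value (the concatenation in rr:sqlQuery)."""
    return "Department" + deptno


def step2(dept_id):
    """Logical table value -> IRI (template expansion with percent-encoding)."""
    return TEMPLATE_PREFIX + quote(dept_id, safe="")


def reverse_step2(iri):
    """Undo the template: strip the fixed prefix and percent-decode the rest."""
    if not iri.startswith(TEMPLATE_PREFIX):
        return None
    return unquote(iri[len(TEMPLATE_PREFIX):])


def reverse_step1(dept_id):
    """What the inverse expression states: substr(deptId, length('Department')+1)."""
    return dept_id[len("Department"):]


deptno = "10/A"                      # a value that needs percent-encoding
iri = step2(step1(deptno))
recovered = reverse_step1(reverse_step2(iri))

print(iri)
print(recovered)
```

The percent-decoding happens entirely in `reverse_step2`, which the processor can derive from the template on its own; the user-supplied inverse expression only has to undo the SQL concatenation.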

Best
Richard
Received on Friday, 8 July 2011 18:44:18 UTC