Re: Addressing ISSUE-47 (invalid and relative IRIs) from David McNeil on 2011-07-11 (public-rdb2rdf-wg@w3.org from July 2011)

From: David McNeil <dmcneil@revelytix.com>
Date: Mon, 11 Jul 2011 10:03:22 -0500
To: Richard Cyganiak <richard@cyganiak.de>
Cc: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-ID: <CA+8Vvdzdo9sXz8ngf0ze3B7h-g8gph1j1MMfdYMDeXDmMi5XTg@mail.gmail.com>
Richard - I appreciate the ongoing discussion. I find it helpful, and I hope
it is useful to you as well.

> It is reasonable to expect the R2RML processor to flag the error. It is not
> reasonable to expect the R2RML processor to silently produce broken data.
>

<aside>
I think part of my mental model is that an R2RML processor could be an
intermediate stage in an overall ETL-style pipeline. As an intermediate
stage it could be producing an intermediate form that requires additional
processing before it is valid RDF. Such a model could be useful either as a
stage in a pipeline or if the data will be used by more relaxed tools.
However, I realize this is not what the working group has been charged to
create, so I will try to set that aside from my thinking.
</aside>

> "bad" outputs are *bad*. “Trash in, trash out” is no excuse here. The
> contract of R2RML is that the output is an RDF dataset, and if there's an
> invalid IRI in it then it is not an RDF dataset and the contract is
> violated. Especially for large datasets this is bad.
>

What are your thoughts on a usage pattern where a mapping is defined, SPARQL
is executed against the mapping, and only column values are returned? The
resource identifiers are used internally by the query processing, but are
not produced as an output. Is there any merit in relaxing the requirements
for valid IRIs in this case?

> So the spec would define one more kind of artefact, the R2RML data
> validator, in addition to the ones already specified (R2RML processor, R2RML
> mapping document, R2RML mapping graph).
>

This seems to have merit for consideration. I think the data validator spec
would need to accommodate usage modes other than the batch processing mode
that you describe. If the triples defined by the mapping are not
materialized, but queried as virtual triples, then it seems we should allow
the data validator to be executed at query-time on the virtual triples
accessed by the query.

> In R2RML this situation is easily solved by just using a SQL query to do the
> concatenation, rather than a template which would do percent-encoding. So I
> find it's fine to always percent-encode in templates.
>

So column references would never percent-encode, templates always would. If
the user wants to build a URI from pre-encoded parts they would define it as
part of a logical table (i.e. a SQLQuery in the mapping), reference the
resulting column in a term map, and R2RML would not attempt to re-encode the
column. I am trying to make sure I understand your suggestion.
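
To make that reading concrete, here is a rough Python sketch. It is not spec text: expand_template and column_as_iri are hypothetical helper names, the example.com values are made up, and urllib.parse.quote is only an approximation of whatever encoding rule the spec ends up defining.

```python
import re
from urllib.parse import quote

def expand_template(template, row):
    """Fill each {column} placeholder, percent-encoding the inserted value."""
    def encode(match):
        # safe="" also encodes "/" and "%", so a value cannot inject URL structure
        return quote(str(row[match.group(1)]), safe="")
    return re.sub(r"\{([^{}]+)\}", encode, template)

def column_as_iri(column, row):
    """Column reference: the value is used verbatim, with no encoding."""
    return str(row[column])

# Hypothetical row: one raw value, one URL pre-built (and pre-encoded) in the database
row = {"name": "John Smith",
       "permalink": "http://example.com/2011/07/hello%20world"}

templated = expand_template("http://example.com/person/{name}", row)
verbatim = column_as_iri("permalink", row)
# Passing the pre-built URL through a template would encode it a second time
double_encoded = expand_template("{permalink}", row)
```

Under these assumptions the pre-built permalink survives only when it is referenced as a column; routing it through a template turns "%20" into "%2520" and encodes the separators as well.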

> Why would you want to turn off percent-encoding completely when generating
> IRIs?
>

For cases (like the WordPress example) where snippets of URLs are pre-built
in the database columns. This could mean the columns contain URL separators
or they are already URL-encoded. These column values could be in the
underlying data or in the columns of logical tables defined in the mapping
itself. I think this is a valid usage pattern that we would need to support.
If I understand your position you would say R2RML can accommodate this
because the user can always define a SQL query to produce the IRI and thus
avoid the automatic URL-encoding of the R2RML templates? I can understand
this position, although personally I would prefer to define a way for the
user to control the URL-encoding performed by templates. However, I can see
that we might declare this to be a post-R2RML-1.0 feature.


> We have two transformations when generating RDF using an rr:sqlQuery:
>
>  Values in base table ==1==> values in logical table ==2==> RDF terms
>
> rr:template and percent-encoding are part of step 2. Step 2 is designed so
> that it is always reversible given the information provided by the user.
>
> rr:inverseExpression is only about reversing step 1.
>

Perhaps I am confused, but I am not able to match this description to my
understanding of inverseExpression.

If I have a mapping that produces IRIs like: http://John%20Smith from a
database column with values like "John Smith", then I would expect to be
able to write a SPARQL query that selects data for http://John%20Smith.
Furthermore I would expect to be able to write an inverseExpression that
allowed the R2RML processor to deconstruct http://John%20Smith and obtain
the original data value of "John Smith". When I write the inverseExpression
it seems that I need to know which parts of the IRI were pre-URL-encoded
(i.e. the data value in the database is URL-encoded) and which were
URL-encoded by the mapping, so that I could URL-decode the right parts.
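
The ambiguity I have in mind can be sketched in Python as follows. This is only an illustration: the PREFIX value and both function names are assumptions of mine, not anything defined by R2RML.

```python
from urllib.parse import unquote

PREFIX = "http://example.com/person/"  # hypothetical IRI prefix

def invert_mapping_encoded(iri):
    """The mapping did the encoding: strip the prefix, then percent-decode."""
    return unquote(iri[len(PREFIX):])

def invert_pre_encoded(iri):
    """The database value was already encoded: strip the prefix only."""
    return iri[len(PREFIX):]

iri = "http://example.com/person/John%20Smith"

# The same IRI leads to two different candidate column values
from_mapping_encoding = invert_mapping_encoded(iri)
from_pre_encoded_column = invert_pre_encoded(iri)
```

Both inversions are plausible for the same IRI, so the processor (or the author of the inverseExpression) has to know which one applies in order to look up the right value in the database.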

-David
Received on Monday, 11 July 2011 15:03:49 UTC