Direct Mapping Spec - Comments from David McNeil on 2011-08-10 (public-rdb2rdf-wg@w3.org from August 2011)

From: David McNeil <dmcneil@revelytix.com>
Date: Wed, 10 Aug 2011 10:12:20 -0500
To: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-ID: <CA+8VvdwQsOqH3MMWQfNXbKdb-Q_rETFad0axgErbSb8-BrumQg@mail.gmail.com>

I read the latest Direct Mapping spec [1] (I only skimmed the appendices). Below,
identified by section number, are the comments I had while reading it.

-David

[1] http://www.w3.org/2001/sw/rdb2rdf/directMapping/EGP

====

1 - "intended to provide a default behavior for R2RML" - It might be worth
reconsidering the wording of this to avoid implying that R2RML prescribes
this as default behavior.

1 - "It can be also used" - awkward sentence structure

2 - Wrong URL for RFC3987 link.

2 - I found the sudden transition to talking about FKs to be a bit jarring.
Maybe there is a way to make this flow better?

2 - "This graph is composed of relative IRIs" - I know this has been
discussed on the mailing list, but this is non-standard, eh? Isn't IRI
prefixing a serialization issue? Also, does the user provide the base IRI?
As I recall, a goal was for the direct mapping to run without any
user-configurable options beyond pointing it at a database.
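
To illustrate the question about relative IRIs, here is a minimal Python
sketch of resolving relative references against a base at load time. The
base IRI and the relative references are made-up values for illustration,
not taken from the spec; the sketch only shows that a base has to come from
somewhere before the graph contains absolute IRIs.

```python
from urllib.parse import urljoin

# Assumed values for illustration only; the spec excerpt does not fix them.
BASE_IRI = "http://foo.example/DB/"
relative_refs = ["People/ID=7", "People#fname"]

# Resolve each relative reference against the user-supplied base.
absolute_iris = [urljoin(BASE_IRI, ref) for ref in relative_refs]

for iri in absolute_iris:
    print(iri)
```

Changing `BASE_IRI` changes every resulting IRI, which is why the base is
effectively a configuration option.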

2.1 - For clarity, the "People" PRIMARY KEY clause should not be on the same
line as the "addr" field.

2.1 - Per standard SQL, I think the string literals in the INSERT statements
should have single quotes so they are not interpreted as identifiers.

2.1 - I think using the first row of a table for DB metadata is confusing
given the widely understood model of having the column names as the first
row, especially considering that the fonts are the same. Maybe if the
metadata were in non-bold italics it would be easier to read?

2.2 - "compound and composite" - At first glance this seems redundant.

2.2 - "People tables's" - It is still early, but that can't be the right
apostrophe use?

2.2 - "The referent identifier (object of the above predicate)" - For
clarity, I would just say "the object"

2.2 - +1 to Souri's observation that this approach does not handle multiple
foreign keys from the same columns.

2.2 - ":(deptName, deptCity) is a multi-column foreign key in the table
People which references the multi-column candidate key (name, city) in the
table Department." This is awkward to read and is just a repetition of the
formal FK definition in the DDL. I would omit it.

2.3 - I realized that I wasn't sure if I was reading the spec, or reading an
example. It seems to me that the text needs to be more clearly identified,
paragraph by paragraph, as either an informal description of the spec
or a concrete example. For example, the R2RML spec highlights examples with
an alternate color and a surrounding label/box. Personally I think I would
swap the order of sections 3 & 2 or intersperse the examples from section 2
into section 3.

2.3 - "would have been generated" - The text would be clearer to read if we
stuck to a more active voice.

2.4 - "(for keeping track of tweets in Twitter)" - I would find a way to
remove the parens.

2.4 - "It is not possible to dereference blank nodes" - I don't immediately
see what the point of this statement is.

2.5 - I suspect this has been discussed at great length in the past, but
from my perspective the way blank node identifiers are used in this example
seems to create implementation pain. In particular the processing of a row
in a table is not simply a function of that row. Rather, it must access the
"global" list of what blank node identifier is used for each database value
that is used as an FK to a non-PK. For this reason, the way we solved this
problem at Revelytix was to use the data value itself to form the
identifier. I think this applies whether an IRI or a blank node is used to
identify the PK-less row.
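
The idea of forming the identifier from the data value itself can be
sketched in Python as follows. This is only an illustration under stated
assumptions, not the Revelytix implementation: `node_label` is a
hypothetical helper, the hashing scheme is one arbitrary choice, and the
table and column values are sample data. The point is that the referenced
row and the referencing row each compute the same label from their own
values, with no shared lookup table.

```python
import hashlib

def node_label(table, key_values):
    """Hypothetical helper: derive a blank node label from key values alone."""
    # Join with a separator unlikely to occur in data; str() is a
    # simplification, so 1 and "1" would map to the same label.
    payload = "\x1f".join([table, *(str(v) for v in key_values)])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"_:b{digest}"

# Sample rows (assumed data): Department has no primary key here, but
# (name, city) is a candidate key referenced by People.
dept_row = {"name": "accounting", "city": "Cambridge"}
people_row = {"fname": "Bob", "deptName": "accounting", "deptCity": "Cambridge"}

# Subject label, computed while processing the Department row.
subject = node_label("Department", (dept_row["name"], dept_row["city"]))
# Object label, computed independently while processing the People row.
obj = node_label("Department", (people_row["deptName"], people_row["deptCity"]))

print(subject == obj)
```

Because each label is a pure function of one row's values, rows can be
processed in any order or in parallel.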

3 - At Revelytix we have found it useful to define two base URIs: one for
ontology URIs and another for data URIs.
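
A minimal Python sketch of the two-base arrangement follows. The class
name, the example.com bases, and the IRI layouts are all hypothetical and
chosen only for illustration; the sketch shows schema-level terms and
row-level identifiers being minted under separate bases.

```python
from dataclasses import dataclass
from urllib.parse import quote

def _enc(text):
    # Percent-encode everything outside the unreserved set.
    return quote(str(text), safe="")

@dataclass(frozen=True)
class Bases:
    """Hypothetical pair of bases: one for ontology terms, one for data."""
    ontology: str
    data: str

    def property_iri(self, table, column):
        # Schema-level term, minted under the ontology base.
        return f"{self.ontology}{_enc(table)}#{_enc(column)}"

    def row_iri(self, table, pk_column, pk_value):
        # Row-level identifier, minted under the data base.
        return f"{self.data}{_enc(table)}/{_enc(pk_column)}={_enc(pk_value)}"

bases = Bases(ontology="http://example.com/schema/", data="http://example.com/data/")
print(bases.property_iri("People", "fname"))
print(bases.row_iri("People", "ID", 7))
```

With this split, the same ontology base can be shared across several
databases while each database keeps its own data base.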

3 - "all labels are generated by appending to a base." - I think someone
else mentioned this already, but it seems referring to the IRIs as "labels"
is confusing and we should use more precise words here.

3 - "the percent-encoded form of the column value" - This presupposes a text
representation of the column value. Is it specified elsewhere how to get a
text representation?
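
The dependency on a text representation can be seen in a short Python
sketch. The `to_text` rules below are assumptions made up for this
example (hex for binary, ISO 8601 for dates, `str()` otherwise); they are
not defined by the spec text quoted here, which is exactly the gap in
question.

```python
from datetime import date
from decimal import Decimal
from urllib.parse import quote

def to_text(value):
    """Hypothetical lexical-form rule; each branch is a design decision."""
    if isinstance(value, bytes):
        return value.hex()          # assumed: lowercase hex for binary
    if isinstance(value, date):
        return value.isoformat()    # assumed: ISO 8601 for dates
    return str(value)               # assumed: default string form

def encode_value(value):
    # Percent-encoding operates on text, so to_text must run first.
    return quote(to_text(value), safe="")

for v in ["Hello World/1", date(2011, 8, 10), Decimal("1.50"), b"\x01\xff"]:
    print(repr(v), "->", encode_value(v))
```

A different `to_text` rule, for example dropping the trailing zero of a
decimal, would yield a different IRI for the same row.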

3 - "fresh blank node" - Personally, this seems OK to me, but do we need more
precise words for this?

3 - "A (potentially unary)" - I encountered several places like this where I
found the parens distracting.

3 - "Definition property IRI:" At one point I found myself mis-reading this
as a definition of the term "definition property IRI". The R2RML spec seems
to define terms more clearly with a formatted construct like: "A _data
error_ is a condition of the data in the"

A.1 - I think the English Syntax should be shown by default.

A & B - I stopped reading it closely, but (at the risk of stating the
obvious and of stirring up previous compromises) it seems like an
over-abundance of notations. Truly it is hard to tell how many of them there
are, and it seems it will be challenging to keep them all in sync as the spec
evolves. I would remove some of them.

Other thoughts, perhaps these have been addressed in past discussions
already and I just don't know the answers:

 * do we need to say anything about how a direct mapping generator finds a
database?

 * do we need to say anything about which schema to map?

 * how about synonyms in the database? We have found this to be a pain point
in practice.

 * does it need a mechanism for omitting the schema tables from the mapping?

 * I notice the spec is silent about case sensitivity of database
identifiers. I suppose it is implied that the casing used in the database
metadata is preserved?
Received on Wednesday, 10 August 2011 15:12:57 UTC