R2RML - Last Call Comments from David McNeil on 2011-10-29 (public-rdb2rdf-comments@w3.org from October 2011)

From: David McNeil <dmcneil@revelytix.com>
Date: Sat, 29 Oct 2011 10:26:05 -0500
To: public-rdb2rdf-comments@w3.org
Message-ID: <CA+8VvdxKuaoU0uKNhcWRRami+0OjYN0-9=WyhDwAoJZ6uKUojw@mail.gmail.com>
I reviewed the Last Call version of the R2RML spec. My comments are below.

-David

====

Section 4

* "The base IRI MUST be a valid IRI. It SHOULD end in a slash (“/”)
character." - I would have thought this would say that it should end in an
IRI delimiter, rather than specifically referring to a slash.

* "Resolution of relative IRIs in R2RML uses simple string concatenation
instead of the more complex algorithm defined in RFC 3986. This ensures
that the original database value can be reconstructed from the generated
IRI." - I did not understand this note until I read the corresponding note
in section 11.2. I suppose that is because the latter note uses the phrase
"obtaining an absolute IRI from a relative IRI", which has a more obvious,
plain meaning than the phrase "resolution of relative IRIs". So I would
propose changing this note to use the phrasing from the note in section
11.2.

* "An R2RML processor MAY include an R2RML data validator, but this is not
required." - This seems to obscure the requirement from section 4.3 that an
R2RML processor must also detect the same errors that an R2RML data
validator detects.

Section 4.1

* "SHOULD NOT contain any mapping components (logical tables, term maps,
predicate-object maps) that are not referenced by a triples map (in other
words, are “unused”)" - I think the list of mapping components should
either be complete, or formally defined elsewhere. If I am reading it
correctly, it omits join condition and referencing object map. This seems
like a needless lack of clarity in the spec.

* "MAY assign IRIs or blank node identifiers to any mapping component in
order to enable re-use of mapping components within the mapping graph." - I
see that the term "mapping component" is used again here, so maybe it does
warrant a formal definition. I see the term "mapping constructs" is used
just below this. Are these the same thing?

* "For example, an IRI that represents a subject map may be used as the
subject map of multiple triples maps; and may even be used as an object map
of another triples map if it has the right properties." - I notice that
this is not an exhaustive list of how resources may be reused. Since it
says "For example", does this count as non-normative? I only raise these
questions. I am not enough of a spec-lawyer to know how best to handle this.

* "Using these classes is OPTIONAL in a mapping graph." - I think this
wording could lead to confusion as it could be read to mean that an R2RML
mapping could be defined without a TriplesMap. I think what is meant, as
the next sentence clarifies, is that the triples defining the types of the
mapping components are optional.

* Since it doesn't say that an R2RML mapping graph MUST contain at least
one TriplesMap, the implication is that an empty mapping graph that
produces no triples is valid R2RML. I just mention this to confirm that is
the intent.

* "The applicable class of a resource can always be inferred from its
properties." - I think this captures a requirement on implementors that
should be made more explicit. To get started processing an R2RML mapping,
implementors must infer that a resource is a TriplesMap based on the
presence of any of the mapping predicates which are required and have a
domain of TriplesMap. Once you have the TriplesMaps then the rest of the
types are implied by their position in the graph. This also raises the
question of whether it should be required that TriplesMaps resources be
explicitly identified as "a TriplesMap".

Section 4.2

* "The preferred file extension is .ttl." - Is preferred a weaker form of
SHOULD?

* "It is common to use document-local IRIs in mapping documents by defining
the default prefix in the beginning of the document, and using it for
creating IRIs for mapping components such as triples maps" - This note does
not seem like it belongs in the spec.

Section 4.3

* "A data error is a condition of the data in the input database that would
lead to the generation of an invalid RDF term, such as an invalid IRI or an
ill-typed literal." - This seems like quite a loose definition. Since we
have a more formal definition just below, can we strike this?

* "A term map with a datatype override produces an ill-typed literal of a
supported RDF datatype." Where "ill-typed" is defined in section 10 as "A
typed literal of a supported RDF datatype is ill-typed if its lexical form
is not in the lexical space of the RDF datatype identified by its datatype
IRI." - Doesn't this definition of a mandatory error contradict our stated
intent of: if data values fall outside of the intersection of SQL data
types and XSD data types then the behavior is undefined (but throwing an
error is not required)? I don't remember seeing a place in the spec that
captures our intent regarding the intersection of SQL and XSD data types.

* "A logical table is a possibly virtual database table that is to be
mapped to RDF triples. A logical table is either..." - I would remove the
loose definition of "a possibly virtual database table" and combine these
two sentences into one. The first time I read the sentence I thought it was
talking about database views and database tables.

Section 5.1

* "A SQL base tables or views is represented by a resource that has exactly
one rr:tableName property." - change "tables"->"table" and "views"->"view"

* "If no catalog or schema are specified" - change "are"->"is"

* "If no catalog or schema are specified, then the default catalog and
default schema of the SQL connection are assumed." - I think this is
potentially confusing because it could be read to suggest that the R2RML
implementor is supposed to do something wtih the default catalog/schema.
But in fact the intent is for the identifier to be passed unchanged across
the SQL connection.

Section 5.2

* "A SQL query is a SELECT query in the SQL language that can be executed
over the input database." - How about explicitly saying that it is a string
representation of a SELECT query?

* "It MUST be a valid SQL query if executed over the SQL connection." -
This could be read as "if a SQL query is executed over the SQL connection
then it must be valid". I think the intent is that a SQL query must be a
string that is valid to execute on the SQL connection as a query.

* "It MUST NOT have duplicate column names or unnamed derived columns in
the SELECT list." - It seems obvious now but when I first read this I had
trouble understanding what was meant by "unnamed derived columns". I think
it means if the query includes expressions as projected columns then they
must be named.

* "For any database objects referenced without an explicit catalog name or
schema name, the default catalog and default schema of the SQL connection
are used." - Same comment as from previous section. I think this statement
could be improved to clarify that an R2RML processor does not need to do
anything to accomplish using the default catalog/schema.

Section 6

* "specifies how to generate the subjects for each row of the logical
table" - I think this would be clearer if it said "generate a subject for
each row". Otherwise it sounds like it might be possible for a subject map
to generate multiple subjects from each row.

* "together with the subjects generated by the subject map" - Similar to
previous comment, changing "subjects"->"subject" I think removes ambiguity.

* "The referenced columns of all term maps of a triples map (subject map,
predicate maps, object maps, graph maps) MUST be column names that exist in
the term map's logical table." - Per the previous definition of "column
name" this implies that the values cannot be qualified. Is that the intent?
So a column reference in a mapping cannot be "EMP.JOB", it must be "JOB"?

Section 6.3

* "A predicate-object map is a function that creates predicate-object pairs
from logical table rows." - Perhaps this would be clearer if it said
"creates a predicate-object pair for each logical table row". This would
avoid ambiguity about how many predicate-object pairs are created per
logical table row.

Section 7.1

* "Occurrances of these properties..." - "Occurrances" -> "Occurrences"

* "aaa rr:subject bbb." - Using aaa and bbb as sample values makes the
table messy, IMO. Would "x" and "y" serve the same purpose but look cleaner?

Section 7.2

* "the singleton set containing the value of rr:column" - For clarity, I
think this should be changed to something like "... the value of _the
rr:column property_". Otherwise it could be read as describing a set
containing rr:column.

Section 7.3

* "Backslash characters MUST be escaped by doubling them with another
backslash character." - This reads ambiguously to me. It could be read as
saying to use two or three backslashes. This possibility for confusion is
compounded by the Turtle backslash handling. For clarity it could be
changed to something like "Backslash characters MUST be escaped by
preceding them with another backslash character."

* "If a template contains multiple pairs of unescaped curly braces, then
adjacent pairs SHOULD be separated..." - I don't think the word "adjacent"
is helpful here and it could lead to confusion. I think the word can simply
be dropped.

* On first reading it is not clear how to produce an IRI from database
columns without having the R2RML processor perform percent encoding. It
seems to me to be worth adding a note mentioning that this can be done by
creating a SQL query based logical table that includes an expression
building up an IRI value from the database. This would avoid using a
template and thus avoid the automatic percent encoding.

Section 7.6

* "...that does not have a specified langauge tag." - "langauge" ->
"language"

Section 7.7

* "A quoted and escaped data value is a SQL literal that can be used in a
SQL query, such as.." - Seems to me that this needs a bit more definition.
What does quoting mean? What does escaping mean?

Section 8

* It would be good for this section to reference the place in section 11
that describes how triples are generated from the "joint SQL query".
Otherwise this section appears to lead to a dead-end. It defines the joint
SQL query but doesn't say what to do with it.

* "SELECT * FROM ({child-query}) AS tmp" - I would change this to just:
child-query or: {child-query}

* "SELECT * FROM ({child-query}) AS child, ({parent-query}) AS parent" -
>From a SQL perspective, this feels wrong because it doesn't address the
issue of column name collisions between the child and parent query. Is the
thought that we can just ignore this because we describe in section 11 that
the projected columns are split between child and parent?

* "If the child query and parent query of a referencing object map are not
identical, then the referencing object map MUST have at least one join
condition." - I had to think a fair bit to come up with a scenario where
the child and parent query would be the same and there would not be any
join conditions. I finally concluded it would be a table like FAMILY, with
HUSBAND and WIFE columns. Both the HUSBAND and the WIFE values become
resources. Two TriplesMaps are defined on this table, one for Men and
another for Women. A referencing object map could be used to generate the
triples that tie these two resources together. This would result in a child
query and parent query being identical with no join conditions. In the
interest of making the spec understandable, I think it would be good to add
an example like this.

* I think it is worth noting in this section that the join condition
between child and parent does not have to be a 1:1 relationship. It can be
M:N.

Section 9.1

* "then the triples will share the same single blank node." - The word
"single" seems un-needed and awkward in this sentence. I would remove it.

Section 10

* "and in hte" - change "hte" -> "the"

Section 11

* "The output dataset MUST NOT contain any other RDF triples or named
graphs besides these." - I am curious to hear the thought process behind
this text. I assume it is to have well defined behavior for interop and
mapping portability? In our R2RML implementation, we intend to generate
additional triples that provide metadata about the generated triples. So
this text seems overly restrictive to us. I can see a few ways to address
our need: a) our system provides access to some other dataset which
includes triples from the R2RML mapping as well as additional triples b)
relax this text to allow the output dataset to include other named graphs
besided those defined in the mapping c) relax this text completely to allow
extra triples in the default graph and/or the named graphs defined by the
mapping.

Section 11.1

* "Each generated triple is placed into one or more particular graphs" - I
would strike the word "particular" from this sentence.

* "has no object map (but a referencing object map)" - This sentence was
hard to parse on first read. I think it would be clearer if it said "(but
_it does have_ a referencing...)"

* "has no referencing object map (but a normal object map)" - Same comment
as previous comment.

* "Target graphs, a set of zero or more IRIs" - To my reading of the steps,
the target graph is either a single IRI (i.e. rr:defaultGraph) or a set of
IRIs. I don't think it is ever a set of size zero. Futhermore, I think the
steps above need to be changed to produce singleton sets of rr:defaultGraph.

* "For each possible combination <s, p, o>" - For me this is a confusing
way to describe the process. I don't think it is helpful to describe the
subjects, predicates, and objects as sets and then take the cross-product
of them. As far as I can tell this description only works in the document
because the subjects, predicates, and objects are only ever singleton sets.
Furthermore the steps are not written to produce sets. So I would change
the description of "adding triples to the output dataset" to define
subject, predicate, and object as single values. Then taking those single
values and generating triples for each value in Target graphs.

* "adding triples to the output dataset" is defined as a term but the place
where that term is referenced uses different text. This makes it hard to
search the document for places whether the term is referenced. Therefore I
propose using the same text in both the definition and the uses.

Section 11.2

* "A term map is a function that generates a set of RDF terms from a
logical table row." - This statement is slightly different from the
statement in section 7: "A term map is a function that generates an RDF
term from a logical table row.". I think these two statements should match.

* This section seems to have two different positions running through it:
    a) term maps produce a set of RDF terms (where the set is either empty
or a singleton set?)
    b) term maps produce an RDF term or nothing
  It seems to me that we need to pick one or the other of these positions
and make all the sentences consistent with that positions (both here and in
earlier sections like section 7).

* "if the term map references a NULL value" - Since we have a definition
for "referenced columns" we should use that text here and make it a link.
Then the language could be tightened to something like "if any of the term
map's referenced columns have a NULL value"

Section B.2

* I found it a bit confusing to click on the property links (e.g. rr:child)
and to be taken to another document (I expected the properties to be
defined in the spec). Furthermore it is not clear to me (maybe I missed
something?) whether this related document is normative:
http://www.w3.org/ns/r2rml#

* "Note that additional constraints not stated in this table might apply,
and making a property forbidden or required in certain situations." - This
doesn't seem to be a grammatically correct sentence to me.

* With the addition of shortcuts, I think parts of this table are wrong.
e.g. rr:objectMap is now an optional property, so the cardinality for it
should be "0...1" rather than "1". I think this applies to several of the
properties. But, actually it is quite useful from a user and implementor
perspective to see a list of which _sets_ of properties are required.
Meaning that either rr:object or rr:objectMap is required. One way to
address this would be to list the shortcuts separately and define what
underlying property they imply.

* "R2RML view" and "SQL base table or view" are two concepts that appear in
the "Context" column but do not have corresponding classes in the schema.
Both of these concepts are informal sub-types of LogicalTable. This seems
like enough of a wrinkle that it warrants some additional explanation. I
certainly had to spend several minutes tracing everything to find all the
connections. Alternatively, why don't we create classes for these two
concepts?

* The referenced schema for rr:sqlQuery, rr:sqlVersion, and rr:tableName do
not specify a domain for these properties. Don't these all have a domain of
LogicalTable?
Received on Saturday, 29 October 2011 15:26:36 UTC