Re: R2RML - Last Call Comments from Richard Cyganiak on 2011-11-07 (public-rdb2rdf-comments@w3.org from November 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 7 Nov 2011 17:04:12 +0000
To: David McNeil <dmcneil@revelytix.com>
Cc: public-rdb2rdf-comments@w3.org
Message-Id: <F344D4DB-FE19-4CE3-A914-9A69E519920B@cyganiak.de>
Hi David,

Thanks for these comments. Your attention to detail is much appreciated!

I have made changes to the Editor's Draft to address your comments:
http://www.w3.org/2001/sw/rdb2rdf/r2rml/

A few points need further discussion in the WG. In these cases I have raised tracker issues and added a link to the relevant issue.

Can you please let us know whether this response addresses your comments (except those that are noted as open issues and will be addressed later)?

Detailed response to your individual comments are inline.


On 29 Oct 2011, at 16:26, David McNeil wrote:
> Section 4
> 
> * "The base IRI MUST be a valid IRI. It SHOULD end in a slash (“/”) character." - I would have thought this would say that it should end in an IRI delimiter, rather than specifically referring to a slash.

The motivation for this is to make base URI resolution reversible.

Let's assume a base URI that ends in a delimiter other than “/” – let's say “#”:

  Base IRI: http://example.com/ns/data?
  Relative IRI: foo
  Resolved absolute IRI according to RFC 3987: http://example.com/ns/foo
  Resolved absolute IRI according to R2RML 11.2: http://example.com/ns/data?foo

As you can see, the absolute IRI constructed in 11.2 is different from the one that you'd get if you followed RFC 3987.

If the “SHOULD” that you quote above is followed – that is, if the base IRI ends in a slash – then the two algorithms tend to produce the same result. This minimises surprise for users who read “base IRI” and expect that the algorithm for resolving against the base IRI is the same as in other specs (RFC 3986).

On a closer examination of RFC 3986 I realise that ending in “/” is not enough if there's an “?” or “#” in the base IRI, so I updated the SHOULD to also discourage those two characters.
http://www.w3.org/2001/sw/rdb2rdf/r2rml/#dfn-base-iri

> * "Resolution of relative IRIs in R2RML uses simple string concatenation instead of the more complex algorithm defined in RFC 3986. This ensures that the original database value can be reconstructed from the generated IRI." - I did not understand this note until I read the corresponding note in section 11.2. I suppose that is because the latter note uses the phrase "obtaining an absolute IRI from a relative IRI", which has a more obvious, plain meaning than the phrase "resolution of relative IRIs". So I would propose changing this note to use the phrasing from the note in section 11.2.

I removed the Note in Section 11.2, and replaced the Note in Section 4 with wording that's closer to the wording from 11.2. Furthermore, I added an explanation that points out under which conditions exactly both algorithms produce the same result.
http://www.w3.org/2001/sw/rdb2rdf/r2rml/#dfn-base-iri

> * "An R2RML processor MAY include an R2RML data validator, but this is not required." - This seems to obscure the requirement from section 4.3 that an R2RML processor must also detect the same errors that an R2RML data validator detects.

This is incorrect. An R2RML processor does *not* have to detect the same errors as an R2RML data validator. Quoting from 4.3:

[[
A conforming R2RML processor may, however, allow other operations that do not require inspecting or returning these RDF terms [that give rise to a data error], and thus may provide partial access to an output dataset that contains data errors. Nevertheless, an R2RML processor should report data errors as early as possible.
]]
http://www.w3.org/2001/sw/rdb2rdf/r2rml/#dfn-data-error

I think that this section is very clear and specific about the requirements for R2RML processors related to data errors.

> Section 4.1
> 
> * "SHOULD NOT contain any mapping components (logical tables, term maps, predicate-object maps) that are not referenced by a triples map (in other words, are “unused”)" - I think the list of mapping components should either be complete, or formally defined elsewhere. If I am reading it correctly, it omits join condition and referencing object map. This seems like a needless lack of clarity in the spec.

I added a formal definition for mapping components:
http://www.w3.org/2001/sw/rdb2rdf/r2rml/#dfn-mapping-component

I changed the sentence you quote:

[[
SHOULD contain only mapping components that are referenced by some triples map (in other words, all mapping components should actually be “used” in the mapping).
]]

> * "MAY assign IRIs or blank node identifiers to any mapping component in order to enable re-use of mapping components within the mapping graph." - I see that the term "mapping component" is used again here, so maybe it does warrant a formal definition. I see the term "mapping constructs" is used just below this. Are these the same thing?

I changed everything to “mapping component” and linked it to the defintion.

> * "For example, an IRI that represents a subject map may be used as the subject map of multiple triples maps; and may even be used as an object map of another triples map if it has the right properties." - I notice that this is not an exhaustive list of how resources may be reused. Since it says "For example", does this count as non-normative? I only raise these questions. I am not enough of a spec-lawyer to know how best to handle this.

Quoting Section 3:

[[
As well as sections marked as non-normative in the section heading, all diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
]]
http://www.w3.org/2001/sw/rdb2rdf/r2rml/#conformance

Anything that is introduced as “For example, …” is non-normative.

> * "Using these classes is OPTIONAL in a mapping graph." - I think this wording could lead to confusion as it could be read to mean that an R2RML mapping could be defined without a TriplesMap. I think what is meant, as the next sentence clarifies, is that the triples defining the types of the mapping components are optional.

Changed:

[[
Explicitly typing the resources in a mapping graph with these classes is OPTIONAL.
]]

> * Since it doesn't say that an R2RML mapping graph MUST contain at least one TriplesMap, the implication is that an empty mapping graph that produces no triples is valid R2RML. I just mention this to confirm that is the intent.

That is the intent. I have no opinion either way. We should have a test case for that.

> * "The applicable class of a resource can always be inferred from its properties." - I think this captures a requirement on implementors that should be made more explicit. To get started processing an R2RML mapping, implementors must infer that a resource is a TriplesMap based on the presence of any of the mapping predicates which are required and have a domain of TriplesMap. Once you have the TriplesMaps then the rest of the types are implied by their position in the graph.

I added the following text to make this clearer:

[[
Explicit typing of the resources in a mapping graph with R2RML classes is optional. Their presence in a graph has no effect on the behaviour of an R2RML processor. The mapping component represented by any given resource in a mapping graph is defined by the presence or absence of certain properties, as defined throughout this specification.
]]

> This also raises the question of whether it should be required that TriplesMaps resources be explicitly identified as "a TriplesMap".

For reasons of consistency I would say no.

Authors or implementers would not benefit significantly from this explicit typing. It is easy to recognize potential rr:TriplesMaps by the presence of a handful of properties, e.g., every triples map MUST have exactly one rr:subject or rr:subjectMap. On the other hand, requiring explicit typing increases the risk of errors – the explicit type can disagree with the inferred type. It is redundant.

> Section 4.2
> 
> * "The preferred file extension is .ttl." - Is preferred a weaker form of SHOULD?

The Turtle spec says that the “recommended” file extension is .ttl, and RFC 2119 states that “RECOMMENDED” means the same as “SHOULD” (although the Turtle spec doesn't actually reference RFC 2119). I changed the sentence to:

[[
The file extension .ttl should be used.
]]

> 
> * "It is common to use document-local IRIs in mapping documents by defining the default prefix in the beginning of the document, and using it for creating IRIs for mapping components such as triples maps" - This note does not seem like it belongs in the spec.

I removed the note.

> Section 4.3
> 
> * "A data error is a condition of the data in the input database that would lead to the generation of an invalid RDF term, such as an invalid IRI or an ill-typed literal." - This seems like quite a loose definition. Since we have a more formal definition just below, can we strike this?

I combined both parts into one:

[[
A data error is a condition of the data in the input database that would lead to the generation of an invalid RDF term. The following conditions give rise to data errors:

	• A term map with term type rr:IRI results in the generation of an invalid IRI.
	• A term map with a datatype override produces an ill-typed literal of a supported RDF datatype.
]]

> * "A term map with a datatype override produces an ill-typed literal of a supported RDF datatype." Where "ill-typed" is defined in section 10 as "A typed literal of a supported RDF datatype is ill-typed if its lexical form is not in the lexical space of the RDF datatype identified by its datatype IRI." - Doesn't this definition of a mandatory error contradict our stated intent of: if data values fall outside of the intersection of SQL data types and XSD data types then the behavior is undefined (but throwing an error is not required)?

I'm not sure I follow. Can you give an example where this definition requires a data error against our stated intention?

> I don't remember seeing a place in the spec that captures our intent regarding the intersection of SQL and XSD data types.

Again, can you give an example where the spec doesn't capture that intention?

(Informally speaking, the spec has two modes: If you don't “override” the datatype by explicitly stating a different rr:datatype, then any standard SQL type is translated to XSD and this by definition cannot yield a data error. Any nonstandard type is turned into a plain literal by casting it to string.

> * "A logical table is a possibly virtual database table that is to be mapped to RDF triples. A logical table is either..." - I would remove the loose definition of "a possibly virtual database table" and combine these two sentences into one. The first time I read the sentence I thought it was talking about database views and database tables.

The first sentence is not intended as a definition but as an explanation of the role that logical tables play within R2RML. I considered just dropping the first sentence, but I'd like to keep some information on their role. I changed it to the following – do you think that's better?

[[
A logical table is a tabular SQL query result that is to be mapped to RDF triples. A logical table is either…
]]

> Section 5.1
> 
> * "A SQL base tables or views is represented by a resource that has exactly one rr:tableName property." - change "tables"->"table" and "views"->"view"
> * "If no catalog or schema are specified" - change "are"->"is"

Both fixed.

> * "If no catalog or schema are specified, then the default catalog and default schema of the SQL connection are assumed." - I think this is potentially confusing because it could be read to suggest that the R2RML implementor is supposed to do something wtih the default catalog/schema. But in fact the intent is for the identifier to be passed unchanged across the SQL connection.

I find it hard to improve the wording here. Wether the R2RML processor sends the identifier unchanged over the SQL connection, or modifies the identifier to include the name of the default catalog/schema, is really an implementation detail; the result would be the same anyway. Also, not all implementations necessarily work by sending SQL queries over a SQL connection; some R2RML processors might be part of a database engine, and might directly use internal mechanisms of the engine, so it wouldn't be quite right to specify behaviour in terms of an identifier that has to be sent over the SQL connection.

I'd prefer to leave this unchanged, unless you can propose some improved wording with the constraints above in mind.

> Section 5.2
> 
> * "A SQL query is a SELECT query in the SQL language that can be executed over the input database." - How about explicitly saying that it is a string representation of a SELECT query?

I tightened the definition of rr:sqlQuery a bit:

[[
… a resource that has exactly one rr:sqlQuery property, whose value is a literal with a lexical form that is a valid SQL query.
]]

And changed one of the sentences within the definition of “SQL query”:

[[
The string must conform to the production <direct select statement: multiple rows> in [SQL2]…
]]

> * "It MUST be a valid SQL query if executed over the SQL connection." - This could be read as "if a SQL query is executed over the SQL connection then it must be valid". I think the intent is that a SQL query must be a string that is valid to execute on the SQL connection as a query.

I changed the sentence to:

[[
It MUST be valid to execute over the SQL connection.
]]

> * "It MUST NOT have duplicate column names or unnamed derived columns in the SELECT list." - It seems obvious now but when I first read this I had trouble understanding what was meant by "unnamed derived columns". I think it means if the query includes expressions as projected columns then they must be named.

I split the sentence in two, and changed the second sentence to:

[[
Any columns in the SELECT list derived by projecting an expression must be named.
]]

Furthermore, I added an example after the definition:

[[
For example, the following SELECT query is not a valid R2RML SQL query because it (1) contains a duplicate DEPTNO column, and (2) contains an unnamed column derived from a COUNTexpression: …
]]

> * "For any database objects referenced without an explicit catalog name or schema name, the default catalog and default schema of the SQL connection are used." - Same comment as from previous section. I think this statement could be improved to clarify that an R2RML processor does not need to do anything to accomplish using the default catalog/schema.

See my comment above. I changed “the default XYZ are used” to “the default XYZ are assumed” to have a stronger connotation that it happens automatically. Beyond that I find it hard to improve the wording.

> Section 6
> 
> * "specifies how to generate the subjects for each row of the logical table" - I think this would be clearer if it said "generate a subject for each row". Otherwise it sounds like it might be possible for a subject map to generate multiple subjects from each row.
> * "together with the subjects generated by the subject map" - Similar to previous comment, changing "subjects"->"subject" I think removes ambiguity.

Both fixed.

> * "The referenced columns of all term maps of a triples map (subject map, predicate maps, object maps, graph maps) MUST be column names that exist in the term map's logical table." - Per the previous definition of "column name" this implies that the values cannot be qualified. Is that the intent? So a column reference in a mapping cannot be "EMP.JOB", it must be "JOB"?

Yes, in rr:column, rr:template etc only "JOB" should be allowed. But this is more tricky than I thought. I raised ISSUE-70 for this, we'll discuss and resolve it within the WG.
http://www.w3.org/2001/sw/rdb2rdf/track/issues/70

> Section 6.3
> 
> * "A predicate-object map is a function that creates predicate-object pairs from logical table rows." - Perhaps this would be clearer if it said "creates a predicate-object pair for each logical table row". This would avoid ambiguity about how many predicate-object pairs are created per logical table row.

Reworded to:

[[
… is a function that creates a predicate-object pair for each logical table row of a logical table.
]]

> Section 7.1
> 
> * "Occurrances of these properties..." - "Occurrances" -> "Occurrences"

Fixed

> * "aaa rr:subject bbb." - Using aaa and bbb as sample values makes the table messy, IMO. Would "x" and "y" serve the same purpose but look cleaner?

Changed to SPARQL-style “?x” and “?y”

> Section 7.2
> 
> * "the singleton set containing the value of rr:column" - For clarity, I think this should be changed to something like "... the value of _the rr:column property_". Otherwise it could be read as describing a set containing rr:column.

Changed to:

[[
… the singleton set containing the value of the term map's rr:column property.
]]

> Section 7.3
> 
> * "Backslash characters MUST be escaped by doubling them with another backslash character." - This reads ambiguously to me. It could be read as saying to use two or three backslashes. This possibility for confusion is compounded by the Turtle backslash handling. For clarity it could be changed to something like "Backslash characters MUST be escaped by preceding them with another backslash character."

Changed as suggested. I also modified the last example in the section to show the backslash doubling, and explicitly show various characters when they are mentioned: “\”, “\\” “{”, “}”

> * "If a template contains multiple pairs of unescaped curly braces, then adjacent pairs SHOULD be separated..." - I don't think the word "adjacent" is helpful here and it could lead to confusion. I think the word can simply be dropped.

As discussed at the F2F, the wording is changed to:

[[
… then any pair should be separated from the next one by …
]]

> * On first reading it is not clear how to produce an IRI from database columns without having the R2RML processor perform percent encoding. It seems to me to be worth adding a note mentioning that this can be done by creating a SQL query based logical table that includes an expression building up an IRI value from the database. This would avoid using a template and thus avoid the automatic percent encoding.

Added a Note:

[[
Note: R2RML always performs percent-encoding when IRIs are generated from string templates. If IRIs need to be generated without percent-encoding, then rr:column should be used instead ofrr:template, with an R2RML view that performs the string concatenation.
]]

> Section 7.6
> 
> * "...that does not have a specified langauge tag." - "langauge" -> "language"

Fixed.

> Section 7.7
> 
> * "A quoted and escaped data value is a SQL literal that can be used in a SQL query, such as.." - Seems to me that this needs a bit more definition. What does quoting mean? What does escaping mean?

Expanded this part:

[[
A quoted and escaped data value is a SQL string that matches the <literal> or <null specification> productions of [SQL2]. This string can be used in a SQL query to specify a SQL data value. Examples:

	• 27
	• 'foo'
	• 'foo''bar'
	• TRUE
	• DATE 2011-11-11
	• NULL
]]

> Section 8
> 
> * It would be good for this section to reference the place in section 11 that describes how triples are generated from the "joint SQL query". Otherwise this section appears to lead to a dead-end. It defines the joint SQL query but doesn't say what to do with it.

Added:

[[
The joint SQL query is used when generating RDF triples from referencing object maps.
]]

> * "SELECT * FROM ({child-query}) AS tmp" - I would change this to just: child-query or: {child-query}

The round braces are a necessary part of the SQL syntax; without them we'd end up with invalid joint SQL queries, e.g., “SELECT * FROM SELECT * FROM table1 AS tmp”.

I'd prefer to keep the curly braces too because throughout the spec, notation like {foo} is quite consistently used whenever some bit of a string needs replacing with another string.

> * "SELECT * FROM ({child-query}) AS child, ({parent-query}) AS parent" - From a SQL perspective, this feels wrong because it doesn't address the issue of column name collisions between the child and parent query. Is the thought that we can just ignore this because we describe in section 11 that the projected columns are split between child and parent?

I'm not sure if I thought this through when writing the section. I don't fully understand the need and motivation for the referencing object map feature, and don't have a clear picture of how it should operate.

I created ISSUE-71 for this.
http://www.w3.org/2001/sw/rdb2rdf/track/issues/71

> * "If the child query and parent query of a referencing object map are not identical, then the referencing object map MUST have at least one join condition." - I had to think a fair bit to come up with a scenario where the child and parent query would be the same and there would not be any join conditions. I finally concluded it would be a table like FAMILY, with HUSBAND and WIFE columns. Both the HUSBAND and the WIFE values become resources. Two TriplesMaps are defined on this table, one for Men and another for Women. A referencing object map could be used to generate the triples that tie these two resources together. This would result in a child query and parent query being identical with no join conditions. In the interest of making the spec understandable, I think it would be good to add an example like this.

Yes – this can be used to transform a denormalized DB schema into a normalized RDF representation. I've added an example:
http://www.w3.org/2001/sw/rdb2rdf/r2rml/#ex-ref-object-map2

> * I think it is worth noting in this section that the join condition between child and parent does not have to be a 1:1 relationship. It can be M:N.

Added a sentence to the first paragraph of the section:

[[
Since both triples maps may be based on different logical tables, this may require a join between the logical tables. This is not restricted to 1:1 joins.
]]

> Section 9.1
> 
> * "then the triples will share the same single blank node." - The word "single" seems un-needed and awkward in this sentence. I would remove it.

Done.

> 
> Section 10
> 
> * "and in hte" - change "hte" -> "the"

Fixed.

> Section 11
> 
> * "The output dataset MUST NOT contain any other RDF triples or named graphs besides these." - I am curious to hear the thought process behind this text. I assume it is to have well defined behavior for interop and mapping portability? In our R2RML implementation, we intend to generate additional triples that provide metadata about the generated triples. So this text seems overly restrictive to us. I can see a few ways to address our need: a) our system provides access to some other dataset which includes triples from the R2RML mapping as well as additional triples b) relax this text to allow the output dataset to include other named graphs besided those defined in the mapping c) relax this text completely to allow extra triples in the default graph and/or the named graphs defined by the mapping.

I added an additional sentence:

[[
The output dataset MUST NOT contain any other RDF triples or named graphs besides these. However, R2RML processors MAY provide access to datasets that contain additional triples or graphs beyond those in the output dataset, such as inferred triples or provenance information.
]]

> Section 11.1
> 
> * "Each generated triple is placed into one or more particular graphs" - I would strike the word "particular" from this sentence.
> 
> * "has no object map (but a referencing object map)" - This sentence was hard to parse on first read. I think it would be clearer if it said "(but _it does have_ a referencing...)"
> 
> * "has no referencing object map (but a normal object map)" - Same comment as previous comment.

All changed as suggested.

> * "Target graphs, a set of zero or more IRIs" - To my reading of the steps, the target graph is either a single IRI (i.e. rr:defaultGraph) or a set of IRIs. I don't think it is ever a set of size zero. Futhermore, I think the steps above need to be changed to produce singleton sets of rr:defaultGraph.
> * "For each possible combination <s, p, o>" - For me this is a confusing way to describe the process. I don't think it is helpful to describe the subjects, predicates, and objects as sets and then take the cross-product of them. As far as I can tell this description only works in the document because the subjects, predicates, and objects are only ever singleton sets. Furthermore the steps are not written to produce sets. So I would change the description of "adding triples to the output dataset" to define subject, predicate, and object as single values. Then taking those single values and generating triples for each value in Target graphs.
> * "A term map is a function that generates a set of RDF terms from a logical table row." - This statement is slightly different from the statement in section 7: "A term map is a function that generates an RDF term from a logical table row.". I think these two statements should match.
> * This section seems to have two different positions running through it:
>     a) term maps produce a set of RDF terms (where the set is either empty or a singleton set?) 
>     b) term maps produce an RDF term or nothing
>   It seems to me that we need to pick one or the other of these positions and make all the sentences consistent with that positions (both here and in earlier sections like section 7).

I acknowledge that cleanup related to these four comments is needed in Section 11. We plan to do this once ISSUE-68 and ISSUE-72 are resolved. We will track the cleanup task as ISSUE-73:
http://www.w3.org/2001/sw/rdb2rdf/track/issues/73

> * "adding triples to the output dataset" is defined as a term but the place where that term is referenced uses different text. This makes it hard to search the document for places whether the term is referenced. Therefore I propose using the same text in both the definition and the uses.

Done.

> Section 11.2
> 
> * "if the term map references a NULL value" - Since we have a definition for "referenced columns" we should use that text here and make it a link. Then the language could be tightened to something like "if any of the term map's referenced columns have a NULL value"

Now says this, and “referenced columns” is a link:

[[
The empty set – if any of the referenced columns of the term map has a NULL value,
]]

> Section B.2
> 
> * I found it a bit confusing to click on the property links (e.g. rr:child) and to be taken to another document (I expected the properties to be defined in the spec). Furthermore it is not clear to me (maybe I missed something?) whether this related document is normative: http://www.w3.org/ns/r2rml#
> * But, actually it is quite useful from a user and implementor perspective to see a list of which _sets_ of properties are required. Meaning that either rr:object or rr:objectMap is required. One way to address this would be to list the shortcuts separately and define what underlying property they imply.

I opened ISSUE-74 for this. The question is whether the B.2 table can be re-designed or augmented with additional text to make it more useful and avoid the click-through confusion.
http://www.w3.org/2001/sw/rdb2rdf/track/issues/74

> * "Note that additional constraints not stated in this table might apply, and making a property forbidden or required in certain situations." - This doesn't seem to be a grammatically correct sentence to me.

Changed:

[[
Note that additional constraints not stated in this table might apply, and the actual cardinality may depend on the presence or absence of other properties, and their values.
]]

> * With the addition of shortcuts, I think parts of this table are wrong. e.g. rr:objectMap is now an optional property, so the cardinality for it should be "0...1" rather than "1". I think this applies to several of the properties. 


Fixed.

> * "R2RML view" and "SQL base table or view" are two concepts that appear in the "Context" column but do not have corresponding classes in the schema. Both of these concepts are informal sub-types of LogicalTable. This seems like enough of a wrinkle that it warrants some additional explanation. I certainly had to spend several minutes tracing everything to find all the connections. Alternatively, why don't we create classes for these two concepts?
> * The referenced schema for rr:sqlQuery, rr:sqlVersion, and rr:tableName do not specify a domain for these properties. Don't these all have a domain of LogicalTable?

I have created classes rr:R2RMLView and rr:BaseTableOrView in the spec. They still need to be added to the schema, and will be used as the domain of the properties you mention. That's ACTION-166:
http://www.w3.org/2001/sw/rdb2rdf/track/actions/166

Best,
Richard
Received on Monday, 7 November 2011 17:04:55 UTC