- From: David McNeil <dmcneil@revelytix.com>
- Date: Sat, 29 Oct 2011 10:26:05 -0500
- To: public-rdb2rdf-comments@w3.org
- Message-ID: <CA+8VvdxKuaoU0uKNhcWRRami+0OjYN0-9=WyhDwAoJZ6uKUojw@mail.gmail.com>
I reviewed the Last Call version of the R2RML spec. My comments are below. -David ==== Section 4 * "The base IRI MUST be a valid IRI. It SHOULD end in a slash (“/”) character." - I would have thought this would say that it should end in an IRI delimiter, rather than specifically referring to a slash. * "Resolution of relative IRIs in R2RML uses simple string concatenation instead of the more complex algorithm defined in RFC 3986. This ensures that the original database value can be reconstructed from the generated IRI." - I did not understand this note until I read the corresponding note in section 11.2. I suppose that is because the latter note uses the phrase "obtaining an absolute IRI from a relative IRI", which has a more obvious, plain meaning than the phrase "resolution of relative IRIs". So I would propose changing this note to use the phrasing from the note in section 11.2. * "An R2RML processor MAY include an R2RML data validator, but this is not required." - This seems to obscure the requirement from section 4.3 that an R2RML processor must also detect the same errors that an R2RML data validator detects. Section 4.1 * "SHOULD NOT contain any mapping components (logical tables, term maps, predicate-object maps) that are not referenced by a triples map (in other words, are “unused”)" - I think the list of mapping components should either be complete, or formally defined elsewhere. If I am reading it correctly, it omits join condition and referencing object map. This seems like a needless lack of clarity in the spec. * "MAY assign IRIs or blank node identifiers to any mapping component in order to enable re-use of mapping components within the mapping graph." - I see that the term "mapping component" is used again here, so maybe it does warrant a formal definition. I see the term "mapping constructs" is used just below this. Are these the same thing? * "For example, an IRI that represents a subject map may be used as the subject map of multiple triples maps; and may even be used as an object map of another triples map if it has the right properties." - I notice that this is not an exhaustive list of how resources may be reused. Since it says "For example", does this count as non-normative? I only raise these questions. I am not enough of a spec-lawyer to know how best to handle this. * "Using these classes is OPTIONAL in a mapping graph." - I think this wording could lead to confusion as it could be read to mean that an R2RML mapping could be defined without a TriplesMap. I think what is meant, as the next sentence clarifies, is that the triples defining the types of the mapping components are optional. * Since it doesn't say that an R2RML mapping graph MUST contain at least one TriplesMap, the implication is that an empty mapping graph that produces no triples is valid R2RML. I just mention this to confirm that is the intent. * "The applicable class of a resource can always be inferred from its properties." - I think this captures a requirement on implementors that should be made more explicit. To get started processing an R2RML mapping, implementors must infer that a resource is a TriplesMap based on the presence of any of the mapping predicates which are required and have a domain of TriplesMap. Once you have the TriplesMaps then the rest of the types are implied by their position in the graph. This also raises the question of whether it should be required that TriplesMaps resources be explicitly identified as "a TriplesMap". Section 4.2 * "The preferred file extension is .ttl." - Is preferred a weaker form of SHOULD? * "It is common to use document-local IRIs in mapping documents by defining the default prefix in the beginning of the document, and using it for creating IRIs for mapping components such as triples maps" - This note does not seem like it belongs in the spec. Section 4.3 * "A data error is a condition of the data in the input database that would lead to the generation of an invalid RDF term, such as an invalid IRI or an ill-typed literal." - This seems like quite a loose definition. Since we have a more formal definition just below, can we strike this? * "A term map with a datatype override produces an ill-typed literal of a supported RDF datatype." Where "ill-typed" is defined in section 10 as "A typed literal of a supported RDF datatype is ill-typed if its lexical form is not in the lexical space of the RDF datatype identified by its datatype IRI." - Doesn't this definition of a mandatory error contradict our stated intent of: if data values fall outside of the intersection of SQL data types and XSD data types then the behavior is undefined (but throwing an error is not required)? I don't remember seeing a place in the spec that captures our intent regarding the intersection of SQL and XSD data types. * "A logical table is a possibly virtual database table that is to be mapped to RDF triples. A logical table is either..." - I would remove the loose definition of "a possibly virtual database table" and combine these two sentences into one. The first time I read the sentence I thought it was talking about database views and database tables. Section 5.1 * "A SQL base tables or views is represented by a resource that has exactly one rr:tableName property." - change "tables"->"table" and "views"->"view" * "If no catalog or schema are specified" - change "are"->"is" * "If no catalog or schema are specified, then the default catalog and default schema of the SQL connection are assumed." - I think this is potentially confusing because it could be read to suggest that the R2RML implementor is supposed to do something wtih the default catalog/schema. But in fact the intent is for the identifier to be passed unchanged across the SQL connection. Section 5.2 * "A SQL query is a SELECT query in the SQL language that can be executed over the input database." - How about explicitly saying that it is a string representation of a SELECT query? * "It MUST be a valid SQL query if executed over the SQL connection." - This could be read as "if a SQL query is executed over the SQL connection then it must be valid". I think the intent is that a SQL query must be a string that is valid to execute on the SQL connection as a query. * "It MUST NOT have duplicate column names or unnamed derived columns in the SELECT list." - It seems obvious now but when I first read this I had trouble understanding what was meant by "unnamed derived columns". I think it means if the query includes expressions as projected columns then they must be named. * "For any database objects referenced without an explicit catalog name or schema name, the default catalog and default schema of the SQL connection are used." - Same comment as from previous section. I think this statement could be improved to clarify that an R2RML processor does not need to do anything to accomplish using the default catalog/schema. Section 6 * "specifies how to generate the subjects for each row of the logical table" - I think this would be clearer if it said "generate a subject for each row". Otherwise it sounds like it might be possible for a subject map to generate multiple subjects from each row. * "together with the subjects generated by the subject map" - Similar to previous comment, changing "subjects"->"subject" I think removes ambiguity. * "The referenced columns of all term maps of a triples map (subject map, predicate maps, object maps, graph maps) MUST be column names that exist in the term map's logical table." - Per the previous definition of "column name" this implies that the values cannot be qualified. Is that the intent? So a column reference in a mapping cannot be "EMP.JOB", it must be "JOB"? Section 6.3 * "A predicate-object map is a function that creates predicate-object pairs from logical table rows." - Perhaps this would be clearer if it said "creates a predicate-object pair for each logical table row". This would avoid ambiguity about how many predicate-object pairs are created per logical table row. Section 7.1 * "Occurrances of these properties..." - "Occurrances" -> "Occurrences" * "aaa rr:subject bbb." - Using aaa and bbb as sample values makes the table messy, IMO. Would "x" and "y" serve the same purpose but look cleaner? Section 7.2 * "the singleton set containing the value of rr:column" - For clarity, I think this should be changed to something like "... the value of _the rr:column property_". Otherwise it could be read as describing a set containing rr:column. Section 7.3 * "Backslash characters MUST be escaped by doubling them with another backslash character." - This reads ambiguously to me. It could be read as saying to use two or three backslashes. This possibility for confusion is compounded by the Turtle backslash handling. For clarity it could be changed to something like "Backslash characters MUST be escaped by preceding them with another backslash character." * "If a template contains multiple pairs of unescaped curly braces, then adjacent pairs SHOULD be separated..." - I don't think the word "adjacent" is helpful here and it could lead to confusion. I think the word can simply be dropped. * On first reading it is not clear how to produce an IRI from database columns without having the R2RML processor perform percent encoding. It seems to me to be worth adding a note mentioning that this can be done by creating a SQL query based logical table that includes an expression building up an IRI value from the database. This would avoid using a template and thus avoid the automatic percent encoding. Section 7.6 * "...that does not have a specified langauge tag." - "langauge" -> "language" Section 7.7 * "A quoted and escaped data value is a SQL literal that can be used in a SQL query, such as.." - Seems to me that this needs a bit more definition. What does quoting mean? What does escaping mean? Section 8 * It would be good for this section to reference the place in section 11 that describes how triples are generated from the "joint SQL query". Otherwise this section appears to lead to a dead-end. It defines the joint SQL query but doesn't say what to do with it. * "SELECT * FROM ({child-query}) AS tmp" - I would change this to just: child-query or: {child-query} * "SELECT * FROM ({child-query}) AS child, ({parent-query}) AS parent" - >From a SQL perspective, this feels wrong because it doesn't address the issue of column name collisions between the child and parent query. Is the thought that we can just ignore this because we describe in section 11 that the projected columns are split between child and parent? * "If the child query and parent query of a referencing object map are not identical, then the referencing object map MUST have at least one join condition." - I had to think a fair bit to come up with a scenario where the child and parent query would be the same and there would not be any join conditions. I finally concluded it would be a table like FAMILY, with HUSBAND and WIFE columns. Both the HUSBAND and the WIFE values become resources. Two TriplesMaps are defined on this table, one for Men and another for Women. A referencing object map could be used to generate the triples that tie these two resources together. This would result in a child query and parent query being identical with no join conditions. In the interest of making the spec understandable, I think it would be good to add an example like this. * I think it is worth noting in this section that the join condition between child and parent does not have to be a 1:1 relationship. It can be M:N. Section 9.1 * "then the triples will share the same single blank node." - The word "single" seems un-needed and awkward in this sentence. I would remove it. Section 10 * "and in hte" - change "hte" -> "the" Section 11 * "The output dataset MUST NOT contain any other RDF triples or named graphs besides these." - I am curious to hear the thought process behind this text. I assume it is to have well defined behavior for interop and mapping portability? In our R2RML implementation, we intend to generate additional triples that provide metadata about the generated triples. So this text seems overly restrictive to us. I can see a few ways to address our need: a) our system provides access to some other dataset which includes triples from the R2RML mapping as well as additional triples b) relax this text to allow the output dataset to include other named graphs besided those defined in the mapping c) relax this text completely to allow extra triples in the default graph and/or the named graphs defined by the mapping. Section 11.1 * "Each generated triple is placed into one or more particular graphs" - I would strike the word "particular" from this sentence. * "has no object map (but a referencing object map)" - This sentence was hard to parse on first read. I think it would be clearer if it said "(but _it does have_ a referencing...)" * "has no referencing object map (but a normal object map)" - Same comment as previous comment. * "Target graphs, a set of zero or more IRIs" - To my reading of the steps, the target graph is either a single IRI (i.e. rr:defaultGraph) or a set of IRIs. I don't think it is ever a set of size zero. Futhermore, I think the steps above need to be changed to produce singleton sets of rr:defaultGraph. * "For each possible combination <s, p, o>" - For me this is a confusing way to describe the process. I don't think it is helpful to describe the subjects, predicates, and objects as sets and then take the cross-product of them. As far as I can tell this description only works in the document because the subjects, predicates, and objects are only ever singleton sets. Furthermore the steps are not written to produce sets. So I would change the description of "adding triples to the output dataset" to define subject, predicate, and object as single values. Then taking those single values and generating triples for each value in Target graphs. * "adding triples to the output dataset" is defined as a term but the place where that term is referenced uses different text. This makes it hard to search the document for places whether the term is referenced. Therefore I propose using the same text in both the definition and the uses. Section 11.2 * "A term map is a function that generates a set of RDF terms from a logical table row." - This statement is slightly different from the statement in section 7: "A term map is a function that generates an RDF term from a logical table row.". I think these two statements should match. * This section seems to have two different positions running through it: a) term maps produce a set of RDF terms (where the set is either empty or a singleton set?) b) term maps produce an RDF term or nothing It seems to me that we need to pick one or the other of these positions and make all the sentences consistent with that positions (both here and in earlier sections like section 7). * "if the term map references a NULL value" - Since we have a definition for "referenced columns" we should use that text here and make it a link. Then the language could be tightened to something like "if any of the term map's referenced columns have a NULL value" Section B.2 * I found it a bit confusing to click on the property links (e.g. rr:child) and to be taken to another document (I expected the properties to be defined in the spec). Furthermore it is not clear to me (maybe I missed something?) whether this related document is normative: http://www.w3.org/ns/r2rml# * "Note that additional constraints not stated in this table might apply, and making a property forbidden or required in certain situations." - This doesn't seem to be a grammatically correct sentence to me. * With the addition of shortcuts, I think parts of this table are wrong. e.g. rr:objectMap is now an optional property, so the cardinality for it should be "0...1" rather than "1". I think this applies to several of the properties. But, actually it is quite useful from a user and implementor perspective to see a list of which _sets_ of properties are required. Meaning that either rr:object or rr:objectMap is required. One way to address this would be to list the shortcuts separately and define what underlying property they imply. * "R2RML view" and "SQL base table or view" are two concepts that appear in the "Context" column but do not have corresponding classes in the schema. Both of these concepts are informal sub-types of LogicalTable. This seems like enough of a wrinkle that it warrants some additional explanation. I certainly had to spend several minutes tracing everything to find all the connections. Alternatively, why don't we create classes for these two concepts? * The referenced schema for rr:sqlQuery, rr:sqlVersion, and rr:tableName do not specify a domain for these properties. Don't these all have a domain of LogicalTable?
Received on Saturday, 29 October 2011 15:26:36 UTC