Re: Comments on Eric's Section 2 from Eric Prud'hommeaux on 2010-11-09 (public-rdb2rdf-wg@w3.org from November 2010)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Mon, 8 Nov 2010 23:12:13 -0500
To: Richard Cyganiak <richard@cyganiak.de>
Cc: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-ID: <20101109041211.GA14982@w3.org>

* Richard Cyganiak <richard@cyganiak.de> [2010-11-07 12:13+0800]
> ...
> For the record: If the issues that I list below can be addressed,
> along with the three from my other email I sent earlier, then I
> support publication of an FPWD that consists of:
> 
> - Eric's sections 1 and 2
> - followed by Eric's set semantics based formal approach
> - and Juan/Marcelo's datalog based formal approach
> - with an issue box explaining that both of these are
> work-in-progress candidates for the formal semantics.

I think we're closing in on a common draft.

> And that's the last thing I intend to say about the direct mapping
> thingy until the three editors have managed to present the WG with a
> single version of the document endorsed by all of them.
> 
> Best,
> Richard
> 
> 
> Comments on Eric's draft
> 
> 1. Section 2.1 is IMHO unnecessary and confuses more than it helps.
> I would move its first two sentences into the Introduction, and
> remove the rest, in particular the SPARQL example. The same goes for
> the SPARQL example in 2.4, I would remove it. SPARQL query
> evaluation is a completely different topic and requires a ton of
> knowledge that is not essential for understanding the default
> mapping, so I honestly don't see how this helps the average reader.

I felt it oriented readers and helped the relatively large percentage
of prospective users who would exploit the direct graph via SPARQL.
OTOH, I'm not wedded to it; I'll just document the removed text:
[[
2.1.1 Use of the Direct Mapping

The Direct Mapping is intended to provide a default behavior for R2RML: RDB to RDF Mapping Language.
It can also be used to materialize RDF graphs, such as the one above, or define virtual graphs,
which can be queried by SPARQL or traversed by an RDF graph API. When used to define a virtual
graph, SPARQL queries over that virtual graph (below, left) may be executed as SQL queries (below,
right), and the results transformed to RDF terms and returned as SPARQL results.
┌────────────────────────────────────────────────────┐┌──────────────────────────────────────────────────────┐
│PREFIX People: <http://foo.example/DB/People#>      ││-- SQL capturing the SPARQL query's graph constraints.│
│PREFIX Addresses: <http://foo.example/DB/Addresses#>││SELECT People.fname AS name, Addresses.city           │
│SELECT ?name ?city                                  ││  FROM People                                         │
│ WHERE {                                            ││  JOIN Addresses                                      │
│    ?who People:fname ?name .                       ││       ON Addresses.ID=People.addr                    │
│    ?who People:addr ?address .                     ││ WHERE People.fname IS NOT NULL                       │
│    ?address Addresses:city ?city .                 ││   AND Addresses.city IS NOT NULL                     │
│ }                                                  ││                                                      │
└────────────────────────────────────────────────────┘└──────────────────────────────────────────────────────┘
]]
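
For the record, the SQL half of the removed example can be tried with a minimal Python sketch like the one below. It uses the standard sqlite3 module; the table definitions and the three rows are assumptions made up for illustration (the removed text does not give the data), and only the query itself comes from the box above.

```python
import sqlite3

# Hypothetical schema and rows; only the column names used by the
# query (fname, addr, ID, city) come from the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Addresses (ID INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE People (ID INTEGER PRIMARY KEY, fname TEXT,
                     addr INTEGER REFERENCES Addresses(ID));
INSERT INTO Addresses VALUES (18, 'Cambridge');
INSERT INTO People VALUES (7, 'Bob', 18);
INSERT INTO People VALUES (8, 'Sue', NULL);
""")

# The SQL from the right-hand box.
QUERY = """
SELECT People.fname AS name, Addresses.city
  FROM People
  JOIN Addresses ON Addresses.ID = People.addr
 WHERE People.fname IS NOT NULL
   AND Addresses.city IS NOT NULL
"""

def solutions(connection):
    """Run the SQL and return one dict per SPARQL-style solution."""
    return [{"name": name, "city": city}
            for name, city in connection.execute(QUERY)]

print(solutions(conn))
```

With these assumed rows, the person without an address produces no solution, which matches what the three SPARQL triple patterns would require.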

[[
2.5.1 Referencing Tables with Empty Primary Keys
...
The absence of a primary key forces the generation of blank nodes, but
does not change the structure of the direct graph or names of the
predicates in that graph. This example SPARQL query would find the
leads for Person 7's projects regardless of whether the Projects table
had a primary key or not. The following SPARQL query shows how the
graph constraints might be captured in an RDF query; the SQL query
shows how it might be executed as SQL:

┌───────────────────────────────────────────────────────────────────┐┌──────────────────────────────────────────────────────┐
│PREFIX asgn: <http://foo.example/DB/TaskAssignments#>              ││-- SQL capturing the SPARQL query's graph constraints.│
│PREFIX proj: <http://foo.example/DB/Projects#>                     ││SELECT lead                                           │
│SELECT ?lead                                                       ││FROM TaskAssignments                                  │
│ WHERE {                                                           ││JOIN Projects ON                                      │
│    ?assignment asgn:worker <http://foo.example/DB/People/ID=7#_> .││      Projects.project=TaskAssignments.name           │
│    ?assignment asgn:project_deptName_deptCity ?project .          ││  AND Projects.deptName=TaskAssignments.deptName      │
│    ?project proj:lead ?lead .                                     ││  AND Projects.deptCity=TaskAssignments.deptCity      │
│ }                                                                 ││WHERE TaskAssignments.worker=7                        │
└───────────────────────────────────────────────────────────────────┘└──────────────────────────────────────────────────────┘
]]
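
The point of that removed text (a missing primary key changes the row's node but not the predicates) can be sketched roughly as follows. This is an illustration only: the function names, the blank node labels, the "." separator for multi-column keys, and the sample row are all assumptions, and percent-encoding and foreign-key handling are left out. Only the stem, the "#_" suffix and the IRI shapes are taken from the examples above.

```python
from itertools import count

STEM = "http://foo.example/DB/"  # stem used in the examples above
_blank_ids = count(1)            # source of hypothetical blank node labels

def row_node(table, primary_key, row):
    """Row IRI if the table has a primary key, else a fresh blank node."""
    if not primary_key:
        return f"_:{table}{next(_blank_ids)}"
    # The "." separator for multi-column keys is an assumption.
    key = ".".join(f"{col}={row[col]}" for col in primary_key)
    return f"<{STEM}{table}/{key}#_>"

def column_predicate(table, column):
    return f"<{STEM}{table}#{column}>"

def reference_predicate(table, fk_columns):
    """Predicate for a foreign key: column names joined by '_'."""
    return column_predicate(table, "_".join(fk_columns))

def literal_triples(table, primary_key, row):
    """One triple per non-NULL column, all sharing the row's node."""
    subject = row_node(table, primary_key, row)
    return [(subject, column_predicate(table, col), str(val))
            for col, val in row.items() if val is not None]

row = {"project": "survey", "lead": 8}  # hypothetical Projects row
with_pk = literal_triples("Projects", ["project"], row)
without_pk = literal_triples("Projects", [], row)
print([p for _, p, _ in with_pk] == [p for _, p, _ in without_pk])
```

Because only the subject differs between the two runs, a query written against the predicates keeps working either way.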

> 2. Section 2.2: The predicate for reference triples is described as:
> “an IRI composed of the stem, table name and column name and value
> for each column in the foreign key”. I don't understand why it says
> “and value”? The object is described as: “the subject created for
> the referred triple”. Do you mean “referenced row”?

done

> 3. Please provide a rationale for the “#_” at the end of generated
> IRIs in the text. In my opinion, this is entirely unnecessary and a
> useless complication. I see there is an issue box for that in the
> document, that's great, but if you want to have the “#_” thing in
> the FPWD then there should be text stating why it is necessary. My
> proposal for FPWD would be to s/#_//g and state in the issue box
> that this is subject to more discussion.

expanded issue box for now
[[
Issue (hash-vs-slash):

This edition of this document presumes hash identifiers. There is
nothing in this specification that encourages or discourages offering
the direct graph as Linked Open Data. LOD data identifiers tend to use
slash, but that slightly increases implementation burden and round
trips.
]]

I propose to get consensus on your FPWD proposal first, then address
the #_ issue in all of the examples. An alternative viewpoint came
from DanC in his review:
[[
2010-10-29T17:43:19Z <DanC> yay for #_
]]
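
As a small illustration of what the hash form implies for a client, the sketch below splits a row identifier into the document to fetch and the fragment naming the row. It uses only the standard urllib.parse.urldefrag; the IRI is the one from the examples in this thread, and the reading that one fetch of the fragment-free document suffices is my gloss on the "round trips" remark, not something the draft states.

```python
from urllib.parse import urldefrag

row_iri = "http://foo.example/DB/People/ID=7#_"

# A client dereferencing a hash IRI requests the part before '#';
# the fragment is never sent to the server.
document, fragment = urldefrag(row_iri)
print(document)
print(fragment)
```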

> 4. Inconsistency: Section 2.2 states that predicate IRIs have
> hashes, while all the examples have slashes.

fixed (if we're speaking of the same place)

> 5. You should define the terms “row IRI” or “row identifier” and
> “column IRI”, and use them throughout, instead of saying sloppy
> things like “a IRI composed of the stem, table name and column name”
> or “the subject of the referenced row”. I think this is done pretty
> well in the directGraph/alt draft.

I think these appear only in the forward references to e.g. “row IRI”,
where they serve to indicate the rough construction of the identifiers.

> 6. Why a reference to [SQL99]? I thought we had agreed to use SQL
> Core 2008? You can copy the reference from the R2RML draft.

done

> 7. Both “URI” and “IRI” are used. I suppose it should be “IRI”
> everywhere?

Now only used in references to rdf-concepts.

> 8. In order to have an improved narrative in the section titles, I
> propose splitting 2.2 into one section “Identifiers for rows and
> columns” and one section “Row mapping rules”. (Not essential for
> FPWD)

I believe the current version has more structure and bolding than what
you reviewed. Has this addressed your comment?

> 9. Section 2.5: “Hierarchies” can refer to many things in an SQL
> context, so it's a bit hard to figure out what the section refers
> to. The first sentence should perhaps talk about “hierarchies of
> tables that represent specializations of the same concept” or
> something similar.

Is
[[
It is common to express specializations of some concept as multiple
tables sharing a common primary key.
]]
sufficient?

>                    The People table should perhaps be removed from
> the example, because it is not relevant to the example and makes
> understanding the relevant parts of the example harder.

done

> 10. Given that the question of many-to-many table mappings is an
> open issue, there should be at least a section about it that is
> empty except for an issue box. (I have more to say on this topic,
> but don't expect that discussion to be resolved before FPWD)

Added
[[
Issue (many-to-many-as-repeated-properties):

The direct graph is arguably more faithful to the conceptual model if
it reflects e.g. a person with multiple addresses (some many-to-many
Person2Address table) as repeated properties. It is difficult to
detect which tables with exactly two foreign keys and no other
attributes are many-to-many. As a counter example, a Wedding table may
have exactly two spouses but it's still not a many-to-many relation in
most places.
]]
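
To make the difficulty concrete, here is a minimal sketch of the purely structural test described in that issue text (exactly two foreign keys and no other attributes). The function name and all column names are hypothetical; the two table names are the ones mentioned above.

```python
def looks_many_to_many(columns, foreign_keys):
    """True if the table has exactly two foreign keys and every
    column belongs to one of them."""
    fk_columns = {col for fk in foreign_keys for col in fk}
    return len(foreign_keys) == 2 and set(columns) == fk_columns

# Hypothetical column layouts for the two tables named above.
person2address = looks_many_to_many(
    ["person", "address"], [("person",), ("address",)])
wedding = looks_many_to_many(
    ["spouse1", "spouse2"], [("spouse1",), ("spouse2",)])
print(person2address, wedding)
```

Both calls return True, so the schema alone cannot tell the Person2Address case from the Wedding counter-example.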

> 11. See my comments to Juan and Marcelo asking for inclusion of
> table IRIs and of a triple that associates each row to its table.
> I'd really like to see a proposal for this in the FPWD, but at least
> an issue box would be essential. I note that the directGraph/alt
> version already has this.

The foreign-key-is-candidate-key situation *appears* to imply that the
same node is defined across multiple tables; saying that it's an
Address or an Office won't give you the critical information, which is
which predicates came from which table. I propose instead:

<Offices#building> rdb2rdf:inTable <#Offices> .
and maybe
<Offices> rdb2rdf:inDatabase <> .

That way you can separate which triples came from which tables. You
can get the type effect where you want it by asserting
  <Offices#building> rdfs:domain <#Offices_row> .
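
A rough sketch of how such statements could be generated and then used to separate predicates by table. The helper names and the Addresses/city pair are assumptions for illustration; the rdb2rdf:inTable property and the relative IRI forms are the ones proposed above and are not part of any published vocabulary.

```python
def in_table_triples(table, columns):
    """One rdb2rdf:inTable statement per column predicate."""
    return [f"<{table}#{col}> rdb2rdf:inTable <#{table}> ."
            for col in columns]

def predicates_from(table, statements):
    """Predicates that the statements attribute to the given table."""
    suffix = f" rdb2rdf:inTable <#{table}> ."
    return [s.split(" ", 1)[0] for s in statements if s.endswith(suffix)]

statements = (in_table_triples("Offices", ["building"])
              + in_table_triples("Addresses", ["city"]))
print(statements[0])
print(predicates_from("Offices", statements))
```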
-- 
-ericP
Received on Tuesday, 9 November 2010 04:12:50 UTC