Re: CSV+ Direct Mapping candidate?

Hi Gregg, Andy, Richard and others,

On 03/02/2014 12:57 PM, Gregg Kellogg wrote:
> On Mar 2, 2014, at 9:43 AM, Andy Seaborne <andy@apache.org> wrote:
>>
>> On 01/03/14 14:59, Richard Cyganiak wrote:
>>> David,
>>>
>>> Let me first add one more clarification. I don't think of a
>>> Tarql mapping as a CSV-to-RDF mapping. I think of it as a
>>> logical-table-to-RDF mapping. Whether the table comes from CSV,
>>> TSV, SAS, SPSS or relational doesn't matter, as long as we define
>>> a sensible mapping from each of these syntaxes to a table of RDF
>>> terms with named columns. These mappings are generally easy to
>>> define, lossless, and don't add much arbitrary extra
>>> information.
>>
>> +1 to having this step brought out explicitly.  We can deal with
>> syntax to RDF terms step, involving syntax details and any
>> additional information to guide choice of datatypes (is 2014 a
>> string, an integer, a Gregorian year?), and then have a step of
>> putting into RDF, whether direct or mapped.
>
> +1 too.
>
> IMO, 2014 is an integer, "2014" is a string. Column metadata should
> be able to type field as datatyped literal, reference or identifier.
>
> Direct mapping simply generates either anonymous records, or records
> identified by fragid, also using fragids to compose properties based
> on column names: simplest possible transformation to triples in the
> absence of metadata. Mapping metadata allows more sophisticated
> mappings.

Correct, but therein also lies the danger.

I'm afraid the following explanation is rather long, and may be obvious 
to many -- and if so, I apologize -- but I want to be sure that I'm 
being as clear as possible, because I realized that I was *not* 
sufficiently clear in the use case that I submitted, plus I've learned 
(the hard way) that different people sometimes come into these efforts 
with quite different expectations and objectives.  What one person 
thinks is obvious may not be at all obvious to someone else who has 
different objectives.

The reason I like the Direct Mapping approach is that it cleanly factors 
out the simple, syntactic mapping from the semantic transformations that 
are needed to achieve alignment with some target model.  It then allows 
*all* of the semantic mappings to be done in the same language, 
regardless of source format, rather than mixing the syntactic mapping 
with the semantic alignment step.  I like this because, to my mind, when 
integrating data from diverse sources, there will almost always be 
semantic transformations needed to achieve semantic alignment, and I 
would rather use a single, common way to do those semantic 
transformations than have several ways to do them, each specific to a 
particular source format.  (One question, though: where is the line 
between syntactic and semantic transformation?  I'll come back to that 
later.)

Suppose an application uses a particular **target RDF model** or 
ontology, i.e., the application expects its input data to use certain 
classes, predicates, namespaces, and other usage patterns.   To consume 
data from a particular source, two things logically need to happen: (a) 
syntactic mapping to convert the data from whatever format it is in, to 
RDF; and (b) semantic mapping to align the source data model with the 
target RDF model.  These two logical steps can be done either as a 
single physical step or as more than one physical step.  In general, the 
publisher of the source data has no knowledge of the application that is 
consuming that data, and hence cannot be expected to provide sufficient 
metadata that would map the source data all the way to that 
application's target RDF model.  Indeed, there may be *many* such 
applications, each one with its own target RDF model.  Hence, all the 
publisher can do is (at most) provide metadata that allows a consumer 
to automatically map the source data to a **source RDF model**.

There are various ways the publisher can conceive of the source RDF 
model.  In general, the more complex the mapping, the more it becomes 
biased toward a particular assumed application, rather than simply 
reflecting the intended meaning of the published data.   Ideally, the 
publisher should supply metadata that expresses the intended meaning of 
the published data, *without* biasing it toward any particular 
application.  Metadata that can be automatically and deterministically 
discovered (merely by following standards) could certainly include 
mapping rules that are intended to serve exactly this purpose, and I 
assume that that is what you and others have in mind by defining a 
standard way to locate and represent CSV+ metadata.  To my mind, this 
should be a major focus of the working group's efforts.

So far so good.  But when something like Tarql is used, there is no 
clean division between the syntactic mapping and the semantic mapping, 
because Tarql allows *any* semantic transformation to be performed. 
This can be both good and bad.  Obviously it can be useful to have that 
expressive power.  But there is also a danger that the publisher may 
commingle the task of exposing the meaning of that data with the task 
of aligning its RDF model to that of a particular application, rather 
than cleanly separating the two.  Indeed, if the publisher also has an 
application that needs to consume the data as RDF, then the publisher 
will be strongly tempted to do so, because it will simplify his/her 
immediate task.  But doing so may introduce model bias that 
makes it harder for other applications to use the data --  particularly 
if the resulting model involves information loss.  This can easily 
happen if the application that the publisher has in mind does not need 
some of the information that is in the data, and it can happen 
completely unconsciously, as the publisher may not have conceived of the 
many creative uses to which that data may be put.  (Of course, it could 
also make the data *easier* for other applications to use, if those 
other applications happen to use target RDF models that are the same or 
almost the same as the model used by the publisher's application.)
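
To make the danger concrete, here is a purely illustrative sketch of a 
Tarql-style mapping over a hypothetical table with columns EmployeeId, 
FullName and Department (the column names, URIs and vocabularies below 
are all invented).  The single CONSTRUCT query both converts the table 
to RDF and reshapes it for one particular application, silently 
dropping a column along the way:

   PREFIX app: <http://example.org/my-app#>
   CONSTRUCT {
     # app:Customer and app:label are application-specific choices
     ?person a app:Customer ;
             app:label ?FullName .
     # the Department column is never mentioned, so that information
     # is simply lost in the published RDF
   }
   WHERE {
     # Tarql binds each column to a variable (here ?EmployeeId and
     # ?FullName, taken from the assumed header row)
     BIND (URI(CONCAT('http://example.org/my-app/customer/',
                      ?EmployeeId)) AS ?person)
   }

Nothing in the query itself marks which parts faithfully expose the 
table's meaning and which parts align it with the application's model; 
the two concerns are fused into one step.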

In spite of such a danger, it is only reasonable to assume that the 
publisher best knows the intended meaning of his/her data, and thus in 
some sense we simply have to trust him/her to exercise good judgement in 
publishing metadata that as faithfully as possible reflects the intended 
meaning of the data without bias toward any particular target model. 
Following this line of thinking, one could reasonably argue that 
publishers should have the power, when writing their CSV+ metadata, to 
specify arbitrary semantic transformations even though that power may 
sometimes be abused.

But do CSV+ publishers *need* the power to express arbitrary semantic 
transformations, which Tarql (or SPARQL) provides, just to expose the 
intended meaning of a table?  I'm not sure that they do.  I'm hoping 
that a more constrained, declarative form will suffice for the simple 
task of exposing the data's intended meaning.   This indeed may be what 
most members of the working group have in mind already, and if so then 
that's great, but I thought I should bring it up explicitly.
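
As a rough illustration of what I mean by "constrained and 
declarative", here is a hypothetical bit of column metadata in Turtle. 
The ex: vocabulary is invented purely for this sketch (it is not a 
proposal for the actual metadata format), but the point is that it is 
limited to per-column annotations such as naming and datatyping, with 
no facility for arbitrary restructuring:

   @prefix ex:  <http://example.org/csv-metadata#> .
   @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

   <#table> ex:column
       [ ex:name     "EmployeeId" ;
         ex:property <http://example.org/hr#employeeId> ;
         ex:datatype xsd:integer ] ,
       [ ex:name     "HireDate" ;
         ex:property <http://example.org/hr#hireDate> ;
         ex:datatype xsd:date ] .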

I also think there's another factor that should be considered, which 
I'll try to illustrate with two scenarios.

In scenario #1, a data consumer locates a published CSV+ table that has 
*no* accompanying metadata.  A CSV+ Direct Mapping is applied to 
interpret that table as RDF.  Two mappings are then crafted to transform 
that RDF to RDF that is semantically aligned with the consumer's target 
model: the first mapping, MS, transforms the directly mapped RDF to -- 
hopefully -- the publisher's intended source model by replacing default 
URI prefixes and datatypes with intended prefixes and datatypes, and 
maybe a little more; the second mapping, MT, transforms that source 
model to the consumer's target model.  Clearly MT must be written by the 
consumer, but MS might actually be written by the publisher, or someone 
who at least understands the data's meaning.
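
To sketch what that pipeline might look like (everything below is 
hypothetical: the dm: default conventions of the direct mapping, the 
row fragment identifier, and the hr: and tgt: vocabularies are all 
invented), the directly mapped RDF for one row might be:

   @prefix dm: <http://example.org/direct#> .

   <#row=1> dm:EmployeeId "101" ;
            dm:HireDate   "2014-03-04" .

MS could then be a SPARQL CONSTRUCT rule that replaces the default 
terms and plain literals with the publisher's intended source model:

   PREFIX dm:  <http://example.org/direct#>
   PREFIX hr:  <http://example.org/hr#>
   PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
   CONSTRUCT {
     ?row hr:employeeId ?id ;
          hr:hireDate   ?date .
   }
   WHERE {
     ?row dm:EmployeeId ?rawId ;
          dm:HireDate   ?rawDate .
     BIND (xsd:integer(?rawId)       AS ?id)
     BIND (STRDT(?rawDate, xsd:date) AS ?date)
   }

and MT a further CONSTRUCT rule, written by the consumer, that aligns 
the source model with the consumer's target model:

   PREFIX hr:  <http://example.org/hr#>
   PREFIX tgt: <http://example.org/target#>
   CONSTRUCT { ?row a tgt:Employee ; tgt:id ?id . }
   WHERE     { ?row hr:employeeId ?id . }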

In scenario #2, the publisher of that same CSV+ table installs 
accompanying metadata for that spreadsheet.  In doing so, it would be 
nice if: (a) the publisher could simply install mapping MS from scenario 
#1 in the right location without change (assuming MS does indeed reflect 
the intended source model); (b) by following the W3C standard, the 
consumer would then view the data as RDF that reflects the intended 
source model; (c) the consumer could still use mapping MT (unchanged) to 
transform from the source model to the consumer's target model; and (d) 
MS and MT are written in the same language.  In other words, it would be 
nice if these mappings did not have to be rewritten just because MS is 
moved to accompany the published table.

BTW, in going through this explanation, it occurs to me that I was not 
sufficiently clear in my use case description
http://lists.w3.org/Archives/Public/public-csv-wg-comments/2014Feb/0007.html
because I only addressed the case in which there is no metadata 
available.  I hope that this lengthy explanation has helped to clarify 
my goals.  In particular, I hope:

  - that there will be a standard, deterministic mapping from *any* 
published CSV+ table to RDF;

  - that such a mapping will use any associated authoritative metadata 
to best capture the publisher's intended meaning of the data;

  - that syntactic mappings are *decoupled* from semantic mappings;

  - that there is a CSV+ Direct Mapping style that prevents or 
discourages model bias (as described above) by preventing or 
discouraging semantic mappings in the metadata; and

  - that semantic mappings are SPARQL-rules-friendly, either by using 
SPARQL conventions or by using conventions that can be conveniently 
used from SPARQL.

In drawing this distinction, I am viewing semantic mappings as 
transformations from RDF to RDF.  I think there is some gray area in 
what operations should be categorized as syntactic mappings versus 
semantic mappings, as some might legitimately be viewed as both.  Rules 
of thumb that come to mind for semantic mappings (a small sketch 
follows the list):

  - those that are also used to achieve model alignment, i.e., they are 
not solely used for one data format; or

  - those that change the structure of the data model, rather than just 
the terms.
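
As a small (and again entirely invented) illustration of that second 
rule of thumb: a rule that merely renames terms,

   CONSTRUCT { ?s <http://example.org/hr#hireDate> ?o }
   WHERE     { ?s <http://example.org/direct#HireDate> ?o }

sits in the gray area, while a rule that changes the shape of the 
data, for example turning a flat per-row record into two linked 
resources,

   PREFIX hr:  <http://example.org/hr#>
   PREFIX tgt: <http://example.org/target#>
   CONSTRUCT { ?row tgt:employment [ tgt:start ?date ] . }
   WHERE     { ?row hr:hireDate ?date . }

is, to my mind, clearly a semantic mapping.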

Again, I apologize for the length of this explanation, but I hope it has 
added more clarity.

Thanks!
David Booth
