Re: CSV+ Direct Mapping candidate?

On 4 Mar 2014, at 02:54, David Booth <david@dbooth.org> wrote:
> On 03/01/2014 09:59 AM, Richard Cyganiak wrote:
>> David,
>> 
>> Let me first add one more clarification. I don't think of a Tarql
>> mapping as a CSV-to-RDF mapping. I think of it as a
>> logical-table-to-RDF mapping. Whether the table comes from CSV, TSV,
>> SAS, SPSS or relational doesn't matter, as long as we define a
>> sensible mapping from each of these syntaxes to a table of RDF terms
>> with named columns. These mappings are generally easy to define,
>> lossless, and don't add much arbitrary extra information.
>> 
>> It's worth pointing out that such a table of RDF terms with named
>> columns is the same logical structure as a SPARQL result, so it's
>> something that is already supported to various degrees in most RDF
>> toolkits.
>> 
>> Now, you said:
>> 
>>> The overall goal is to be able to predictably view all data
>>> uniformly as RDF.
>> 
>> I ask: why? What is the reason for wanting to do that?
> 
> Obviously the goal is for information integration -- to make use of RDF's value proposition.

But that usually requires integration on the schema level, not just on the data model level. So, it requires going all the way to C (schema-level integration), not just to B (directly mapped triples).

Without schema-level integration, all you can really do is explore the data to some extent. And while that's a good first step towards full integration, my experience is that you can’t really use the data in applications unless you go all the way to C. At the same time, by going from A to B, you’ve lost the ability to use the data in its native environment (e.g., you can’t open it in Excel, Refine, or SPSS). So, in my eyes, you’re in the negative at B, and only come out ahead once you get pretty close to C.

>> Especially,
>> why would you dissolve a cleanly structured table into a soup of
>> triples by inserting meaningless and arbitrary extra elements? What
>> do you win by doing that?
> 
> Quite the opposite.  The goal is to expose the table's intended information as RDF -- no more and no less.  It is not to insert meaningless or arbitrary extra elements,

But it requires some arbitrary decisions. There are many ways to turn the same table into different graphs. How do you turn column names into property URIs? Is each cell a resource, or just the rows? And while the answers to these questions can be standardised (as you propose), they *do* insert extra information that isn’t a meaningful part of the original data, but that the query writer now needs to be aware of.
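To make the arbitrariness concrete, here is a sketch of two equally defensible “direct” graphs for the same one-row table with columns `name` and `age` (all URIs and conventions here are invented for illustration, not any particular proposal):

```turtle
# Variant 1: rows as blank nodes, every cell a plain string literal
[] <#name> "Alice" ; <#age> "30" .

# Variant 2: rows as minted URIs, numeric-looking cells as typed literals
<#row1> <#name> "Alice" ;
        <#age>  30 .
```

A query containing `FILTER (?age > 29)` matches variant 2 but silently returns nothing against variant 1, because `"30"` is a plain literal. Whichever convention a standard picks, the query writer has to know it.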

> nor to discard potentially meaningful information.

Well, I will argue further below that it *does* discard meaningful information—the strong constraints inherent in the tabular structure.

>> I observe that if your toolkit supports SPARQL results and Tarql,
>> then it can already display tables of RDF terms, query them, and
>> transform them to arbitrary graphs according to a mapping. What else
>> would you want to do with them that requires the
>> direct-mapping-to-triple-soup?
> 
> Semantic transformations: transforming from one data model to another. This is almost always needed when integrating data.

Are you saying that you only want to go from A to B as an intermediate step to C? If that is the case, then I maintain my view that going through B is an unnecessary complication, and the focus should be on getting from A to C with as few intermediate steps as possible.

Are there other reasons why you want the direct mapped triples *besides* writing RDF-to-RDF mappings against them?

>> I think I'd rather like to see an RDF ecosystem with excellent
>> support for tables as well as graphs, while you'd rather like to see
>> an RDF ecosystem where everyone treats everything as a graph, even if
>> it's not a graph.
> 
> I want to decouple syntactic lift from semantic transformations, so that all semantic transformations can be uniformly performed in RDF.  I would rather not have to deal with a different semantic transformation language for each data format.  

Okay, this explanation makes sense to me.

But I think this again mischaracterises the Tarql approach by implying that it’s part of an agenda that requires a different transformation language for each data format.

First, the Tarql approach is not for *one* data format, but easily addresses a large number of data formats, and can easily address even more with slight extensions. That’s because it is a transformation language for *logical tables*. The majority of data formats out there work with such an abstract tabular model.

Second, Tarql is a syntactic subset of SPARQL (though it should probably become a small superset), and it requires only slight additions to SPARQL’s semantics. So it’s not a new language. For context, the difference between Tarql and SPARQL is tiny compared to the difference between SPARQL 1.0 and 1.1.
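For illustration, a Tarql-style mapping is just a SPARQL CONSTRUCT query in which the column headers of the input table are bound as variables (the file name and vocabulary below are invented for the example):

```sparql
PREFIX ex: <http://example.com/ns#>

# Hypothetical input: a CSV with header row "id,name"
CONSTRUCT {
  ?uri a ex:Person ;
       ex:name ?name .
}
FROM <file:people.csv>
WHERE {
  BIND (URI(CONCAT('http://example.com/person/', ?id)) AS ?uri)
}
```

Any toolkit that already parses SPARQL needs only minor changes to process this, which is the point: the mapping language is not new, only the source of the variable bindings is.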

> I prefer to factor out the task of syntactic lift, from the task of semantic transformation, so that: (a) the same semantic transformation language and tools can be used regardless of the data's source format; and (b) model bias will not be introduced into RDF that is exposed.  I'll explain more what I mean in a separate reply to Gregg and Andy.
> 
> As you know, RDF can perfectly well represent tabular data,

This is not true in my experience. RDF is not particularly good at representing tabular data. The loss of the strong guarantees of the schema means that the data is now harder to understand and must be handled much more carefully.

The strong constraints inherent in the tabular model *are part of the data*. Conversion to RDF *discards* these constraints. Yes, sometimes discarding constraints is sensible for data integration (you might need to knock down some walls to extend a house), but it has consequences.
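A small example of what gets discarded (URIs invented): the table guarantees exactly one value per row and column, but nothing in the resulting RDF enforces that.

```turtle
# In the table, every row has exactly one value in each column.
# After conversion, RDF itself enforces none of that:
<#row1> <#age> 30 .           # fine
<#row2> <#age> 30, 45 .       # also fine: two ages
<#row3> <#name> "Carol" .     # also fine: no age at all
```

Consumers of the triples must now defend against all three shapes, where a consumer of the table could rely on the structure.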

Best,
Richard


> hierarchical data and any other form of data, so to my mind there need not be any conflict between providing excellent support for tabular data AND being able to uniformly operate at the RDF level -- provided that tabular data can be predictably viewed as RDF.  Again, I'll try to explain more what I mean in my reply to Gregg and Andy.
> 
> Thanks,
> David Booth

Received on Tuesday, 4 March 2014 08:14:17 UTC