Re: CSV+ Direct Mapping candidate?

Hi Richard,

On 03/04/2014 03:13 AM, Richard Cyganiak wrote:
> On 4 Mar 2014, at 02:54, David Booth <david@dbooth.org> wrote:
>> On 03/01/2014 09:59 AM, Richard Cyganiak wrote:
>>> David,
>>>
>>> Let me first add one more clarification. I don't think of a
>>> Tarql mapping as a CSV-to-RDF mapping. I think of it as a
>>> logical-table-to-RDF mapping. Whether the table comes from CSV,
>>> TSV, SAS, SPSS or relational doesn't matter, as long as we define
>>> a sensible mapping from each of these syntaxes to a table of RDF
>>> terms with named columns. These mappings are generally easy to
>>> define, lossless, and don't add much arbitrary extra
>>> information.
>>>
>>> It's worth pointing out that such a table of RDF terms with
>>> named columns is the same logical structure as a SPARQL result,
>>> so it's something that is already supported to various degrees in
>>> most RDF toolkits.
>>>
>>> Now, you said:
>>>
>>>> The overall goal is to be able to predictably view all data
>>>> uniformly as RDF.
>>>
>>> I ask: why? What is the reason for wanting to do that?
>>
>> Obviously the goal is for information integration -- to make use of
>> RDF's value proposition.
>
> But that usually requires integration on the schema level, not just
> on the data model level. So, it requires going all the way to C, not
> just to B.
>
> Without schema-level integration, all you can do really is explore
> the data to some extent. And while that's a good first step towards
> achieving full integration, my experience is that you can’t really
> use the data in applications unless you go all the way to C.

Fully agreed.  But my point is that B is merely intended to expose the 
meaning of the data in RDF -- independent of consuming applications -- 
whereas C is application-specific, and will differ from application to 
application.  The data publisher understands the meaning of the data and 
can therefore provide metadata that enables others to view the data as 
B, but cannot be expected to know what C should be (or C1, C2, C3, etc., 
for the many applications that may consume it).

> At the
> same time, by going from A to B, you’ve lost the ability to use the
> data in its native environment (e.g., you can’t open it in Excel or
> Refine or SPSS). So, in my eyes, you’re in the negative at B, and
> only come out ahead once you get pretty close to C.

Sorry, I didn't mean to imply that the table should not be published in 
its original CSV (or whatever) format as well.  Certainly that should 
still be published.  But it would be good if the publisher would *also* 
publish associated metadata that allows consumers to *view* that table 
as a form of RDF, by using the metadata to map it to B.  A consumer can 
then map B to its own model C, all within RDF.
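
To make that concrete, here is a rough sketch of the kind of 
deterministic A-to-B lift I have in mind, written in Python with rdflib 
purely for illustration.  The namespaces, row identifiers and column 
handling are invented here, not a proposal for what a standard mapping 
should actually be:

import csv
from urllib.parse import quote

from rdflib import Graph, Literal, Namespace

# Hypothetical namespaces; a real proposal would presumably derive
# these from published metadata or from the table's own URL.
BASE = Namespace("http://example.org/table/")
COL = Namespace("http://example.org/table/column#")

def lift_csv_to_b(path):
    """Deterministically lift a CSV table to a 'B'-level graph:
    one resource per row, one property per column header, and
    cell values as plain literals -- no application-specific
    (C-level) semantics added."""
    g = Graph()
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            row_uri = BASE[f"row{i}"]
            for column, value in row.items():
                if value:  # skip empty cells
                    g.add((row_uri,
                           COL[quote(column.strip())],
                           Literal(value)))
    return g

# e.g. print(lift_csv_to_b("observations.csv").serialize(format="turtle"))

The arbitrary parts -- how row identifiers are minted, how column 
headers become property URIs -- are exactly the decisions you raise 
below; my claim is only that they can be standardised once, rather than 
re-decided for every format and every application.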

>
>>> Especially, why would you dissolve a cleanly structured table
>>> into a soup of triples by inserting meaningless and arbitrary
>>> extra elements? What do you win by doing that?
>>
>> Quite the opposite.  The goal is to expose the table's intended
>> information as RDF -- no more and no less.  It is not to insert
>> meaningless or arbitrary extra elements,
>
> But it requires some arbitrary decisions. There are many ways to turn
> the same table into different graphs. How do you turn column names
> into property URIs? Is each cell a resource, or just the rows? And
> while the answers to these questions can be standardised (as you
> propose), they *do* insert extra information that isn’t a meaningful
> part of the original data, but that the query writer now needs to be
> aware of.

Yes, especially in the absence of associated metadata, that may happen. 
And even if perfect metadata is published and used, there will still be 
arbitrary artifacts intrinsic to the use of RDF.  As long as we're 
using RDF, we cannot do anything about that.  When the successor to RDF 
is invented, hopefully it will result in fewer such artifacts, but 
AFAICT, RDF is the best we've got at the moment if we want to integrate 
the data with other data, which I do.

>
>> nor to discard potentially meaningful information.
>
> Well, I will argue further below that it *does* discard meaningful
> information—the strong constraints inherent in the tabular
> structure.
>
>>> I observe that if your toolkit supports SPARQL results and
>>> Tarql, then it can already display tables of RDF terms, query
>>> them, and transform them to arbitrary graphs according to a
>>> mapping. What else would you want to do with them that requires
>>> the direct-mapping-to-triple-soup?
>>
>> Semantic transformations: transforming from one data model to
>> another. This is almost always needed when integrating data.
>
> Are you saying that you only want to go from A to B as an
> intermediate step to C? If that is the case, then I maintain my view
> that going through B is an unnecessary complication, and the focus
> should be on getting from A to C with as few intermediate steps as
> possible.
>
> Are there other reasons why you want the direct mapped triples
> *besides* writing RDF-to-RDF mappings against them?

Yes, the point is that the publisher knows B, because it reflects the 
intended information content of the table, in RDF form.  But only the 
consumer can know C, because it varies from application to application.
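
As a sketch of the consumer's side (again illustrative only -- the 
target vocabulary, and the assumption that the table had "name" and 
"email" columns, are invented), the B-to-C step is then an ordinary 
RDF-to-RDF mapping, e.g. a SPARQL CONSTRUCT run over the lifted graph:

from rdflib import Graph

# Hypothetical B-to-C mapping: rename the direct-mapped column
# properties into an invented application vocabulary.
B_TO_C = """
PREFIX col: <http://example.org/table/column#>
PREFIX app: <http://example.org/app/vocab#>
CONSTRUCT {
  ?row a app:Person ;
       app:fullName ?name ;
       app:email    ?email .
}
WHERE {
  ?row col:name  ?name ;
       col:email ?email .
}
"""

def map_b_to_c(graph_b: Graph) -> Graph:
    """Apply an application-specific SPARQL CONSTRUCT mapping to a
    B-level graph, producing the consumer's model C, entirely in RDF."""
    graph_c = Graph()
    for triple in graph_b.query(B_TO_C):  # CONSTRUCT results iterate as triples
        graph_c.add(triple)
    return graph_c

Different consumers would write different CONSTRUCT queries against the 
same B graph, which is the point: B is published once by the data 
publisher, while C is decided per application.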

>
>>> I think I'd rather like to see an RDF ecosystem with excellent
>>> support for tables as well as graphs, while you'd rather like to
>>> see an RDF ecosystem where everyone treats everything as a graph,
>>> even if it's not a graph.
>>
>> I want to decouple syntactic lift from semantic transformations, so
>> that all semantic transformations can be uniformly performed in
>> RDF.  I would rather not have to deal with a different semantic
>> transformation language for each data format.
>
> Okay, this explanation makes sense to me.
>
> But I think this again mischaracterises the Tarql approach by
> implying that it’s part of an agenda that requires a different
> transformation language for each data format.
>
> First, the Tarql approach is not for *one* data format, but easily
> addresses a large number of data formats, and can easily address even
> more with slight extensions. That’s because it is a transformation
> language for *logical tables*. The majority of data formats out there
> work with such an abstract tabular model.

Yes, that's something I like about Tarql.

>
> Second, Tarql is a syntactic subset of SPARQL (while it should
> probably be a small superset), and just requires some slight
> additions to SPARQL’s semantics. So, it’s not a new language. Just
> for context, the difference between SPARQL and Tarql is tiny compared
> to the difference between SPARQL 1.0 vs 1.1.

Yes, I think Tarql was quite an insightful piece of work.

>
>> I prefer to factor out the task of syntactic lift, from the task of
>> semantic transformation, so that: (a) the same semantic
>> transformation language and tools can be used regardless of the
>> data's source format; and (b) model bias will not be introduced
>> into RDF that is exposed.  I'll explain more what I mean in a
>> separate reply to Gregg and Andy.
>>
>> As you know, RDF can perfectly well represent tabular data,
>
> This is not true in my experience. RDF is not particularly good at
> representing tabular data. The loss of the strong guarantees of the
> schema means that the data is now harder to understand and must be
> handled much more carefully.
>
> The strong constraints inherent in the tabular model *are part of the
> data*. Conversion to RDF *discards* these constraints. Yes, sometimes
> discarding constraints is sensible for data integration (you might
> need to knock down some walls to extend a house), but it has
> consequences.

Again, I didn't mean to imply that the original table format should be 
discarded, but that a standard, deterministic lift to RDF should *also* 
be enabled, and ideally that lift should be only to B -- not to C.

David Booth

Received on Tuesday, 4 March 2014 15:38:57 UTC