Re: CSV2RDF and R2RML from Ivan Herman on 2014-02-19 (public-csv-wg@w3.org from February 2014)

From: Ivan Herman <ivan@w3.org>
Date: Wed, 19 Feb 2014 10:28:19 +0100
To: James McKinney <james@opennorth.ca>
CC: Andy Seaborne <andy@apache.org>, public-csv-wg@w3.org
Message-ID: <53047933.1020500@w3.org>

James McKinney wrote:
> 
> On 2014-02-18, at 6:50 AM, Ivan Herman wrote:
> 
>> my gut feeling tells me that
>> most of the CSV data out there are structurally simple (albeit possibly huge...)
> 
> The structure may be simple (to a human), but it is not consistent or easily
> guessable by a computer. For example, consider the headers:
> 
> Country,Population,2010,2011,2012,2013
> 
> A CSV that is optimized for computer-consumption would instead have a "year"
> header and repeat the country and population values for each year. The UN
> publishes lots of CSVs like this.
> 
> You can also have CSVs which are transpositions. For example:
> 
> Variable,Doctor Who,Sherlock,Downton Abbey
> Genre,...
> Seasons,...
> Episodes,...
> Rating,...
> 
> All very simple for a human, but the variability really confuses a computer.
> 
> Another class of problems comes from the fact that many CSVs are created from
> within Excel, in which case you'll often have multiple tables on the same sheet,
> usually with empty cells between them. You'll also often have all sorts of
> presentation mixed in with the data. For example, many financial statements will
> have headers interspersed with the data. I don't know if this class of "messy
> tables" is in scope for the working group, though.

Sigh:-) I guess, for the time being, messy tables are not out of scope (yet?)...

What this tells me, though, is that there is only so much we can do to
provide clean data. At this moment we are talking about conversion to
JSON, RDF, XML, or whatever: in all cases there is a level of cleanup that
_will_ be in the realm of the data consumer, no matter what. We should not try
to cover all the pathological cases...

To take the example above with

Country,Population,2010,2011,2012,2013

if the generated JSON is simply a copy of that, ie,

{
 "2000" : "true",
 "2010" : "false",
 ...
}

one can easily write a post-processing program that transforms this data into a
more suitable form for that specific case, but I have difficulty imagining how
we would define some sort of generic, almost-Turing-complete language to define
transformations in general... For this case even the @context of JSON-LD would
not help.
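
To make the "post-processing program" concrete, here is a minimal sketch in
Python of such a case-specific cleanup. It assumes that every header made up
only of digits is a year column and that all other columns identify the row;
the function name, the output keys "year" and "value", and the sample values
are invented for illustration only.

```python
import csv
import io


def unpivot_years(csv_text):
    """Turn one-column-per-year rows into one row per (record, year)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    # Assumption: an all-digit header is a year; everything else is an id column.
    year_columns = [h for h in headers if h.isdigit()]
    id_columns = [h for h in headers if not h.isdigit()]
    rows = []
    for record in reader:
        for year in year_columns:
            row = {c: record[c] for c in id_columns}
            row["year"] = int(year)
            row["value"] = record[year]
            rows.append(row)
    return rows


# Placeholder data; only the header shape follows the example above.
SAMPLE = "Country,Population,2010,2011\nA,x,1,2\nB,y,3,4\n"
long_rows = unpivot_years(SAMPLE)
```

The rule "digits means year" is exactly the kind of knowledge that lives with
the data consumer: it works for this header and would be wrong for many others.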

I guess what we may do is to analyze the use cases to see how frequent the
various pathological cases are, and we may then be able to add metadata
information signaling those. But we will not cover all.

As for the multiple tables within the same file: do you mean that the data is such
that its structure is not homogeneous, ie, that it is as if several csv files,
with different structures, were concatenated together? Now *that* is really messy:-(
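
If that reading is right, a consumer could at least separate the pieces again.
The sketch below, in Python, assumes the tables are stacked vertically and
separated by rows whose cells are all empty; tables placed side by side on the
same sheet are not handled. The function name and sample data are invented for
illustration.

```python
import csv
import io


def split_tables(csv_text):
    """Split CSV text into separate tables at rows whose cells are all empty."""
    tables, current = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if any(cell.strip() for cell in row):
            # A row with at least one non-empty cell belongs to the current table.
            current.append(row)
        elif current:
            # An empty row closes the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Two tables with different widths, separated by a row of empty cells.
SAMPLE = "a,b\n1,2\n,\nx,y,z\n3,4,5\n"
tables = split_tables(SAMPLE)
```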

Ivan

B.t.w., my original remark referred to the 'foreign key' issue; ie, that we
can forget about those RDB terms for CSV... I hope that does hold, although your
remark about several tables within the same CSV file scared me.

> 
> James

Received on Wednesday, 19 February 2014 09:28:56 UTC