Re: CSV2RDF and R2RML from James McKinney on 2014-02-18 (public-csv-wg@w3.org from February 2014)

From: James McKinney <james@opennorth.ca>
Date: Tue, 18 Feb 2014 12:42:21 -0500
To: Ivan Herman <ivan@w3.org>
Cc: Andy Seaborne <andy@apache.org>, public-csv-wg@w3.org
Message-Id: <1801791E-808F-42A5-BE3C-2E24D46C38CA@opennorth.ca>

On 2014-02-18, at 6:50 AM, Ivan Herman wrote:

> my gut feeling tells me that
> most of the CSV data out there are structurally simple (albeit possibly huge...)

The structure may be simple (to a human), but it is not consistent or easily guessable by a computer. For example, consider the headers:

Country,Population,2010,2011,2012,2013

A CSV that is optimized for computer consumption would instead have a "year" header and repeat the country and population values for each year. The UN publishes lots of CSVs like this.
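
To make the difference concrete, here is a rough sketch of reshaping the wide layout into the one-row-per-year layout. Only the header row comes from the example above. The data rows, the "value" column name, and the rule that a four-digit header means "year" are assumptions for illustration, and that last guess is exactly the kind of thing a computer can't be sure of.

```python
import csv
import io
import re

# Made-up sample data; only the header row comes from the message above.
WIDE = """Country,Population,2010,2011,2012,2013
Atlantis,1000,1,2,3,4
Lilliput,2000,5,6,7,8
"""

def wide_to_long(text):
    """Turn year-named columns into one (year, value) row per year."""
    reader = csv.DictReader(io.StringIO(text))
    # Assumption: a header made of exactly four digits is a year column.
    year_cols = [h for h in reader.fieldnames if re.fullmatch(r"\d{4}", h)]
    id_cols = [h for h in reader.fieldnames if h not in year_cols]
    rows = []
    for record in reader:
        for year in year_cols:
            # Repeat the identifying columns for every year.
            row = {col: record[col] for col in id_cols}
            row["year"] = year
            row["value"] = record[year]  # "value" is a hypothetical column name
            rows.append(row)
    return rows

long_rows = wide_to_long(WIDE)
for row in long_rows:
    print(row)
```

Each input row becomes four output rows, one per year column, with Country and Population repeated.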

You can also have CSVs which are transpositions. For example:

Variable,Doctor Who,Sherlock,Downton Abbey
Genre,...
Seasons,...
Episodes,...
Rating,...

All very simple for a human, but the variability really confuses a computer.
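
Undoing a transposition is mechanically easy once you know it is one; the hard part is detecting it. A minimal sketch, assuming every row has the same number of cells. The cell values are placeholders, since the example above elides them.

```python
import csv
import io

# Header and row labels follow the example above; the cell values are
# placeholders, because the original elides them.
TRANSPOSED = """Variable,Doctor Who,Sherlock,Downton Abbey
Genre,g1,g2,g3
Seasons,s1,s2,s3
Episodes,e1,e2,e3
Rating,r1,r2,r3
"""

def transpose_csv(text):
    """Swap rows and columns so each show becomes one record."""
    rows = list(csv.reader(io.StringIO(text)))
    # zip(*rows) pairs up the n-th cell of every row; it silently truncates
    # to the shortest row, so this assumes a rectangular table.
    return [list(column) for column in zip(*rows)]

records = transpose_csv(TRANSPOSED)
for record in records:
    print(",".join(record))
```

After transposing, the first line is the header (Variable,Genre,Seasons,Episodes,Rating) and each following line describes one show.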

Another class of problems comes from the fact that many CSVs are created from within Excel, in which case you'll often have multiple tables on the same sheet, usually with empty cells between them. You'll also often have all sorts of presentation mixed in with the data. For example, many financial statements have headers interspersed with the data. I don't know if this class of "messy tables" is in scope for the working group, though.
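
For the simplest version of the multiple-tables case, here is a sketch that splits a sheet wherever a fully empty row appears. The sample sheet is made up, and the heuristic is only an assumption about how the tables are separated: it does nothing for tables placed side by side, or for headers interspersed with the data.

```python
import csv
import io

# Made-up export of one sheet holding two tables separated by an empty row.
SHEET = """Item,Amount
Rent,100
Food,50
,
Account,Balance
Savings,900
"""

def split_tables(text):
    """Group consecutive non-empty rows into separate tables."""
    tables, current = [], []
    for row in csv.reader(io.StringIO(text)):
        if any(cell.strip() for cell in row):
            current.append(row)
        elif current:
            # A row with no content closes the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables

tables = split_tables(SHEET)
for number, table in enumerate(tables, start=1):
    print(number, table)
```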

James

Received on Tuesday, 18 February 2014 17:42:49 UTC