Re: Scoping Question

Hi

I think the second approach could encourage everyone who has data in some
strange format to make small rewrites in order to obtain something more
portable and (hopefully) with more robust meaning.

Otherwise, the result I can see on the horizon with the first approach is
"simply" an implementation of something similar to R2RML, but for CSV. This
could be interesting, but I think it would get very tricky if someone wanted
to introduce semantics even for sparse tabular data and similar scenarios,
and it is probably far from wide adoption.

From my point of view, the best result to achieve is something similar to
JSON-LD in the CSV context: something that could ideally already be written
with existing tools, at the cost of some syntax rewriting, so that it can be
used both by users who simply want to adopt a standardized syntax and by
others who will adopt more advanced parsers to capture the semantics.
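To make the analogy concrete, here is a minimal sketch of what I mean. The
`name:type` header convention below is purely hypothetical (not a proposal,
just an illustration): a plain CSV reader still sees ordinary tabular text,
while a semantics-aware parser can recover typed values from the same file.

```python
import csv
import io

# Hypothetical convention: column headers may carry a type hint after a colon.
# An ordinary CSV tool reads this as plain text with no changes required.
raw = """name,age:integer,joined:date
Alice,34,2014-02-21
Bob,28,2013-11-05
"""

# Existing tools: a plain reader just sees strings.
plain_rows = list(csv.reader(io.StringIO(raw)))

def parse_typed_csv(text):
    """A more advanced parser that strips the hints and converts values."""
    rows = list(csv.reader(io.StringIO(text)))
    header, types = [], []
    for col in rows[0]:
        col_name, _, hint = col.partition(":")
        header.append(col_name)
        types.append(hint or "string")
    # Dates are kept as ISO strings in this sketch.
    convert = {"integer": int, "string": str, "date": str}
    return [
        {h: convert[t](v) for h, t, v in zip(header, types, row)}
        for row in rows[1:]
    ]

records = parse_typed_csv(raw)
```

The point is only that the same bytes serve both audiences: users who want a
standardized syntax keep using their existing CSV tools, while richer parsers
capture the extra semantics.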


Alfredo


2014-02-21 17:31 GMT+01:00 Jeni Tennison <jeni@jenitennison.com>:

> Hi,
>
> [Only just got net connection to enable me to send this.]
>
> A scoping question occurred to me during the call on Wednesday.
>
> There seem to be two approaches that we should explicitly choose between.
>
> APPROACH 1: Work with what’s there
>
> We are trying to create a description / metadata format that would enable
> us to layer processing semantics over the top of all the various forms of
> tabular data that people publish so that it can be interpreted in a
> standard way.
>
> We need to do a survey of what tabular data exists in its various formats
> so that we know what the description / metadata format needs to describe.
> When we find data that uses different separators, pads out the actual data
> using empty rows and columns, incorporates two or more tables inside a
> single CSV file, or uses Excel spreadsheets or DSPL packages or SDF
> packages or NetCDF or the various other formats that people have invented,
> we need to keep note of these so that whatever solution and processors we
> create will work with these files.
>
> APPROACH 2: Invent something new
>
> We are trying to create a new format that would enable publishers to
> publish tabular data in a more regular way while preserving the same
> meaning, to make it easier for consumers of that data.
>
> We need to do a survey of what tabular data exists so that we can see what
> publishers are trying to say with their data, but the format that they are
> currently publishing that data in is irrelevant because we are going to
> invent a new format. When we find data that includes metadata about tables
> and cells, or groups or has cross references between tables, or has columns
> whose values are of different types, we need to keep note of these so that
> we ensure the format we create can capture that meaning.
>
> We also need to understand existing data so that we have a good backwards
> compatibility story: it would be useful if the format we invent can be used
> with existing tools, and if existing data didn’t have to be changed very
> much to put it into the new format. But there will certainly be files that
> do have to be changed, and sometimes substantially.
>
>
> My focus is definitely on the second approach, as I think taking the first
> approach is an endless and impossible task. But some recent mails and
> discussions have made me think that some people are taking the first
> approach. Any thoughts?
>
> Cheers,
>
> Jeni
> --
> Jeni Tennison
> http://www.jenitennison.com/
>
>

Received on Friday, 21 February 2014 16:45:53 UTC