Re: substructure within fields of a CSV from Alf Eaton on 2014-03-13 (public-csv-wg@w3.org from March 2014)

From: Alf Eaton <eaton.alf@gmail.com>
Date: Thu, 13 Mar 2014 15:57:28 +0000
To: "Tandy, Jeremy" <jeremy.tandy@metoffice.gov.uk>
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-ID: <CAJVrAaSoFMpLq8rUqcenxbL35M6Rc7=xVZOH__S6A9a0EeZvYw@mail.gmail.com>
On 11 March 2014 11:52, Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk> wrote:
> Hi – at the teleconf 2-weeks ago I mentioned challenges around dealing with
> sub-structure in CSV fields. We agreed to progress this offline … but I’ve
> been distracted J.
>
>
>
> Alf’s PLOS ONE search use case provides a good example. You’ll see a
> double-quote escaped list of authors (list delimiter is “,”) within a
> specific field; e.g.
>
>
>
> 10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic Antibiotics
> to Prevent Cellulitis of the Leg: Economic Analysis of the PATCH I & II
> Trials,"James M Mason,Kim S Thomas,Angela M Crook,Katharine A Foster,Joanne
> R Chalmers,Andrew J Nunn,Hywel C Williams"
>
>
>
> Do we have any thoughts on how to treat such sub-structure?
>
>
>
> Common examples that spring to mind where sub-structure is prevalent are:
>
>
>
> -          Date-time values; e.g. “2013-12-13T09:00Z” (xsd:dateTime)
>
> -          Geometries; e.g. “<http://www.opengis.net/def/crs/OGC/1.3/CRS84>
> Point(-3.405 50.737)” (geo:wktLiteral – “Well Known Text Literal” from
> GeoSPARQL)
>
>
>
> Clearly these both have sub-structure, but we treat them as atomic literals
> that _some applications_ (or most applications in the case of date-time) may
> know how to parse. It’s difficult to imagine how to express a generic
> description for sub-structure such as the author list in the example above.
>
> So my suggestion is to treat field values as atomic entities that cannot be
> decomposed further.

I have three (contradictory) responses, and I'm not sure which I prefer:

The first response is that I think this makes sense at the CSV parsing
level, as it leaves options open for specifying transformations at a
later point in the processing pipeline. For example, the "authors"
field in the PLOS ONE search example could even (in theory, as an
extreme case) contain JSON-structured names:

10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic
Antibiotics Trials,'[{"given-names":"James","family-name":"Mason"},{"given-names":"Kim","family-name":"Thomas"}]'

The data type ("json") of that field could be described somewhere, for
use when mapping the data to an object, but the CSV parser need not
know about it in order to extract the field as a string.

The second, simpler, response is that actually the data should have a
repeated "author" field, rather than a comma-separated "authors" field
(and, correspondingly, a "location" field should actually be separate
"latitude" and "longitude" fields).

The third option is that the author names should not be in that
response at all, but should be in a separate set of data that contains
one row per author, with a key for the article, similar to how the
data would be represented in a relational database.

> However, we’ve previously talked about “replicates” (e.g. where a field
> contains many values). Do we need to develop some guidance on how to express
> “replicates” (such as a list of Authors) in a CSV file?

Another type of "replicate" data is where the table contains repeated
measurements of a particular subject (e.g. taking three heart rate
pulse measurements in order to get an average measurement, from each
of multiple individuals). If the data is expressed as one row per
measurement then it's not a problem, but if the data is collapsed to
one row per subject, then there are repeated values that either need
to be expressed as multiple columns with the same label, or as a
single cell containing a list of measurements.

On a related note, I've tried to make a spreadsheet that shows all the
different ways a set of (made-up) data might need to be represented,
for different types of analysis or visualisation: http://goo.gl/lzXyPO

The most granular data (essentially one row per measurement, with many
columns for labels and a single column for the measurement) is the
most re-usable, but often data is published in a less re-usable table
that is more human-readable. I'm not quite sure how to explain that
well, but I think that basically a "model for tabular data" should
allow all the 2D tabular representations of the data to be generated
on the fly. This probably fits with the model as described already,
where each row or column label is an annotation on the data point;
from that information, the 2D table used to present the information
can be re-organised as needed.

Alf
Received on Thursday, 13 March 2014 15:58:16 UTC