RE: substructure within fields of a CSV

Hi Alf - I completely agree with your point:

"""The most granular data (essentially one row per measurement, with many columns for labels and a single column for the measurement) is the most re-usable, but often data is published in a less re-usable table that is more human-readable."""

As I've been looking at CSV data I've noticed _many_ things that I wouldn't do if I were writing for machine parsing, but which are convenient for humans to read.

I wonder if we should provide pre-parsers that interpret a given type of microsyntax (like the author list in the example) and "unbundle" the "replicates" to "one row per author" ... which then can be more easily processed by a general CSV parser?
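
Very roughly, I imagine something along these lines (an untested sketch only; the file names are made up, and it assumes the four-column PLOS ONE layout with the quoted, comma-separated author list in the last column):

import csv

def unbundle_authors(reader, writer, authors_column=3):
    # Pre-parser sketch: expand the comma-separated "authors" field
    # into one output row per author, repeating the other fields.
    for row in reader:
        for author in row[authors_column].split(","):
            out = list(row)
            out[authors_column] = author.strip()
            writer.writerow(out)

with open("plos-search.csv", newline="") as infile, \
        open("one-row-per-author.csv", "w", newline="") as outfile:
    unbundle_authors(csv.reader(infile), csv.writer(outfile))

A general CSV parser (or the rest of the processing pipeline) could then work with the "one row per author" output without knowing anything about the microsyntax.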

Hmmm. We need some real-world examples here, but I won't be able to provide them in the short term due to other work commitments.

Jeremy

-----Original Message-----
From: Alf Eaton [mailto:eaton.alf@gmail.com] 
Sent: 13 March 2014 15:57
To: Tandy, Jeremy
Cc: W3C CSV on the Web Working Group
Subject: Re: substructure within fields of a CSV

On 11 March 2014 11:52, Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk> wrote:
> Hi – at the teleconf 2-weeks ago I mentioned challenges around dealing 
> with sub-structure in CSV fields. We agreed to progress this offline … 
> but I’ve been distracted :-).
>
>
>
> Alf’s PLOS ONE search use case provides a good example. You’ll see a 
> double-quote escaped list of authors (list delimiter is “,”) within a 
> specific field; e.g.
>
>
>
> 10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic 
> Antibiotics to Prevent Cellulitis of the Leg: Economic Analysis of the 
> PATCH I & II Trials,"James M Mason,Kim S Thomas,Angela M 
> Crook,Katharine A Foster,Joanne R Chalmers,Andrew J Nunn,Hywel C Williams"
>
>
>
> Do we have any thoughts on how to treat such sub-structure?
>
>
>
> Common examples that spring to mind where sub-structure is prevalent are:
>
>
>
> - Date-time values; e.g. “2013-12-13T09:00Z” (xsd:dateTime)
>
> - Geometries; e.g. “<http://www.opengis.net/def/crs/OGC/1.3/CRS84>
> Point(-3.405 50.737)” (geo:wktLiteral – “Well Known Text Literal” from
> GeoSPARQL)
>
>
>
> Clearly these both have sub-structure, but we treat them as atomic 
> literals that _some applications_ (or most applications in the case of 
> date-time) may know how to parse. It’s difficult to imagine how to 
> express a generic description for sub-structure such as the author list in the example above.
>
> So my suggestion is to treat field values as atomic entities that 
> cannot be decomposed further.

I have three (contradictory) responses, and I'm not sure which I prefer:

The first response is that I think this makes sense at the CSV parsing level, as it leaves options open for specifying transformations at a later point in the processing pipeline. For example, the "authors"
field in the PLOS ONE search example could even (in theory, as an extreme case) contain JSON-structured names:

10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic
Antibiotics Trials,'[{"given-names":"James","family-name":"Mason"},{"given-names":"Kim","family-name":"Thomas"}]'

The data type ("json") of that field could be described somewhere, for use when mapping the data to an object, but the CSV parser need not know about it in order to extract the field as a string.
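
As a rough illustration (just a sketch, using standard double-quote escaping rather than the single quotes above): the CSV parser hands back the field as a plain string, and a later mapping step applies the declared "json" datatype:

import csv
import json
from io import StringIO

record = ('10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,'
          'Prophylactic Antibiotics Trials,'
          '"[{""given-names"":""James"",""family-name"":""Mason""},'
          '{""given-names"":""Kim"",""family-name"":""Thomas""}]"')

row = next(csv.reader(StringIO(record)))
authors_field = row[3]               # just a string, as far as the CSV parser is concerned
authors = json.loads(authors_field)  # applied later, because column 3 is described as "json"
print(authors[0]["family-name"])     # -> Mason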

The second, simpler, response is that actually the data should have a repeated "author" field, rather than a comma-separated "authors" field (and, correspondingly, a "location" field should actually be separate "latitude" and "longitude" fields).
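
Roughly, with made-up header names, that repeated-field layout might look like:

doi,published,title,author,author,author
10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic Antibiotics Trials,James M Mason,Kim S Thomas,Angela M Crook

so the repetition lives in the header row rather than being packed into a single cell.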

The third option is that the author names should not be in that response at all, but should be in a separate set of data that contains one row per author, with a key for the article, similar to how the data would be represented in a relational database.
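
For example (sketching roughly, with made-up file and column names):

articles.csv:
doi,published,title
10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic Antibiotics Trials

article-authors.csv:
doi,author
10.1371/journal.pone.0082694,James M Mason
10.1371/journal.pone.0082694,Kim S Thomas
10.1371/journal.pone.0082694,Angela M Crook

with the DOI acting as the key joining the two tables.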

> However, we’ve previously talked about “replicates” (e.g. where a 
> field contains many values). Do we need to develop some guidance on 
> how to express “replicates” (such as a list of Authors) in a CSV file?

Another type of "replicate" data is where the table contains repeated measurements of a particular subject (e.g. taking three pulse measurements from each of multiple individuals in order to calculate an average). If the data is expressed as one row per measurement then this isn't a problem, but if the data is collapsed to one row per subject, then there are repeated values that need to be expressed either as multiple columns with the same label or as a single cell containing a list of measurements.
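
As a made-up illustration, one row per measurement would be

subject,replicate,pulse
alice,1,72
alice,2,75
alice,3,74

whereas collapsing to one row per subject forces a choice between

subject,pulse,pulse,pulse
alice,72,75,74

and

subject,pulse
alice,"72,75,74"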

On a related note, I've tried to make a spreadsheet that shows all the different ways a set of (made-up) data might need to be represented, for different types of analysis or visualisation: http://goo.gl/lzXyPO


The most granular data (essentially one row per measurement, with many columns for labels and a single column for the measurement) is the most re-usable, but often data is published in a less re-usable table that is more human-readable. I'm not quite sure how to explain that well, but I think that basically a "model for tabular data" should allow all the 2D tabular representations of the data to be generated on the fly. This probably fits with the model as described already, where each row or column label is an annotation on the data point; from that information, the 2D table used to present the information can be re-organised as needed.
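
To make that slightly more concrete (a rough sketch only, with made-up column names, and using pandas as just one possible tool): starting from the granular one-row-per-measurement table, the wider 2D presentations can be generated on demand:

import pandas as pd

# Hypothetical granular table: label columns plus a single "value" column.
long_form = pd.DataFrame([
    {"subject": "alice", "replicate": 1, "value": 72},
    {"subject": "alice", "replicate": 2, "value": 75},
    {"subject": "bob",   "replicate": 1, "value": 81},
    {"subject": "bob",   "replicate": 2, "value": 79},
])

# One presentation: subjects as rows, replicates as columns.
print(long_form.pivot_table(index="subject", columns="replicate", values="value"))

# Another: a single averaged value per subject.
print(long_form.groupby("subject")["value"].mean())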

Alf

Received on Thursday, 13 March 2014 19:41:33 UTC