- From: Alf Eaton <eaton.alf@gmail.com>
- Date: Thu, 13 Mar 2014 15:57:28 +0000
- To: "Tandy, Jeremy" <jeremy.tandy@metoffice.gov.uk>
- Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
On 11 March 2014 11:52, Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk> wrote:

> Hi – at the teleconf 2 weeks ago I mentioned challenges around dealing
> with sub-structure in CSV fields. We agreed to progress this offline … but
> I’ve been distracted :-)
>
> Alf’s PLOS ONE search use case provides a good example. You’ll see a
> double-quote escaped list of authors (list delimiter is “,”) within a
> specific field; e.g.
>
> 10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic Antibiotics
> to Prevent Cellulitis of the Leg: Economic Analysis of the PATCH I & II
> Trials,"James M Mason,Kim S Thomas,Angela M Crook,Katharine A Foster,Joanne
> R Chalmers,Andrew J Nunn,Hywel C Williams"
>
> Do we have any thoughts on how to treat such sub-structure?
>
> Common examples that spring to mind where sub-structure is prevalent are:
>
> - Date-time values; e.g. “2013-12-13T09:00Z” (xsd:dateTime)
>
> - Geometries; e.g. “<http://www.opengis.net/def/crs/OGC/1.3/CRS84>
> Point(-3.405 50.737)” (geo:wktLiteral – “Well Known Text Literal” from
> GeoSPARQL)
>
> Clearly these both have sub-structure, but we treat them as atomic
> literals that _some applications_ (or most applications, in the case of
> date-time) may know how to parse. It’s difficult to imagine how to express
> a generic description for sub-structure such as the author list in the
> example above.
>
> So my suggestion is to treat field values as atomic entities that cannot
> be decomposed further.

I have three (contradictory) responses, and I'm not sure which I prefer:

The first response is that I think this makes sense at the CSV parsing level, as it leaves options open for specifying transformations at a later point in the processing pipeline.
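To illustrate the atomic-field view with a minimal sketch (using Python's standard csv module, and the PLOS ONE row from Jeremy's example): the parser hands back the escaped author list as a single string, and any further decomposition is a separate, later step.

```python
import csv
import io

# Jeremy's PLOS ONE search example: the "authors" field is a double-quote
# escaped, comma-delimited list.
row_text = ('10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,'
            'Prophylactic Antibiotics to Prevent Cellulitis of the Leg: '
            'Economic Analysis of the PATCH I & II Trials,'
            '"James M Mason,Kim S Thomas,Angela M Crook,Katharine A Foster,'
            'Joanne R Chalmers,Andrew J Nunn,Hywel C Williams"')

# The CSV parser treats the quoted field as one atomic value …
doi, date, title, authors = next(csv.reader(io.StringIO(row_text)))

# … and splitting it into sub-values happens downstream, if at all.
author_list = authors.split(',')
print(len(author_list))  # 7
```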
For example, the "authors" field in the PLOS ONE search example could even (in theory, as an extreme case) contain JSON-structured names:

10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic Antibiotics Trials,'[{"given-names":"James","family-name":"Mason"},{"given-names":"Kim","family-name":"Thomas"}]'

The data type ("json") of that field could be described somewhere, for use when mapping the data to an object, but the CSV parser need not know about it in order to extract the field as a string.

The second, simpler, response is that the data should actually have a repeated "author" field, rather than a comma-separated "authors" field (and, correspondingly, a "location" field should actually be separate "latitude" and "longitude" fields).

The third option is that the author names should not be in that response at all, but should be in a separate set of data that contains one row per author, with a key for the article, similar to how the data would be represented in a relational database.

> However, we’ve previously talked about “replicates” (e.g. where a field
> contains many values). Do we need to develop some guidance on how to
> express “replicates” (such as a list of Authors) in a CSV file?

Another type of "replicate" data is where the table contains repeated measurements of a particular subject (e.g. taking three pulse measurements from each of multiple individuals in order to get an average heart rate). If the data is expressed as one row per measurement then it's not a problem, but if the data is collapsed to one row per subject, then there are repeated values that either need to be expressed as multiple columns with the same label, or as a single cell containing a list of measurements.
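A rough sketch of the json-typed-field idea, again with Python's csv module. Note this uses standard CSV escaping (internal double quotes doubled) rather than the single quotes in the illustration above, so that a stock parser can extract the field; the `field_types` list is a hypothetical stand-in for wherever the type descriptions would live.

```python
import csv
import io
import json

# The JSON-structured "authors" field, written with standard CSV escaping
# (internal double quotes are doubled).
row_text = ('10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,'
            'Prophylactic Antibiotics Trials,'
            '"[{""given-names"":""James"",""family-name"":""Mason""},'
            '{""given-names"":""Kim"",""family-name"":""Thomas""}]"')

# Hypothetical per-field type descriptions, e.g. from separate metadata.
field_types = ['string', 'dateTime', 'string', 'json']

# The CSV parser extracts every field as an opaque string …
fields = next(csv.reader(io.StringIO(row_text)))

# … and only the later mapping step consults the declared type and decodes.
values = [json.loads(v) if t == 'json' else v
          for v, t in zip(fields, field_types)]

print(values[3][0]['family-name'])  # Mason
```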
On a related note, I've tried to make a spreadsheet that shows all the different ways a set of (made-up) data might need to be represented, for different types of analysis or visualisation: http://goo.gl/lzXyPO

The most granular data (essentially one row per measurement, with many columns for labels and a single column for the measurement value) is the most re-usable, but often data is published in a less re-usable table that is more human-readable.

I'm not quite sure how to explain that well, but I think that basically a "model for tabular data" should allow all the 2D tabular representations of the data to be generated on the fly. This probably fits with the model as described already, where each row or column label is an annotation on the data point; from that information, the 2D table used to present the information can be re-organised as needed.

Alf
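As a toy version of the "generate 2D representations on the fly" point, here is a sketch (with made-up subjects and values) that re-organises a granular one-row-per-measurement table into a wide, one-row-per-subject layout:

```python
import csv
import io

# Most granular form: one row per measurement, label columns plus a single
# value column. (Subjects and pulse values are invented for illustration.)
long_csv = """subject,trial,pulse
alice,1,72
alice,2,75
alice,3,71
bob,1,80
bob,2,78
bob,3,81
"""

rows = list(csv.DictReader(io.StringIO(long_csv)))

# Re-organise on the fly into a human-readable 2D table: one row per
# subject, one column per trial.
wide = {}
for r in rows:
    wide.setdefault(r['subject'], {})[r['trial']] = int(r['pulse'])

print(wide['alice'])  # {'1': 72, '2': 75, '3': 71}

# The collapsed form makes per-subject summaries (e.g. the averaged
# replicate) easy, at the cost of re-usability of the raw measurements.
bob_average = sum(wide['bob'].values()) / 3
```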
Received on Thursday, 13 March 2014 15:58:16 UTC