substructure within fields of a CSV from Tandy, Jeremy on 2014-03-11 (public-csv-wg@w3.org from March 2014)

From: Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk>
Date: Tue, 11 Mar 2014 11:52:09 +0000
To: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-ID: <2624871D9A05174691BD59F8EFD68AE2B4925B@EXXCMPD1DAG3.cmpd1.metoffice.gov.uk>

Hi - at the teleconf two weeks ago <http://www.w3.org/2014/02/26-csvw-minutes.html> I mentioned challenges around dealing with sub-structure in CSV fields. We agreed to progress this offline ... but I've been distracted :).

Alf's PLOS ONE search use case <https://www.w3.org/2013/csvw/wiki/Use_Cases#A_local_archive_of_metadata_for_a_collection_of_journal_articles> provides a good example. You'll see a double-quote escaped list of authors (list delimiter is ",") within a specific field; e.g.

10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic Antibiotics to Prevent Cellulitis of the Leg: Economic Analysis of the PATCH I & II Trials,"James M Mason,Kim S Thomas,Angela M Crook,Katharine A Foster,Joanne R Chalmers,Andrew J Nunn,Hywel C Williams"
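
As a minimal sketch of what a standard CSV parser makes of that row (Python's standard-library `csv` module; the variable names are illustrative only): the quoted author list comes back as a single string, and splitting it on "," is a convention the application has to know about out of band.

```python
import csv
import io

# The example row from the PLOS ONE use case, split across literals for readability.
ROW = (
    "10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,"
    "Prophylactic Antibiotics to Prevent Cellulitis of the Leg: "
    "Economic Analysis of the PATCH I & II Trials,"
    '"James M Mason,Kim S Thomas,Angela M Crook,Katharine A Foster,'
    'Joanne R Chalmers,Andrew J Nunn,Hywel C Williams"\n'
)

# First level: the CSV parser sees four fields; the last is one opaque string.
fields = next(csv.reader(io.StringIO(ROW)))
doi, published, title, authors_field = fields

# Second level: an application-specific split that nothing in the CSV itself declares.
authors = authors_field.split(",")
```

Nothing in the file says that the fourth field is a list while the third (which could just as easily contain commas) is not.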

Do we have any thoughts on how to treat such sub-structure?

Common examples that spring to mind where sub-structure is prevalent are:


- Date-time values; e.g. "2013-12-13T09:00Z" (xsd:dateTime)

- Geometries; e.g. "<http://www.opengis.net/def/crs/OGC/1.3/CRS84> Point(-3.405 50.737)" (geo:wktLiteral - "Well Known Text Literal" from GeoSPARQL)

Clearly these both have sub-structure, but we treat them as atomic literals that _some applications_ (or most applications in the case of date-time) may know how to parse. It's difficult to imagine how to express a generic description for sub-structure such as the author list in the example above.
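
To illustrate what "an application that knows how to parse" these literals might look like, here is a rough Python sketch. It handles only the exact shapes of the two examples above; the function names and the regular expression are assumptions made for illustration, not a full xsd:dateTime or WKT parser.

```python
import re
from datetime import datetime, timezone

def parse_utc_datetime(text):
    """Parse the 'YYYY-MM-DDTHH:MMZ' shape used in the example (UTC only)."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%MZ")
    return parsed.replace(tzinfo=timezone.utc)

# Optional '<CRS URI>' prefix followed by a 2D point; an illustrative pattern only.
WKT_POINT = re.compile(
    r"^(?:<(?P<crs>[^>]+)>\s+)?"
    r"Point\(\s*(?P<x>-?\d+(?:\.\d+)?)\s+(?P<y>-?\d+(?:\.\d+)?)\s*\)$"
)

def parse_wkt_point(text):
    """Return (crs_uri_or_None, x, y) for a point-only wktLiteral."""
    match = WKT_POINT.match(text)
    if match is None:
        raise ValueError(f"not a recognised point literal: {text!r}")
    return match.group("crs"), float(match.group("x")), float(match.group("y"))

when = parse_utc_datetime("2013-12-13T09:00Z")
crs, x, y = parse_wkt_point(
    "<http://www.opengis.net/def/crs/OGC/1.3/CRS84> Point(-3.405 50.737)"
)
```

In both cases the knowledge of the sub-structure lives in the datatype, not in the CSV file; there is no equivalent datatype for "list of author names".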

So my suggestion is to treat field values as atomic entities that cannot be decomposed further.

However, we've previously talked about "replicates" (e.g. where a field contains many values). Do we need to develop some guidance on how to express "replicates" (such as a list of Authors) in a CSV file?

Jeni's LinkedCSV proposal <http://jenit.github.io/linked-csv/> provides a method for dealing with this situation where the "replicates" are spread across multiple rows, each of which describes the same entity

(see example 15, where country "AD" <http://en.wikipedia.org/wiki/Andorra> is given names "Andorra" and "Principality of Andorra" across two rows ...


#,   $id,                                     country,english name,                   french name
,    http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra,                        Andorre
,    http://en.wikipedia.org/wiki/Andorra,    ,       Principality of Andorra,

)
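
A rough Python sketch of how a consumer might fold such rows back into one entity with multi-valued properties. This is a simplistic reading for illustration, not the processing model from the LinkedCSV proposal: it assumes rows sharing the same `$id` describe the same entity, that empty cells carry no value, and it drops the column padding from the example. The function name `merge_replicates` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# The example rows with the alignment padding removed.
DATA = """#,$id,country,english name,french name
,http://en.wikipedia.org/wiki/Andorra,AD,Andorra,Andorre
,http://en.wikipedia.org/wiki/Andorra,,Principality of Andorra,
"""

def merge_replicates(text):
    """Group rows by $id, collecting every distinct non-empty value per column."""
    entities = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(text)):
        entity = entities[row["$id"]]
        for column, value in row.items():
            # Skip the '#' marker column, the identifier itself, and empty cells.
            if column in ("#", "$id") or not value:
                continue
            if value not in entity[column]:
                entity[column].append(value)
    return {key: dict(values) for key, values in entities.items()}

merged = merge_replicates(DATA)
andorra = merged["http://en.wikipedia.org/wiki/Andorra"]
```

Here `andorra["english name"]` holds both "Andorra" and "Principality of Andorra", while "country" and "french name" each hold a single value.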

An alternative approach would be to recommend a syntax for expressing lists within a field???
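
For the sake of discussion, a minimal Python sketch of what such an in-field list syntax could look like. The separator "|" and the names `encode_list` / `decode_list` are purely hypothetical; nothing here has been agreed. The sketch refuses values that contain the separator rather than inventing an escaping rule.

```python
LIST_SEPARATOR = "|"  # hypothetical choice, not an agreed convention

def encode_list(items, separator=LIST_SEPARATOR):
    """Join list items into one field value; reject items containing the separator."""
    for item in items:
        if separator in item:
            raise ValueError(f"item contains the list separator: {item!r}")
    return separator.join(items)

def decode_list(field, separator=LIST_SEPARATOR):
    """Split a field value back into list items; an empty field is an empty list."""
    return field.split(separator) if field else []

field = encode_list(["James M Mason", "Kim S Thomas", "Angela M Crook"])
round_tripped = decode_list(field)
```

Choosing a separator other than "," would also avoid the need to quote the field, though the reader still has to be told which columns hold lists.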

Are there any other common types of sub-structure we should consider???

Thoughts welcome.

Received on Tuesday, 11 March 2014 11:52:40 UTC