CSVs vs Tables from Jeni Tennison on 2014-06-02 (public-csv-wg@w3.org from June 2014)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Mon, 2 Jun 2014 20:01:36 +0100
To: public-csv-wg@w3.org
Cc: "rufus.pollock@okfn.org" <rufus.pollock@okfn.org>
Message-ID: <etPan.538cca10.520eedd1.5cbf@jenit.local>

Hi,

I discussed the scope of the metadata format with Rufus last week, and it raised the question of what assumptions we can make about the “CSV” documents that the metadata refers to. The three options are roughly:

1. We assume a 1:1 relationship between tables and CSV files (and between rows/columns/cells in the table and rows/columns/cells in the CSV file), and that the CSV files will be CSV+ (comma-separated, UTF-8, no padding, etc.). This enables us to have a pretty simple metadata format that just includes something like:

  {
    @id: trees.csv,
    header: true,
    notes: {
      #row=3: { … }
      #cell=4,5: { … }
    }
  }

The only information that the metadata might need to contain to inform the parsing of the CSV file is whether or not it contains a (single) header line. The advantages are that this is simple and it encourages people to publish well-formed CSV files. The fact that the rows/columns/cells are 1:1 in the table and CSV file makes referencing them easy using fragment identifiers. The disadvantage is that it excludes a whole bunch of “CSV” files that aren’t well-formed.
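As a rough illustration of why the 1:1 mapping makes referencing easy, the sketch below resolves fragments of the form used in the example above directly against the parsed file. It is only a sketch under assumptions not settled in this message: the function name `resolve_fragment` and the sample data are hypothetical, and numbering is taken to be 1-based with the header line counted as row 1.

```python
import csv
import io
import re


def resolve_fragment(rows, fragment):
    """Return the row (a list) or cell (a string) addressed by a fragment.

    Assumption: '#row=N' and '#cell=R,C' are 1-based and count the
    header line as row 1. This convention is illustrative only.
    """
    match = re.fullmatch(r"#row=(\d+)", fragment)
    if match:
        return rows[int(match.group(1)) - 1]
    match = re.fullmatch(r"#cell=(\d+),(\d+)", fragment)
    if match:
        row, col = int(match.group(1)), int(match.group(2))
        return rows[row - 1][col - 1]
    raise ValueError(f"unsupported fragment: {fragment!r}")


# Hypothetical stand-in for trees.csv: comma-separated, one header line.
SAMPLE = "id,species,height\n1,oak,12\n2,elm,9\n3,ash,15\n"
rows = list(csv.reader(io.StringIO(SAMPLE)))

third_row = resolve_fragment(rows, "#row=3")
one_cell = resolve_fragment(rows, "#cell=4,2")
```

Because the table and the file share the same coordinates, no translation step is needed between a fragment and the parsed rows.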

2. We assume a 1:1 relationship between tables and CSV files (and between rows/columns/cells in the table and rows/columns/cells in the CSV file), but allow the “CSV” files some latitude in how they’re formed (specifically, enabling different separators & escape sequences, and different encodings, but not supporting padding). The metadata format then needs to have properties that provide some information about how the file should be parsed, so something like:

  {
    @id: trees.csv,
    parsing: {
      header: true,
      encoding: ISO-8859-1,
      separator: ;,
      escape: \"
    },
    notes: {
      #row=3: { … }
      #cell=4,5: { … }
    }
  }

This is still fairly simple, but we do have to specify some parsing options. The fact that the rows/columns/cells are 1:1 in the table and CSV file makes referencing them easy using fragment identifiers. It gives people more latitude in the CSV files that they publish and enables metadata to be used to describe some (but not all) legacy tabular data.
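To make the role of the parsing properties concrete, here is a minimal sketch of how a consumer might apply them using Python's standard `csv` module. The function name `read_table`, the dictionary layout, and the sample bytes are hypothetical, and the sketch assumes that `escape: \"` means a quote inside a quoted field is written as backslash-quote rather than as a doubled quote.

```python
import csv
import io


def read_table(raw, parsing):
    """Decode and parse raw CSV bytes according to a 'parsing' dict.

    Returns (header, rows); header is None when parsing['header'] is false.
    """
    text = raw.decode(parsing.get("encoding", "utf-8"))
    options = {"delimiter": parsing.get("separator", ",")}
    if parsing.get("escape") == '\\"':
        # Assumption: backslash-quote replaces the doubled-quote convention.
        options["escapechar"] = "\\"
        options["doublequote"] = False
    rows = list(csv.reader(io.StringIO(text, newline=""), **options))
    if parsing.get("header"):
        return rows[0], rows[1:]
    return None, rows


# Hypothetical file content: ISO-8859-1 bytes, semicolon-separated.
RAW = b'name;note\nCaf\xe9;"said \\"hi\\"; left"\n'
PARSING = {
    "header": True,
    "encoding": "ISO-8859-1",
    "separator": ";",
    "escape": '\\"',
}

header, data = read_table(RAW, PARSING)
```

The table that results still lines up row for row and column for column with the file, so fragment identifiers keep working unchanged; only the lexical details of reading the file differ.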

3. We don’t assume a 1:1 relationship between tables and CSV files, but support the full set of parsing options that are outlined in the Tabular Model document. This means the references to both the source data file and the rows/columns/cells within it have to be done slightly differently, for example:

  {
    source: {
      @id: trees.csv#cell=3,4-31453,19, // a region of a CSV file
      header: true,
      encoding: ISO-8859-1,
      separator: ;,
      escape: \"
    },
    notes: [{
      type: row,
      number: 3,
      ...
    }, {
      type: cell,
      row: 4,
      column: 5,
      ...
    }]
  }

This is great in terms of being general purpose, but it involves more specification work and I think it will be prone to confusion about the mismatch between row/column numbers in the “CSV” file vs those in the described table.
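The mismatch can be shown with a small sketch that translates table coordinates into file coordinates for a region like the one in the example. The names `Region`, `parse_region` and `table_to_csv` are hypothetical, and the sketch assumes 1-based numbering in which table row 1, column 1 is the top-left cell of the region; how a header line inside the region would affect table row numbers is left open here.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    first_row: int
    first_col: int
    last_row: int
    last_col: int


def parse_region(fragment):
    """Parse a fragment of the form '#cell=R1,C1-R2,C2' into a Region."""
    match = re.fullmatch(r"#cell=(\d+),(\d+)-(\d+),(\d+)", fragment)
    if not match:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    return Region(*(int(group) for group in match.groups()))


def table_to_csv(region, row, col):
    """Map 1-based table coordinates to 1-based coordinates in the file."""
    csv_row = region.first_row + row - 1
    csv_col = region.first_col + col - 1
    if csv_row > region.last_row or csv_col > region.last_col:
        raise IndexError("cell lies outside the source region")
    return csv_row, csv_col


region = parse_region("#cell=3,4-31453,19")
# The note on table cell (row 4, column 5) refers to a different file cell.
file_cell = table_to_csv(region, 4, 5)
```

Under these assumptions the note on table cell 4,5 actually points at row 6, column 8 of the file, which is exactly the kind of offset a reader of the metadata would have to keep in mind.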


I am in favour of #1 or #2 and not doing #3. #2 seems to meet the use cases / requirements that we’ve gathered so far (excepting the deferred requirement for multiple heading rows). I wanted to double-check that this meets everyone’s expectations.

Jeni
--  
Jeni Tennison
http://www.jenitennison.com/
Received on Monday, 2 June 2014 19:02:03 UTC