RE: Scoping Question

All - great discussion.

+1 with the approach summarised by Jeni / Dan.

I think this approach is (in part) about determining what "well-structured" tabular data published in _text files_ looks like (a best practice for publishing textual tabular data?), how to group sets of these for convenient publication, and how to supplement the data files with additional metadata to support wider usage.
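
To make that concrete, here is a rough Python sketch of the sort of thing I have in mind; the file names, column names and metadata keys are purely illustrative, not an agreed vocabulary:

    import csv
    import json

    # A "well structured" tabular text file: one header row, one record per
    # row, consistent columns, UTF-8, comma-delimited. Example data invented.
    rows = [
        {"country": "GB", "year": "2013", "population": "64100000"},
        {"country": "FR", "year": "2013", "population": "65900000"},
    ]
    with open("population.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["country", "year", "population"])
        writer.writeheader()
        writer.writerows(rows)

    # Supplementary metadata kept alongside the data file; the sidecar file
    # name and the keys are placeholders for whatever gets standardised.
    metadata = {
        "title": "Population by country",
        "publisher": "Example Statistics Office",
        "columns": [
            {"name": "country", "type": "string"},
            {"name": "year", "type": "integer"},
            {"name": "population", "type": "integer"},
        ],
    }
    with open("population.csv-metadata.json", "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)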

I think that, given the choice, many people would choose to create "standard" CSV data (4 or 5-star! Thanks Eric) because it increases the utility of their data - so long as it works within their tools. Trouble is, people only do what they know.

I note the intent to stick with text-based tabular data. Stasinos (and others) have previously mentioned NetCDF-encoded tabular data. I would be concerned about extending the remit of this group to include these other encodings. Specifically for NetCDF, I would encourage people to contribute to the work in the Open Geospatial Consortium (where NetCDF has been defined as a standard). Other examples include HDF5 ... Pandora's box, methinks.

Jeremy

-----Original Message-----
From: Dan Brickley [mailto:danbri@google.com] 
Sent: 23 February 2014 21:10
To: Jeni Tennison
Cc: Alfredo Serafini; public-csv-wg@w3.org
Subject: Re: Scoping Question

On 23 February 2014 11:46, Jeni Tennison <jeni@jenitennison.com> wrote:
> Hi Alfredo,
>
> I think I’m advocating an approach based on Postel’s Law: be conservative in what you send and liberal in what you accept.
>
> Parsing the near infinite variety of weird ways in which people are publishing tabular data is about being liberal in what you accept. Standardising how that tabular data is interpreted (with the understanding that some of the weird usages might lead to misinterpretation) would be helpful for interoperability, though from what I can tell most tools that deal with importing tabular data rely heavily on intelligent user configuration rather than sniffing (unlike HTML parsers).
>
> My suggestion for step 1 is about working out what is the conservative thing that should be sent. Defining this is useful to help publishers to move towards more regular publication of tabular data. I think it’s a simpler first step, but they can be done in parallel.

There are a few levels of 'weird and wonderful' here. There's the basic entry-level horror show of trying to go from an unpredictably CSV-esque byte stream into a table without screwing up. I hope we can make some test-driven progress there. But even once you've got that table, and even if it looks reassuringly regular, there are a billion and one ways in which it might be interestingly information-bearing.
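
To make the bytes-to-table step concrete, here's a small, test-friendly Python sketch of the liberal-reading side (my own illustration, not a proposed algorithm): decode the bytes, sniff the delimiter, and fall back to plain comma-separated reading when sniffing fails.

    import csv
    import io

    def bytes_to_table(data, encoding="utf-8"):
        """Best-effort conversion of a CSV-esque byte stream into rows.

        Liberal on input: tolerate unknown delimiters by sniffing, and fall
        back to the default comma-separated dialect if sniffing fails.
        """
        text = data.decode(encoding, errors="replace")
        try:
            dialect = csv.Sniffer().sniff(text[:1024], delimiters=",;\t|")
        except csv.Error:
            dialect = csv.excel  # plain comma-separated fallback
        return list(csv.reader(io.StringIO(text), dialect))

    # A couple of test-driven checks against slightly awkward inputs.
    assert bytes_to_table(b"name,age\nAlice,34\nBob,29\n") == [
        ["name", "age"], ["Alice", "34"], ["Bob", "29"]]
    assert bytes_to_table(b"name;age\nAlice;34\nBob;29\n") == [
        ["name", "age"], ["Alice", "34"], ["Bob", "29"]]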

Here's a spreadsheet where the cells are pixels.
https://docs.google.com/a/google.com/spreadsheet/ccc?key=0AveB4CyIeYEkdGRtbW9pYVhNU2VBZnZzeGV5eHhreEE&hl=en#gid=0


Another might be streamed skeleton data from the API of a Kinect camera or similar.

Or 3d point cloud data, http://www.cansel.ca/en/our-blog/236-c3d-point-clouds


Another might be classic 'Northwind database' entity-relationship data.

Another might be basically entity-relationship data, but with hidden substructure, e.g. arrays or (SVG etc.) path notations packed into table cells; a small sketch of this follows below.

etc.
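
A tiny Python sketch of that packed-cell case, with an entirely made-up packing convention (the "|" separator and the column names are invented for illustration):

    import csv
    import io

    # One row where a cell packs an array and another packs an SVG-style path.
    raw = 'id,readings,outline\nsensor-1,"1.2|1.4|1.9","M 0 0 L 10 10 L 20 5"\n'

    for row in csv.DictReader(io.StringIO(raw)):
        readings = [float(v) for v in row["readings"].split("|")]
        outline = row["outline"].split()  # naive tokenisation of the path data
        print(row["id"], readings, outline)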

I think part of our job is to get a reasonable story about the basic bytes-to-tables situation, and document some useful subset of bytes that map well into tables. The IETF RFC is the best basis for this.
For the subsequent part, I think there is interesting and useful work that can be done for _all_ tables, at a broad-brush level of granularity. Even if the tabular content is "weird and wonderful", just writing down some basic per-CSV metadata is useful: who made it and when (e.g. Dublin Core-esque / schema.org metadata), associated entities/topics, keywords, related files such as a source XLS, associated organizations, previous versions, and so on. But many of us also want to go deeper and find ways, for a further subset of CSV, to do things like map rows in the CSV into edges in an RDF-based graph, i.e. to "look inside" the table. Still, I'd suggest we ought to also take care of a wider variety of 'weird and wonderful' CSVs at the per-document level too.

Re (1. Work with what's there) and (2. Invent something new), I think we're looking for a notational "centre of gravity" as close to the mainstream of CSV usage as possible. And then we provide a framework for describing such tables, firstly at the per-table level (no table left behind... if it's a table, it should be reasonable to say at least something about it), and then at the per-column, per-row, and per-cell levels (many weirder tables left behind, or with their subtleties only partly covered). So in these terms, I'm very much "work with what's out there" in terms of the notation, and in the desire to help people describe their existing (often weird and crappy) tables; but beyond that, there is also "invent something new" holding the promise of making something that looks like mainstream CSV (plus an annotation mechanism) serve as a familiar-looking notation for certain kinds of very modern and precise factual data. The 'certain kinds of' will need to be driven by the use cases work, but my guess is that it'll look a lot like entity-relationship graphs, perhaps with special-case attention to the needs of statistical / time-series data.
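
And for the 'look inside' direction, a hypothetical Python sketch of mapping rows of a small CSV into edges of an RDF-based graph, written out as N-Triples (the base URI, column names and predicate are invented for illustration):

    import csv
    import io

    BASE = "http://example.org/"
    raw = "person,worksFor\nalice,AcmeCorp\nbob,AcmeCorp\n"

    # Each row becomes one edge: person --worksFor--> organisation.
    triples = []
    for row in csv.DictReader(io.StringIO(raw)):
        subject = "<{}person/{}>".format(BASE, row["person"])
        obj = "<{}org/{}>".format(BASE, row["worksFor"])
        triples.append("{} <{}vocab/worksFor> {} .".format(subject, BASE, obj))

    print("\n".join(triples))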

Dan
