Re: Scoping Question from Dan Brickley on 2014-02-23 (public-csv-wg@w3.org from February 2014)

From: Dan Brickley <danbri@google.com>
Date: Sun, 23 Feb 2014 13:10:04 -0800
To: Jeni Tennison <jeni@jenitennison.com>
Cc: Alfredo Serafini <seralf@gmail.com>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <CAK-qy=7DdZYT11FJi_JztC4PMjFq=5wYcDeAEVijQAxWuHXvsA@mail.gmail.com>
On 23 February 2014 11:46, Jeni Tennison <jeni@jenitennison.com> wrote:
> Hi Alfredo,
>
> I think I’m advocating an approach based on Postel’s Law: be conservative in what you send and liberal in what you accept.
>
> Parsing the near infinite variety of weird ways in which people are publishing tabular data is about being liberal in what you accept. Standardising how that tabular data is interpreted (with the understanding that some of the weird usages might lead to misinterpretation) would be helpful for interoperability, though from what I can tell most tools that deal with importing tabular data rely heavily on intelligent user configuration rather than sniffing (unlike HTML parsers).
>
> My suggestion for step 1 is about working out what is the conservative thing that should be sent. Defining this is useful to help publishers to move towards more regular publication of tabular data. I think it’s a simpler first step, but they can be done in parallel.

There are a few levels of 'weird and wonderful' here. There's the
basic entry-level horror show of trying to go from an unpredictably
CSV-esque byte stream into a table without screwing up. I hope we can
make some test-driven progress there. But even once you've got that
table, and even if it looks reassuringly regular, there are a
billion and one ways in which it might be interestingly
information-bearing.
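
To make the entry-level step concrete, here is a minimal sketch using Python's standard csv module. It assumes the encoding is already known (UTF-8 here) and that the bytes are well-behaved enough for the default dialect, which real-world files often are not. The function names are made up for illustration.

```python
import csv
import io

def bytes_to_table(raw: bytes, encoding: str = "utf-8"):
    """Decode a CSV-esque byte stream and parse it into a list of rows.

    Assumption: the caller already knows the encoding; no sniffing is done.
    """
    text = raw.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks inside quoted fields itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    return [row for row in reader]

def is_regular(table):
    """True if every row has the same number of cells."""
    return len({len(row) for row in table}) <= 1

# A small sample with a quoted comma and a quoted line break.
sample = b'name,notes\r\n"Smith, J","line one\r\nline two"\r\nJones,ok\r\n'
table = bytes_to_table(sample)
```

A check like is_regular only says the table has a consistent shape; it says nothing about what the cells mean, which is the harder problem.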

Here's a spreadsheet where the cells are pixels.
https://docs.google.com/a/google.com/spreadsheet/ccc?key=0AveB4CyIeYEkdGRtbW9pYVhNU2VBZnZzeGV5eHhreEE&hl=en#gid=0

Another might be streamed skeleton data from an API to a Kinect camera
or similar.

Or 3d point cloud data, http://www.cansel.ca/en/our-blog/236-c3d-point-clouds

Another might be classic 'Northwind database' entity relationship data.

Another might be basically entity-relationship, but with hidden
substructure, e.g. arrays or (svg etc.) path notations packed into
table cells.

etc.
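
As an illustration of the hidden-substructure case, here is a rough sketch of a cell that packs a list of points into one string. The separators (";" between points, "," within a point) and the sample row are invented for the example; a real file would need its own convention documented somewhere.

```python
def unpack_cell(cell, separator=";"):
    """Split a packed cell into its non-empty, trimmed parts."""
    return [part.strip() for part in cell.split(separator) if part.strip()]

# Hypothetical row: an identifier plus a path packed into a single cell.
row = ["route-7", "0,0; 10,5; 20,5"]

# Turn each "x,y" part into a pair of integers.
points = [tuple(int(n) for n in part.split(",")) for part in unpack_cell(row[1])]
```

A generic CSV parser sees only one string in that second cell; the structure is recoverable only if something outside the CSV says how to read it.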

I think part of our job is to get a reasonable story about the basic
bytes-to-tables situation, and document some useful subset of bytes
that map well into tables. The IETF RFC (RFC 4180) is the best basis for this.
For the subsequent part, I think there is interesting and useful work
that can be done for _all_ tables, at a broad brush level of
granularity. Even if the tabular content is "weird and wonderful",
just writing down some basic per-CSV metadata (who made it, when, e.g.
Dublin Core -esque / schema.org metadata, associated entities/topics,
keywords, related file e.g. source XLS, associated organizations,
previous versions...) all those things are useful. But many of us also
want to go deeper and find ways, for a further subset of CSV, to do
things like map rows in the CSV into edges in an RDF-based graph; i.e.
to "look inside" the table. But I'd suggest we ought to also take care
of a wider variety of 'weird and wonderful' CSVs at the per-document
level too.
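
The two levels described above might be sketched roughly as follows, using plain tuples as (subject, predicate, object) edges rather than any particular RDF library. The Dublin Core terms namespace is real; the example.org base URI, the function names, and the choice of one key column as the row's subject are all assumptions made for illustration, not a proposed mapping.

```python
import csv
import io

DCT = "http://purl.org/dc/terms/"  # Dublin Core terms namespace

def describe_document(url, creator, created):
    """Per-document metadata: usable for any CSV, however weird its content."""
    return [
        (url, DCT + "creator", creator),
        (url, DCT + "created", created),
    ]

def rows_to_edges(text, key_column, base):
    """'Look inside' a regular table: one edge per non-empty, non-key cell.

    Assumption: key_column identifies the row, and each column header can
    be appended to `base` to form a predicate. Both are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(text, newline=""))
    edges = []
    for row in reader:
        subject = base + row[key_column]
        for column, value in row.items():
            # Skip the key itself, empty or missing cells, and overflow cells
            # (DictReader files extra cells under the key None).
            if column is None or column == key_column or not value:
                continue
            edges.append((subject, base + column, value))
    return edges

text = "id,name,country\n1,Alice,UK\n2,Bob,\n"
doc_edges = describe_document("http://example.org/people.csv", "Example Org", "2014-02-23")
row_edges = rows_to_edges(text, "id", "http://example.org/")
```

The first function needs to know nothing about the cells, which is the point of the per-document level; the second only makes sense for the further subset of CSVs that really are entity-relationship shaped.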

Re (1. Work with what’s there) and (2. Invent something new) I think
we're looking for a notational "centre of gravity" as close to the
mainstream of CSV usage as possible. And then we provide a framework
for describing such tables firstly at the per-table level (no table
left behind... if it's a table, it should be reasonable to say at
least something about it), and then at the per column, row, and cell
levels (many weirder tables left behind, or whose subtleties are only
partly covered). So in these terms, I'm very much "work with what's
out there" in terms of the notation, and the desire to help people
describe their existing (often weird and crappy) tables; but beyond
that, there is also "invent something new" holding the promise of
making something that looks like mainstream CSV (plus an annotation
mechanism) serve as a familiar looking notation for certain kinds of
very modern and precise factual data. The 'certain kinds of' will need
to be driven by the use cases work, but my guess is that it'll look a
lot like entity-relationship graphs perhaps with special case
attention to the needs of statistical / time-series data.

Dan
Received on Sunday, 23 February 2014 21:10:32 UTC