- From: Dan Brickley <danbri@google.com>
- Date: Sun, 23 Feb 2014 13:10:04 -0800
- To: Jeni Tennison <jeni@jenitennison.com>
- Cc: Alfredo Serafini <seralf@gmail.com>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
On 23 February 2014 11:46, Jeni Tennison <jeni@jenitennison.com> wrote:
> Hi Alfredo,
>
> I think I’m advocating an approach based on Postel’s Law: be conservative in what you send and liberal in what you accept.
>
> Parsing the near infinite variety of weird ways in which people are publishing tabular data is about being liberal in what you accept. Standardising how that tabular data is interpreted (with the understanding that some of the weird usages might lead to misinterpretation) would be helpful for interoperability, though from what I can tell most tools that deal with importing tabular data rely heavily on intelligent user configuration rather than sniffing (unlike HTML parsers).
>
> My suggestion for step 1 is about working out what is the conservative thing that should be sent. Defining this is useful to help publishers to move towards more regular publication of tabular data. I think it’s a simpler first step, but they can be done in parallel.

There are a few levels of 'weird and wonderful' here. There's the basic entry-level horror show of trying to go from an unpredictably CSV-esque byte stream into a table without screwing up. I hope we can make some test-driven progress there.

But even once you've got that table, and even if it looks reassuringly regular, there are a billion and one ways in which it might be interestingly information-bearing.

Here's a spreadsheet where the cells are pixels:
https://docs.google.com/a/google.com/spreadsheet/ccc?key=0AveB4CyIeYEkdGRtbW9pYVhNU2VBZnZzeGV5eHhreEE&hl=en#gid=0

Another might be streamed skeleton data from an API to a Kinect camera or similar. Or 3D point cloud data:
http://www.cansel.ca/en/our-blog/236-c3d-point-clouds

Another might be classic 'northwinds database' entity-relationship data.

Another might be basically entity-relationship, but with hidden substructure, e.g. arrays or (SVG etc.) path notations packed into table cells.

etc.
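A minimal sketch of the bytes-to-table step, in the "liberal in what you accept" spirit. It assumes Python's standard `csv` module and a decoding fallback from UTF-8 to Latin-1; that fallback and the function names `bytes_to_table` and `is_regular` are illustrative assumptions, not anything agreed in this thread.

```python
import csv
import io


def bytes_to_table(data: bytes) -> list[list[str]]:
    """Parse a CSV-esque byte stream into a list of rows.

    Assumption: try UTF-8 first and fall back to Latin-1, which
    accepts any byte sequence (possibly producing the wrong characters).
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        text = data.decode("latin-1")
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks inside quoted cells itself.
    return list(csv.reader(io.StringIO(text, newline="")))


def is_regular(table: list[list[str]]) -> bool:
    """True when every row has the same number of cells."""
    return len({len(row) for row in table}) <= 1
```

A table that passes `is_regular` is only "reassuringly regular" in shape; the check says nothing about what the cells mean.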
I think part of our job is to get a reasonable story about the basic bytes-to-tables situation, and document some useful subset of bytes that map well into tables. The IETF RFC (RFC 4180) is the best basis for this.

For the subsequent part, I think there is interesting and useful work that can be done for _all_ tables, at a broad-brush level of granularity. Even if the tabular content is "weird and wonderful", just writing down some basic per-CSV metadata (who made it, when, e.g. Dublin Core-esque / schema.org metadata, associated entities/topics, keywords, related files e.g. source XLS, associated organizations, previous versions...) is useful. But many of us also want to go deeper and find ways, for a further subset of CSV, to do things like map rows in the CSV into edges in an RDF-based graph; i.e. to 'Look Inside' the table. But I'd suggest we ought to also take care of a wider variety of 'weird and wonderful' CSVs at the per-document level too.

Re (1. Work with what’s there) and (2. Invent something new), I think we're looking for a notational "centre of gravity" as close to the mainstream of CSV usage as possible. And then we provide a framework for describing such tables firstly at the per-table level (no table left behind... if it's a table, it should be reasonable to say at least something about it), and then at the per-column, per-row, and per-cell levels (many weirder tables left behind, or whose subtleties are only partly covered).

So in these terms, I'm very much "work with what's out there" in terms of the notation, and the desire to help people describe their existing (often weird and crappy) tables; but beyond that, there is also "invent something new", holding the promise of making something that looks like mainstream CSV (plus an annotation mechanism) serve as a familiar-looking notation for certain kinds of very modern and precise factual data.
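To make the two levels concrete, here is a rough sketch that assumes a regular table with a header row and one column acting as the row identifier. The names `describe_table` and `rows_to_edges`, the metadata keys, and the plain (subject, predicate, object) tuples standing in for RDF triples are all hypothetical; no annotation format or vocabulary is being proposed here.

```python
import csv
import io


def describe_table(text: str, creator: str, created: str) -> dict:
    """Per-table level: say something about any table, however weird."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    return {
        "creator": creator,          # hypothetical key, Dublin Core-esque
        "created": created,          # hypothetical key
        "columns": rows[0] if rows else [],
        "row_count": max(len(rows) - 1, 0),
    }


def rows_to_edges(text: str, key_column: str) -> list[tuple[str, str, str]]:
    """Per-row level: turn each non-empty cell into a graph edge.

    The key column supplies the subject, the column name the
    predicate, and the cell value the object.
    """
    edges = []
    for row in csv.DictReader(io.StringIO(text, newline="")):
        subject = row[key_column]
        for column, value in row.items():
            # Skip the key itself, empty cells, and overflow cells
            # (DictReader files those under the key None).
            if column is not None and column != key_column and value:
                edges.append((subject, column, value))
    return edges
```

For example, `rows_to_edges("id,name,city\n1,Ada,London\n", "id")` yields one edge per filled cell in the row. `describe_table` works on any table that parses, while `rows_to_edges` only makes sense for the entity-relationship-shaped subset.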
The 'certain kinds of' will need to be driven by the use cases work, but my guess is that it'll look a lot like entity-relationship graphs, perhaps with special-case attention to the needs of statistical / time-series data.

Dan
Received on Sunday, 23 February 2014 21:10:32 UTC