Re: Scoping Question from Raj Singh on 2014-02-24 (public-csv-wg@w3.org from February 2014)

From: Raj Singh <rsingh@opengeospatial.org>
Date: Mon, 24 Feb 2014 13:25:23 -0500
To: Yakov Shafranovich <yakov-ietf@shaftek.org>
Cc: Bill Roberts <bill@swirrl.com>, Jeni Tennison <jeni@jenitennison.com>, public-csv-wg@w3.org
Message-Id: <DACEFD0D-9709-4501-9E02-51CB7586FAED@opengeospatial.org>

I thought the lesson learned from the "Open Data on the Web" workshop [1] was that, after decades of data formatting specifications in XML, RDF, etc., most data in the world still does not use any of those formats. Creating a new format will therefore have minimal impact on web data linking activity. However, there is a great deal of structure inherent in the CSV files out in the wild, and we need to embrace that structure and offer guidance on how to use it.

So I agree with Yakov's "extended" APPROACH 1. And our work over the next year should be on #2 and #3 below.

1. CSV on the web should be used as-is whenever possible.
2. We should define some defaults (or assumptions) about CSV in the wild (e.g. the URI of the data set is the URL of the CSV file, and the URI for the field name is the field name appended to the data set URI).
   - This allows more advanced users to leverage data from less advanced users, e.g. to define linked data sets.
3. An optional second file (or directory) would contain structured metadata about the CSV file that would override any assumptions.
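
To make #2 and #3 concrete, here is a minimal sketch in Python of how a consumer might apply the defaults and then let metadata override them. It is only an illustration under stated assumptions: the function name `default_field_uris`, the `overrides` dictionary standing in for the optional metadata file, the use of "#" to append the field name, and the example.org URLs are all hypothetical, not anything the group has agreed on.

```python
import csv
import io
from urllib.parse import quote


def default_field_uris(csv_url, csv_text, overrides=None):
    """Return a {field name: URI} mapping for the header row of a CSV file.

    Default (point 2): the data set URI is the CSV URL, and each field URI
    is the field name appended to it. Joining with "#" and percent-encoding
    the name are assumptions made for this sketch.
    Override (point 3): any entry in `overrides`, which stands in for the
    optional metadata file, replaces the default for that field.
    """
    overrides = overrides or {}
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields no fields
    uris = {}
    for name in header:
        default = csv_url + "#" + quote(name, safe="")
        uris[name] = overrides.get(name, default)
    return uris


# Hypothetical example data.
csv_url = "http://example.org/data/stations.csv"
csv_text = "station id,name,elevation\n1,Alpha,12\n"
metadata = {"elevation": "http://example.org/def/elevationMetres"}

plain = default_field_uris(csv_url, csv_text)
with_metadata = default_field_uris(csv_url, csv_text, metadata)
```

A publisher who supplies no metadata still gets usable identifiers for every column from `plain`, while `with_metadata` differs only for the fields the metadata mentions. The CSV file itself is never modified.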

[1] http://www.w3.org/2013/04/odw/report

-----
Raj Singh
rsingh@opengeospatial.org
+1 (617) 642-9372
The OGC: Making location count.
http://www.opengeospatial.org/ogc/organization/staff/rsingh




On Feb 22, 2014, at 10:17 PM, Yakov Shafranovich <yakov-ietf@shaftek.org> wrote:

> (Been lurking for a while, I am the original author of RFC 4180 on CSV files)
> 
> I am leaning towards approach #1 as well, which is what I tried
> to follow when originally writing up RFC 4180.
> 
> One possibility may be a hybrid approach, where a sort of CSV-plus
> format exists that would be sufficiently compatible with what is
> already out there while adding some new features. An example of that
> would be leaving the main CSV file format intact, while perhaps
> narrowing down the specification format, and allowing a second file or
> directory to carry metadata about the CSV. Those who do not want to
> use the metadata will ignore it, and those who do will use it.
> That is the approach we followed when developing the ARF
> format for reporting spam at the IETF.
> 
> For example, there was another thread discussing multiple "sheets" in
> a single CSV file. That can be accomplished by having some sort of a
> standard naming scheme plus a special metadata directory or file
> carrying data describing how those sheets relate to each other, or
> even just a ZIP file convention as suggested by Chris Metcalf. Those
> users who do not wish to use it will simply ignore the metadata
> directory or file, and consume the CSV files as they are. Others will
> take advantage of the metadata and use it.
> 
> On the other hand, if we follow the approach suggested by Craig
> Russell, will those files with the split line breaks work for existing
> users?
> 
> Thanks,
> Yakov
> 
> On Fri, Feb 21, 2014 at 1:32 PM, Bill Roberts <bill@swirrl.com> wrote:
>> Hi Jeni
>> 
>> APPROACH 2 seems to me to be the only sensible option for the group to work
>> on.
>> 
>> The point (if I understand correctly) is to help to make it easy for people
>> to publish their tabular data in a better way, with some metadata about what
>> it all means, while using formats that non-specialist data consumers can
>> easily understand and use.
>> 
>> People will no doubt continue to publish all the messy and imperfect CSV
>> variants that are currently found in the wild.  But for those who care
>> enough to add some metadata and think about the semantics of what they are
>> publishing, then they may as well use a new CSV+ format in order to do it.
>> They already have to choose to take a step beyond 'thoughtless' CSV, so make
>> the task easy for consumers and get them to follow some standards.
>> 
>> The backward compatibility for consumers is important - i.e. it should be
>> possible to use the new format with the tools that people are familiar with
>> (Excel etc) and for people who want to ignore all the semantic metadata to
>> be able to do so.
>> 
>> If the new format is not (at least mostly) usable by tools that people
>> currently use for CSV, then I'm not sure there is much point - there are plenty
>> of other formats available, such as all the variants of RDF, which will do
>> the job very nicely, except for the downside of not being well-supported by
>> tools of non-specialists!
>> 
>> The challenge of post-fitting structure and semantics to the messy CSV will
>> still be there and is an important problem, but it's a different problem I
>> think.
>> 
>> Best regards
>> 
>> Bill
>> 
>> 
>> 
>> 
>> 
>> On 21 Feb 2014, at 17:31, Jeni Tennison <jeni@jenitennison.com> wrote:
>> 
>> Hi,
>> 
>> [Only just got net connection to enable me to send this.]
>> 
>> A scoping question occurred to me during the call on Wednesday.
>> 
>> There seem to be two approaches that we should explicitly choose between.
>> 
>> APPROACH 1: Work with what's there
>> 
>> We are trying to create a description / metadata format that would enable us
>> to layer processing semantics over the top of all the various forms of
>> tabular data that people publish so that it can be interpreted in a standard
>> way.
>> 
>> We need to do a survey of what tabular data exists in its various formats so
>> that we know what the description / metadata format needs to describe. When
>> we find data that uses different separators, pads out the actual data using
>> empty rows and columns, incorporates two or more tables inside a single CSV
>> file, or uses Excel spreadsheets or DSPL packages or SDF packages or NetCDF
>> or the various other formats that people have invented, we need to keep note
>> of these so that whatever solution and processors we create will work with
>> these files.
>> 
>> APPROACH 2: Invent something new
>> 
>> We are trying to create a new format that would enable publishers to publish
>> tabular data in a more regular way while preserving the same meaning, to
>> make it easier for consumers of that data.
>> 
>> We need to do a survey of what tabular data exists so that we can see what
>> publishers are trying to say with their data, but the format that they are
>> currently publishing that data in is irrelevant because we are going to
>> invent a new format. When we find data that includes metadata about tables
>> and cells, or groups or has cross references between tables, or has columns
>> whose values are of different types, we need to keep note of these so that
>> we ensure the format we create can capture that meaning.
>> 
>> We also need to understand existing data so that we have a good backwards
>> compatibility story: it would be useful if the format we invent can be used
>> with existing tools, and if existing data didn't have to be changed very
>> much to put it into the new format. But there will certainly be files that
>> do have to be changed, and sometimes substantially.
>> 
>> 
>> My focus is definitely on the second approach, as I think taking the first
>> approach is an endless and impossible task. But some recent mails and
>> discussion have made me think that some people are taking the first approach.
>> Any thoughts?
>> 
>> Cheers,
>> 
>> Jeni
>> --
>> Jeni Tennison
>> http://www.jenitennison.com/
>> 
>> 
> 
>
Received on Monday, 24 February 2014 18:25:51 UTC