Re: Scoping Question

Hi Alfredo,

I think I’m advocating an approach based on Postel’s Law: be conservative in what you send and liberal in what you accept.

Parsing the near-infinite variety of weird ways in which people publish tabular data is about being liberal in what you accept. Standardising how that tabular data is interpreted (with the understanding that some of the weird usages might lead to misinterpretation) would help interoperability, though from what I can tell most tools that import tabular data rely heavily on intelligent user configuration rather than sniffing (unlike HTML parsers).
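
To make the "liberal in what you accept" half concrete, here is a minimal sketch in Python of what sniffing could look like: guess the dialect from a sample rather than assuming RFC 4180, and tolerate the empty-row/empty-column padding we've seen in some of the example CSVs. The sample size, candidate delimiters and fallback are my own assumptions, not anything the group has agreed:

    import csv

    def read_tabular(path):
        with open(path, newline="", encoding="utf-8") as f:
            sample = f.read(4096)
            f.seek(0)
            try:
                # Guess delimiter/quoting from a sample of the file.
                dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
            except csv.Error:
                dialect = csv.excel  # fall back to plain comma-separated
            rows = list(csv.reader(f, dialect))
        # Tolerate padding: drop leading all-empty rows...
        while rows and not any(cell.strip() for cell in rows[0]):
            rows.pop(0)
        # ...and leading all-empty columns.
        while rows and all(row and not row[0].strip() for row in rows):
            rows = [row[1:] for row in rows]
        return rows

A real importer would still need user override on top of something like this, which is exactly the "intelligent user configuration" point above.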

My suggestion for step 1 is about working out what the conservative thing to send should be. Defining this is useful to help publishers move towards more regular publication of tabular data. I think it’s a simpler first step, but the two steps can be done in parallel.
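
On the "conservative in what you send" half, the kind of regular output I have in mind looks something like this sketch. The column names and values are invented for illustration, and the details (one header row, comma delimiter, CRLF line endings, minimal quoting, UTF-8) are just my reading of RFC 4180, not a group decision:

    import csv

    rows = [
        ("country", "year", "population"),
        ("GB", "2013", "64,100,000"),  # embedded comma forces quoting
    ]

    with open("population.csv", "w", newline="", encoding="utf-8") as f:
        # The default "excel" dialect: comma-delimited, CRLF, minimal quoting.
        csv.writer(f).writerows(rows)

That's deliberately boring: anything that round-trips through the default dialect of common CSV libraries seems like a reasonable baseline for "conservative".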

Jeni

------------------------------------------------------
From: Alfredo Serafini <seralf@gmail.com>
Reply: Alfredo Serafini <seralf@gmail.com>
Date: 23 February 2014 at 10:48:24
To: Jeni Tennison <jeni@jenitennison.com>
Subject: Re: Scoping Question

>
> Hi
> I have a question about the second step: don't you think that this
> should involve describing the basic steps for a parser?
> I'm thinking of what was done with HTML5, where the parsing model is
> described along with the syntax specification. Moreover, there are
> tools like Silk (which recently works on CSV as input too, if I'm not
> wrong) or OpenRefine, which can suggest strategies for a parsing
> algorithm.
>
> I ask just to be sure I'm understanding the directions.
>
> Alfredo
>
> 2014-02-23 18:57 GMT+01:00 Jeni Tennison:
>  
> > I agree with what Bill says below, but I do think that it’s worth
> > separating out the concerns, which I think we can do by:
> >
> > Step 1. Defining what we think should be best practice for
> > publishing tabular data, covering the desired semantics while trying
> > to be as backwards-compatible as possible.
> >
> > Step 2. Defining how the existing weird and wonderful ways in which
> > people use CSVs can be mapped into that best practice format.
> >
> > That’s basically the approach that I’ve taken in the syntax
> > document, but I haven’t covered all the weird and wonderful stuff
> > that we’ve seen in some of the example CSVs, such as tables padded
> > with empty leading rows and columns, or tables that start after a
> > number of lines of metadata, or files that contain several tables
> > in one.
> >
> > I’m happy to continue with it, but are there any volunteers for
> > taking over editing of the model & syntax document?
> >
> > More importantly, are there any volunteers who would like to start
> > work on a separate document for Step 2 as described above?
> >
> > Cheers,
> >
> > Jeni
> >
> > ------------------------------------------------------
> > From: Bill Roberts <bill@swirrl.com>
> > Reply: Bill Roberts <bill@swirrl.com>
> > Date: 23 February 2014 at 07:17:41
> > To: Jeni Tennison <jeni@jenitennison.com>
> > Subject: Re: Scoping Question
> >
> > >
> > > Reading the various contributions to the scoping discussion, I'm
> > > not sure there is much difference in practice between the two
> > > 'camps'.
> > >
> > > In approach 1, we're talking about encouraging users to publish
> > > their CSV files in a particular style, or following some set of
> > > best practices that the group will recommend, so perhaps using a
> > > subset of all the CSV approaches seen in the wild. Then adding
> > > metadata of some form to explain structure, semantics etc.
> > >
> > > In approach 2, we're talking about creating a new CSV-like format
> > > which is backwards-compatible with existing CSV tools, and which
> > > might end up looking like a dialect of current CSV plus a way to
> > > specify metadata.
> > >
> > > In Jeni's initial posing of the question, she seems to be
> > > associating with 'Approach 1' the objective of handling all kinds
> > > of CSV as seen in the wild, which creates some challenges in
> > > making the metadata format flexible enough. None of the responses
> > > I've seen suggests that we should support *all* kinds of CSV file.
> > >
> > > A solution that uses existing CSV features in a particular way,
> > > plus a metadata format of some sort, seems close to a consensus,
> > > if I interpret everyone's comments/intentions correctly.
> > >
> > > Regards
> > >
> > > Bill
> > >
> > > On 21 Feb 2014, at 17:31, Jeni Tennison wrote:
> > >
> > > > Hi,
> > > >
> > > > [Only just got net connection to enable me to send this.]
> > > >
> > > > A scoping question occurred to me during the call on Wednesday.
> > > >
> > > > There seem to be two approaches that we should explicitly choose
> > > > between.
> > > >
> > > > APPROACH 1: Work with what’s there
> > > >
> > > > We are trying to create a description / metadata format that
> > > > would enable us to layer processing semantics over the top of
> > > > all the various forms of tabular data that people publish, so
> > > > that it can be interpreted in a standard way.
> > > >
> > > > We need to do a survey of what tabular data exists in its
> > > > various formats so that we know what the description / metadata
> > > > format needs to describe. When we find data that uses different
> > > > separators, pads out the actual data using empty rows and
> > > > columns, incorporates two or more tables inside a single CSV
> > > > file, or uses Excel spreadsheets or DSPL packages or SDF
> > > > packages or NetCDF or the various other formats that people
> > > > have invented, we need to keep note of these so that whatever
> > > > solution and processors we create will work with these files.
> > > >
> > > > APPROACH 2: Invent something new
> > > >
> > > > We are trying to create a new format that would enable
> > > > publishers to publish tabular data in a more regular way while
> > > > preserving the same meaning, to make it easier for consumers of
> > > > that data.
> > > >
> > > > We need to do a survey of what tabular data exists so that we
> > > > can see what publishers are trying to say with their data, but
> > > > the format that they are currently publishing that data in is
> > > > irrelevant, because we are going to invent a new format. When
> > > > we find data that includes metadata about tables and cells, or
> > > > groups or has cross-references between tables, or has columns
> > > > whose values are of different types, we need to keep note of
> > > > these so that we ensure the format we create can capture that
> > > > meaning.
> > > >
> > > > We also need to understand existing data so that we have a good
> > > > backwards-compatibility story: it would be useful if the format
> > > > we invent could be used with existing tools, and if existing
> > > > data didn’t have to be changed very much to put it into the new
> > > > format. But there will certainly be files that do have to be
> > > > changed, and sometimes substantially.
> > > >
> > > > My focus is definitely on the second approach, as I think
> > > > taking the first approach is an endless and impossible task.
> > > > But some recent mails and discussion have made me think that
> > > > some people are taking the first approach. Any thoughts?
> > > >
> > > > Cheers,
> > > >
> > > > Jeni
> > > > --
> > > > Jeni Tennison
> > > > http://www.jenitennison.com/
> > > >
> >
> > --
> > Jeni Tennison
> > http://www.jenitennison.com/
> >
>  

--  
Jeni Tennison
http://www.jenitennison.com/
