Re: Model / Syntax Updates

Hi Yakov,

I take your point about the header and the names: this is a design decision that we have to make as a group. The main reasons I included it were:

  (a) as designed, the tabular data model (based on SQL) requires column names, and the header line is the only place to get them

  (b) existing good practice around CSV publication (ie from the Simple Data Format) requires headers in the CSV file

  (c) it’s a lot easier to write tools that can assume there is always a header line than tools that must cope with an optional one. It’s usually impossible to detect automatically whether a CSV file has a header line, and people don’t publish CSV with the correct "Content-Type: text/csv;header=present" header.
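
As an illustration of point (c), here is a minimal Python sketch of how a tool could read the optional "header" parameter that RFC 4180 defines for text/csv (values "present" or "absent"). The function name declared_header is hypothetical, and the sketch only covers the declared case: when the parameter is missing, the tool is back to guessing.

```python
from email.message import Message
from typing import Optional


def declared_header(content_type: str) -> Optional[bool]:
    """Return True/False if the media type declares a header line, else None.

    `declared_header` is a hypothetical helper name used for illustration.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns None when the parameter is not given.
    value = msg.get_param("header")
    if not isinstance(value, str):
        return None
    value = value.lower()
    if value == "present":
        return True
    if value == "absent":
        return False
    # Any other value is not one RFC 4180 registers, so treat it as unknown.
    return None


print(declared_header("text/csv; header=present"))
print(declared_header("text/csv; charset=utf-8"))
```

A tool that gets None back has no reliable signal and must either assume a header line or fall back on heuristics, which is the situation described above.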

An alternative design would be to have the data model say nothing about column names, and to treat column names as annotations (like types).
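
A rough sketch of what that alternative could look like, in Python. None of these names (Column, Table, build_table, the "name" annotation key) come from the draft; they are assumptions for illustration only. Columns are identified by position, and a name is just one annotation among others, so headerless files and duplicate names are both representable.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Column:
    index: int  # position is the column's identity in the core model
    annotations: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Table:
    rows: List[List[str]]
    columns: List[Column] = field(default_factory=list)

    def columns_named(self, name: str) -> List[Column]:
        # Names are annotations, so zero, one or many columns may match.
        return [c for c in self.columns if c.annotations.get("name") == name]


def build_table(records: List[List[str]], has_header: bool) -> Table:
    """Build a Table; the caller states whether the first record is a header."""
    width = max((len(r) for r in records), default=0)
    columns = [Column(index=i) for i in range(width)]
    if has_header and records:
        # Record each header cell as a "name" annotation on its column.
        for col, name in zip(columns, records[0]):
            col.annotations["name"] = name
        records = records[1:]
    return Table(rows=records, columns=columns)


table = build_table([["id", "id"], ["1", "2"]], has_header=True)
print([c.index for c in table.columns_named("id")])
```

The trade-off is that has_header has to come from somewhere outside the file, which brings back the detection problem in (c).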

Anyone else have any views on this?

Jeni

------------------------------------------------------
From: Yakov Shafranovich yakov-ietf@shaftek.org
Reply: Yakov Shafranovich yakov-ietf@shaftek.org
Date: 24 February 2014 at 11:06:26
To: Jeni Tennison jeni@jenitennison.com
Subject:  Re: Model / Syntax Updates

>  
> Is there a particular reason why a header is always required, column
> names must be unique, and case sensitive? The draft says it is because
> of SQL compatibility, but it may be important to elaborate as to why.
>  
> Specifically:
> - regarding case sensitivity, I am not sure if all SQL implementations
> are in fact case sensitive
> - regarding column name uniqueness - it sounds like we are assuming
> that the column name is the unique index to the data. However, I have
> often seen that the assumption in CSV files may be that the column
> *number*, not the *name*, serves as the index. This may also explain
> cases where the header is missing but the two systems communicating
> via CSV know the order of columns in the file and their significance.
>  
> Also, regarding the Unicode and end of line issues with RFC 4180 -
> those can be fixed via an updated RFC.
>  
> Thanks,
> Yakov
>  
>  
>  
> On Sun, Feb 23, 2014 at 1:23 PM, Jeni Tennison wrote:
> > Hi,
> >
> > Following the call last week, I have made some updates to the
> > "Syntax for Tabular Data on the Web" document at
> >
> > http://w3c.github.io/csvw/syntax/
> >
> > Namely:
> >
> > * I have separated out three levels of data model:
> >   * a core data model which is just tables/columns/rows/fields
> >   * an annotated data model in which each of these can be annotated
> >   * a grouped data model in which there are multiple tables in a group
> >
> > * I have stated that the ordering of columns is significant in the
> >   core data model
> >
> > I have defined the annotated data model extremely loosely: it just
> > says that tables, columns, rows, fields and regions can be annotated,
> > but it doesn't say anything about what those annotations might look
> > like (eg that one of the annotations might be the *type* of a value).
> > I think the direction I'd like to take that is to retain this very
> > loose definition and then state that there are certain annotations
> > (eg 'type', 'unique') that are understood by particular types of
> > applications (eg validators, converters) in particular ways. Does
> > that seem like a reasonable approach?
> >
> > I haven't made any attempt to tackle the syntax for annotated or
> > grouped tables as yet.
> >
> > Jeni
> > --
> > Jeni Tennison
> > http://www.jenitennison.com/
> >

--  
Jeni Tennison
http://www.jenitennison.com/

Received on Monday, 24 February 2014 17:15:52 UTC