Re: CSVs vs Tables

Hi Jeni,

I must admit that, so far, my understanding was that the metadata we define is metadata on the tables only, not on the CSV files. Indeed, that is what the current document says (Section 2.1):

[[[
The metadata defined in this specification is used to annotate an existing annotated table as defined in [CSV-MODEL]. Annotated tables form the basis for all further processing, such as validating or displaying the table. 
]]]

The way I read this is that the metadata content is associated with the abstract model. It is of course possible that the metadata document itself contains some additional parsing information, along the lines of section 5 in the model document; but, just as section 5 is informative, we would not define that metadata ourselves (because we do not normatively define the parser).

I just want to check whether I am really missing something if I say that what we have today is option #3 below, except that the flags in section 5 are translated into additional JSON properties (although they cannot be normative, because we do not define parsing normatively :-(). I actually have no problem with that; I just want to make sure we are really talking about the same thing...
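
Just to make that reading concrete, here is a rough sketch of what I have in mind; the property names are simply copied from your examples below and are purely illustrative (we have not defined any of them):

  {
    source: {              // informative only: the section 5 flags turned into JSON properties
      @id: trees.csv,
      header: true,
      encoding: ISO-8859-1,
      separator: ;,
      escape: \"
    }
    // everything else in the metadata annotates the abstract table of [CSV-MODEL]
  }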

In this respect, I am not sure about choice #2. Indeed:

- either we do not define anything about parsing, which is #1
- or we do translate what is in the model into metadata, in which case keeping section 5 of the model in sync with the metadata seems like the proper thing to do, which is choice #3...

However, I do not really understand your examples below, hence there may be some misunderstanding. I am not sure what you mean by:

notes: [{
      type: row,
      number: 3,
      ...
    }, {
      type: cell,
      row: 4,
      column: 5,
      ...
    }]

or

  notes: {
      #row=3: { … }
      #cell=4,5: { … }
    }

Which parsing options are you referring to? The only thing I see is a number of parameters under the

source: {
      @id: trees.csv#cell=3,4-31453,19 // a region of a CSV file
      header: true,
      encoding: ISO-8859-1,
      separator: ;,
      escape: \"
      ....
    }

object...

Ivan

On 02 Jun 2014, at 15:01 , Jeni Tennison <jeni@jenitennison.com> wrote:

> Hi,
> 
> I discussed the scope of the metadata format with Rufus last week and it raised the question about what assumptions we can make about the “CSV” documents that the metadata refers to. The three options are something like:
> 
> 1. We assume a 1:1 relationship between tables and CSV files (and between rows/columns/cells in the table and rows/columns/cells in the CSV file), and that the CSV files will be CSV+ (comma separated, UTF-8, no padding etc). This enables us to have a pretty simple metadata format that just includes something like:
> 
>   {
>     @id: trees.csv,
>     header: true,
>     notes: {
>       #row=3: { … }
>       #cell=4,5: { … }
>     }
>   }
> 
> The only information that the metadata might need to contain to inform the parsing of the CSV file is whether or not it contains a (single) header line. The advantages are that this is simple and it encourages people to publish well-formed CSV files. The fact that the rows/columns/cells are 1:1 in the table and CSV file makes referencing them easy using fragment identifiers. The disadvantage is that it excludes a whole bunch of “CSV” files that aren’t well-formed.
> 
> 2. We assume a 1:1 relationship between tables and CSV files (and between rows/columns/cells in the table and rows/columns/cells in the CSV file), but allow the “CSV” files some latitude in how they’re formed (specifically, enabling different separators & escape sequences, and different encodings, but not supporting padding). The metadata format then needs to have properties that provide some information about how the file should be parsed, so something like:
> 
>   {
>     @id: trees.csv
>     parsing: {
>       header: true,
>       encoding: ISO-8859-1,
>       separator: ;,
>       escape: \"
>     }
>     notes: {
>       #row=3: { … }
>       #cell=4,5: { … }
>     }
>   }
> 
> This is still fairly simple, but we do have to specify some parsing options. The fact that the rows/columns/cells are 1:1 in the table and CSV file makes referencing them easy using fragment identifiers. It gives people more latitude in the CSV files that they publish and enables metadata to be used to describe some (but not all) legacy tabular data.
> 
> 3. We don’t assume a 1:1 relationship between tables and CSV files, but support the full set of parsing options that are outlined in the Tabular Model document. This means the references to both the source data file and the rows/columns/cells within it have to be done slightly differently, for example:
> 
>   {
>     source: {
>       @id: trees.csv#cell=3,4-31453,19 // a region of a CSV file
>       header: true,
>       encoding: ISO-8859-1,
>       separator: ;,
>       escape: \"
>     }
>     notes: [{
>       type: row,
>       number: 3,
>       ...
>     }, {
>       type: cell,
>       row: 4,
>       column: 5,
>       ...
>     }]
>   }
> 
> This is great in terms of being general purpose, but it involves more specification work and I think will be prone to confusion about the mismatch between row/column numbers in the “CSV” file vs those in the described table.
> 
> 
> I am in favour of #1 or #2 and not doing #3. #2 seems to meet the use cases / requirements that we’ve gathered so far (excepting the deferred requirement for multiple heading rows). I wanted to double-check that this meets everyone’s expectations.
> 
> Jeni
> --  
> Jeni Tennison
> http://www.jenitennison.com/
> 


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
WebID: http://www.ivan-herman.net/foaf#me

Received on Monday, 2 June 2014 20:11:05 UTC