Re: Consuming Documented DataSets Use Case [Was: DataStore, Layers and legacy files] from Jeni Tennison on 2015-03-06 (public-csv-wg@w3.org from March 2015)

From: Jeni Tennison <jeni.tennison@gmail.com>
Date: Fri, 6 Mar 2015 17:09:50 +0000
To: Paul Klink <paul@klink.id.au>, public-csv-wg@w3.org
Message-ID: <etPan.54f9df5e.79e2a9e3.192@jenit.local>

Hi Paul,

We’ve tried in our documents (the most pertinent editor’s draft is here [1]) to separate the concerns of (1) parsing tabular data files into a tabular data model and (2) processing (e.g. display, conversion) of that model.

The tabular data model supports having multiple associated tables, which is an important part of your use case. It would be possible for someone (but not us; it’s not in scope) to define a generic format for capturing multiple tables in a single file, and how that format is mapped into the tabular data model that we’ve defined. Implementations could then adopt that format and apply the metadata which we’ve defined to the resulting models.
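
As a rough illustration only (this is not something the working group defines), here is a minimal Python sketch of such a mapping. It assumes a hypothetical combined format in which the first column of each line names the table the line belongs to, and the first line seen for each table is its header row. The `Table` and `TableGroup` classes and the `parse_combined` function are invented names, loosely modelled on the tables and groups of tables in the tabular data model.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Table:
    """One table: a name, column names and rows of string cells."""
    name: str
    columns: list
    rows: list = field(default_factory=list)


@dataclass
class TableGroup:
    """A set of associated tables, keyed by table name."""
    tables: dict = field(default_factory=dict)


def parse_combined(text):
    """Split a combined file into separate tables.

    Assumption: column 0 is the table key, and the first line seen
    for a given key is that table's header row.
    """
    group = TableGroup()
    for line in csv.reader(io.StringIO(text)):
        if not line:
            continue  # skip blank lines
        key, cells = line[0], line[1:]
        if key not in group.tables:
            group.tables[key] = Table(name=key, columns=cells)
        else:
            group.tables[key].rows.append(cells)
    return group


# Made-up sample data in the assumed combined format.
SAMPLE = """\
customer,id,name
order,id,customer_id,total
customer,1,Alice
order,10,1,25.50
order,11,1,4.00
"""

group = parse_combined(SAMPLE)
```

Once the tables are separated like this, each one could be described by its own metadata, independently of how the combined file was laid out.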

So we kind of support the use case that you’re talking about, just not end to end because it’s not in scope for us to define the formats for expressing tabular data.

Does that make sense?

Cheers,

Jeni

[1] http://w3c.github.io/csvw/syntax/

-----Original Message-----
From: Paul Klink <paul@klink.id.au>
Reply: Paul Klink <paul@klink.id.au>
Date: 5 March 2015 at 22:57:44
To: public-csv-wg@w3.org <public-csv-wg@w3.org>
Subject:  Consuming Documented DataSets Use Case [Was: DataStore, Layers and legacy files]

> Hi David,
>  
> Let me take one step back and describe the Use Case I am trying to address.
>  
> As a developer, I have often written software to import CSV files
> supplied by other organisations. Typically these CSV files contain
> extracts from the other organisation's internal databases. At our end
> we would read the data from these files, process it and then normally
> update our own databases with it.
>  
> In my experience, most of these CSV files contain a single table.
> However sometimes the other organisation wants to provide us with
> multiple tables of data. In this case the multiple tables are placed in
> one file with one column being a key to identifying which table each
> line belongs to. By having all the tables in one file, they can safely
> reference each other.
>  
> The other organisation will provide a document which describes the
> content of the data and how it is formatted in the file. The document
> may be sent to us specifically or published via the web. The CSV file
> itself is made available via the web.
>  
> The data in these CSV files is normally very application specific. I
> don't think there would be much value in having a schema to cover the
> content. Any patterns found between these files would be very generic.
> The work involved in providing support for any such schema is unlikely
> to be worth the effort for either the producer of the data or the consumer.
>  
> However, obviously, there is a pattern in the formatting of these
> files. A schema which identifies the formatting makes it far easier
> to produce and consume the data (at a reading and writing level).
> Ideally the schema allows the data to be accessed in the same way we
> access a database. Pseudo code for reading data in this way is shown
> below:
>  
> Open file (and associated meta)
> while not at end of file
> begin
>   Read record
>   Read values from fields
>   Do stuff with values
>   Go to next record
> end
> Close File
>  
> or, in the case of a text file containing multiple tables
>  
> Open file (and associated meta)
> while not at end of file
> begin
>   Read record
>   if record belongs to a new table
>   begin
>     Initialise processing for table
>   end
>   Read values from fields
>   Do stuff with values
>   Go to next record
> end
> Close File
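
The multi-table loop above might look roughly like this in Python. This is a sketch under stated assumptions, not a proposed API: the `META` dictionary, its layout and the `read_records` function are hypothetical, and the file is assumed to carry the table key in its first column with no header rows.

```python
import csv
import io
from decimal import Decimal

# Hypothetical metadata: for each table key, the column names and a
# converter for each column. A real metadata document would be loaded
# from a file published alongside the data.
META = {
    "customer": [("id", int), ("name", str)],
    "order": [("id", int), ("customer_id", int), ("total", Decimal)],
}


def read_records(stream, meta):
    """Yield (table_name, record) pairs with typed field values.

    Raises ValueError if a line names an unknown table or has the
    wrong number of fields.
    """
    for line_no, line in enumerate(csv.reader(stream), start=1):
        if not line:
            continue  # skip blank lines
        key, cells = line[0], line[1:]
        if key not in meta:
            raise ValueError(f"line {line_no}: unknown table {key!r}")
        columns = meta[key]
        if len(cells) != len(columns):
            raise ValueError(f"line {line_no}: expected {len(columns)} fields")
        yield key, {name: conv(cell) for (name, conv), cell in zip(columns, cells)}


# Made-up sample data in the assumed format.
DATA = "customer,1,Alice\norder,10,1,25.50\norder,11,1,4.00\n"

totals = {}           # customer id -> sum of order totals
current_table = None
for table, record in read_records(io.StringIO(DATA), META):
    if table != current_table:
        current_table = table  # initialise processing for a new table here
    if table == "order":
        cid = record["customer_id"]
        totals[cid] = totals.get(cid, Decimal("0")) + record["total"]
```

The calling loop only ever sees table names, field names and typed values; everything about delimiters and column positions stays inside `read_records` and the metadata.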
>  
> With the above code, no knowledge is required of how the text file
> containing the data (CSV or other text variant) is formatted. This will
> make it far easier for programmers to import files. It also provides
> more built-in checking to confirm that the data is being correctly
> interpreted. For example, columns are correctly chosen and data types
> are correctly interpreted. Another benefit is that the producer of the
> data no longer has to document the format. The net effect is a
> significant productivity improvement in working with these files.
>  
> To make this work, the organisation producing the files would need to
> generate the Meta. They will probably only do this if:
> 1) The schema is specified by a well-known, widely adopted standard
> 2) It is very easy to implement (say less than 30 minutes for a simple file)
>  
> I would hazard a guess that this Use Case is the most common use of CSV
> files on the Web. It probably will remain so (at least in terms of
> number of organisations) for the foreseeable future.
>  
> It would be great if “CSV on the Web” could cover this Use Case. It
> seems to me that it is almost there. It would only need to be slightly
> extended to cover a larger variety of formatting of text files. While
> “CSV on the Web”'s charter talks a lot about meta data describing the
> content of CSV files, it states that the primary focus is to associate
> Meta data with CSV files. I would like to think that providing
> sufficient Meta data so that existing text files can be read (and
> written) in a format independent way, would provide the foundation of a
> schema and fall within the scope of the charter.
>  
> Another long post from me. Hopefully you find it constructive.
>  
> Regards
> Paul
>  

--  
Jeni Tennison
Received on Friday, 6 March 2015 17:10:20 UTC