Re: Consuming Documented DataSets Use Case [Was: DataStore, Layers and legacy files]

Hi Paul,

I'm not in the "CSV on the Web" working group -- I'm just an interested 
bystander :) -- so I think the working group would have to comment on 
your use case.  Here is the use case document that the group has 
produced already:
http://www.w3.org/TR/csvw-ucr/

David Booth

On 03/05/2015 05:57 PM, Paul Klink wrote:
> Hi David,
>
> Let me take one step back and describe the Use Case I am trying to address.
>
> As a developer, I have often written software to import CSV files
> supplied by other organisations.  Typically these CSV files contain
> extracts from the other organisation's internal databases.  At our end
> we would read the data from these files, process it and then normally
> update our own databases with it.
>
> In my experience, most of these CSV files contain a single table.
> However sometimes the other organisation wants to provide us with
> multiple tables of data.  In this case the multiple tables are placed in
> one file with one column being a key to identifying which table each
> line belongs to.  By having all the tables in one file, they can safely
> reference each other.
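>
> For example (with entirely made-up data and table names), the first
> column of each line might name the table it belongs to, so that a
> customer table and an order table can share one file and the orders can
> reference the customers by key:
>
>    CUSTOMER,1001,Acme Pty Ltd
>    CUSTOMER,1002,Example Corp
>    ORDER,5001,1001,2015-03-02
>    ORDER,5002,1002,2015-03-04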
>
> The other organisation will provide a document which describes the
> content of the data and how it is formatted in the file.  The document
> may be sent to us specifically or published via the web. The CSV file
> itself is made available via the web.
>
> The data in these CSV files is normally very application specific.  I
> don't think there would be much value in having a schema to cover the
> content.  Any patterns found between these files would be very generic.
> The work involved in providing support for any such schema is unlikely
> to be worth the effort for either the producer of the data or the consumer.
>
> However, there is obviously a pattern in the formatting of these
> files.  A schema which identifies the formatting makes it far easier
> to produce and consume the data (at a reading and writing level).
> Ideally the schema allows the data to be accessed in the same way we
> access a database.  Pseudo code for reading data is shown below:
>
> Open file (and associated meta)
> while <not at end of file>
> begin
>    Read record
>    Read values from fields
>    Do stuff with values
>    Go to next record
> end
> Close File
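>
> (As a rough illustration only -- in Python, with a hypothetical
> metadata dictionary standing in for the "associated meta", and assuming
> the file has a header row whose names match the metadata -- the
> single-table case might look something like this:)
>
>    import csv
>
>    # Hypothetical metadata describing the file's format and column types.
>    meta = {
>        "delimiter": ",",
>        "columns": [
>            {"name": "id", "datatype": "integer"},
>            {"name": "name", "datatype": "string"},
>        ],
>    }
>    casts = {"integer": int, "string": str}
>
>    with open("data.csv", newline="") as f:         # Open file (and meta)
>        reader = csv.DictReader(f, delimiter=meta["delimiter"])
>        for record in reader:                        # Read record
>            values = {c["name"]: casts[c["datatype"]](record[c["name"]])
>                      for c in meta["columns"]}      # Read values from fields
>            print(values)                            # Do stuff with values
>    # File is closed automatically at the end of the with-block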
>
> Or, in the case of a text file containing multiple tables, the pseudo
> code becomes:
>
> Open file (and associated meta)
> while <not at end of file>
> begin
>    Read record
>    if <start of new table>
>    begin
>      Initialise processing for table
>    end;
>    Read values from fields
>    Do stuff with values
>    Go to next record
> end
> Close File
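>
> (Again only a sketch under the same assumptions, here with a
> hypothetical metadata dictionary that names the columns of each table
> and says which field identifies the table each line belongs to:)
>
>    import csv
>
>    # Hypothetical metadata: field 0 names the table; the remaining
>    # fields are described per table.
>    meta = {
>        "keyColumn": 0,
>        "tables": {
>            "CUSTOMER": ["id", "name"],
>            "ORDER": ["id", "customer_id", "date"],
>        },
>    }
>
>    current_table = None
>    with open("data.csv", newline="") as f:          # Open file (and meta)
>        for record in csv.reader(f):                 # Read record
>            table = record[meta["keyColumn"]]
>            if table != current_table:               # start of new table
>                current_table = table                # initialise processing
>            names = meta["tables"][table]
>            values = dict(zip(names, record[1:]))    # Read values from fields
>            print(table, values)                     # Do stuff with values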
>
> With the above code, no knowledge is required of how the text file
> containing the data (CSV or other text variant) is formatted.  This
> will make it far easier for programmers to import files.  It also
> provides more built-in checking to confirm that the data is being
> correctly interpreted; for example, that the correct columns are chosen
> and that data types are correctly interpreted.  Another benefit is that
> the producer of the data no longer has to document the format.  The net
> effect is a significant productivity improvement in working with these
> files.
>
> To make this work, the organisation producing the files would need to
> generate the Meta.  They will probably only do this if:
> 1) The schema is specified by a well-known and widely adopted standard
> 2) It is very easy to implement (say, less than 30 minutes for a simple file)
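>
> (For illustration, and only from my rough reading of the working
> group's draft metadata vocabulary -- not an authoritative example -- a
> minimal metadata file for a simple single-table CSV might look roughly
> like this:)
>
>    {
>      "@context": "http://www.w3.org/ns/csvw",
>      "url": "data.csv",
>      "tableSchema": {
>        "columns": [
>          {"name": "id", "datatype": "integer"},
>          {"name": "name", "datatype": "string"}
>        ]
>      }
>    }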
>
> I would hazard a guess that this Use Case is the most common use of CSV
> files on the Web.  It probably will remain so (at least in terms of
> number of organisations) for the foreseeable future.
>
> It would be great if “CSV on the Web” could cover this Use Case.  It
> seems to me that it is almost there.  It would only need to be slightly
> extended to cover a larger variety of text file formats.  While the
> “CSV on the Web” charter talks a lot about meta data describing the
> content of CSV files, it states that the primary focus is to associate
> meta data with CSV files.  I would like to think that providing
> sufficient meta data so that existing text files can be read (and
> written) in a format-independent way would provide the foundation of a
> schema and fall within the scope of the charter.
>
> Another long post from me.  Hopefully you find it constructive.
>
> Regards
> Paul

Received on Thursday, 5 March 2015 23:42:01 UTC