Consuming Documented DataSets Use Case [Was: DataStore, Layers and legacy files]

Hi David,

Let me take one step back and describe the Use Case I am trying to address.

As a developer, I have often written software to import CSV files 
supplied by other organisations.  Typically these CSV files contain 
extracts from the other organisation's internal databases.  At our end 
we would read the data from these files, process it and then normally 
update our own databases with it.

In my experience, most of these CSV files contain a single table. 
However, sometimes the other organisation wants to provide us with 
multiple tables of data.  In this case the multiple tables are placed in 
one file, with one column acting as a key that identifies which table 
each line belongs to.  By having all the tables in one file, they can 
safely reference each other.
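
For illustration only (the table names and values below are invented), 
such a multi-table file might look like this, with the first column 
identifying the table each line belongs to:

CUSTOMER,1001,Acme Ltd
CUSTOMER,1002,Example Co
ORDER,5001,1001,2015-03-02
ORDER,5002,1002,2015-03-03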

The other organisation will provide a document which describes the 
content of the data and how it is formatted in the file.  The document 
may be sent to us specifically or published via the web. The CSV file 
itself is made available via the web.

The data in these CSV files is normally very application specific.  I 
don't think there would be much value in having a schema to cover the 
content.  Any patterns found between these files would be very generic, 
and the work involved in supporting any such schema is unlikely to be 
worth the effort for either the producer of the data or the consumer.

However, there is obviously a pattern in the formatting of these 
files.  A schema which identifies the formatting makes it far easier to 
produce and consume the data (at the reading and writing level).  
Ideally the schema allows the data to be accessed in the same way we 
access a database.  Pseudo code for reading the data is shown below:

Open file (and associated meta)
while <not at end of file>
begin
   Read record
   Read values from fields
   Do stuff with values
   Go to next record
end
Close File

or, in the case of a text file containing multiple tables

Open file (and associated meta)
while <not at end of file>
begin
   Read record
   if <start of new table>
   begin
     Initialise processing for table
   end;
   Read values from fields
   Do stuff with values
   Go to next record
end
Close File
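
To make this concrete, here is a rough sketch in Python of the 
multi-table case.  The metadata layout below (encoding, delimiter, 
table-key column, per-table column names and data types) is purely 
illustrative; it just stands in for whatever meta data a “CSV on the 
Web” schema would actually provide.

import csv

# Illustrative metadata describing how the file is formatted.  The property
# names here are invented for this sketch, not taken from any published schema.
meta = {
    "encoding": "utf-8",
    "delimiter": ",",
    "tableKeyIndex": 0,  # which column identifies the table a line belongs to
    "tables": {
        "CUSTOMER": [("id", int), ("name", str)],
        "ORDER":    [("id", int), ("customer_id", int), ("order_date", str)],
    },
}

def read_records(path, meta):
    """Yield (table_name, values) pairs without the caller knowing the layout."""
    with open(path, newline="", encoding=meta["encoding"]) as f:
        for row in csv.reader(f, delimiter=meta["delimiter"]):
            if not row:
                continue
            table = row[meta["tableKeyIndex"]]
            columns = meta["tables"][table]
            raw_fields = row[meta["tableKeyIndex"] + 1:]
            # Apply the declared data types; a wrong column or type fails here.
            values = {name: cast(raw) for (name, cast), raw in zip(columns, raw_fields)}
            yield table, values

current_table = None
for table, values in read_records("extract.csv", meta):
    if table != current_table:
        current_table = table   # start of a new table: initialise processing
    print(table, values)        # "do stuff with values"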

With the above code, no knowledge is required of how the text file 
containing the data (CSV or other text variant) is formatted.  This will 
make it far easier for programmers to import such files.  It also 
provides more built-in checking to confirm that the data is being 
correctly interpreted: for example, that the right columns are chosen 
and that data types are handled correctly.  Another benefit is that the 
producer of the data no longer has to document the format.  The net 
effect is a significant productivity improvement when working with 
these files.

To make this work, the organisation producing the files would need to 
generate the Meta.  They will probably only do this if:
1) the schema is specified by a well-known and widely adopted standard, and
2) it is very easy to implement (say, less than 30 minutes for a simple 
file; see the sketch below).
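
As a rough illustration of how small that effort could be, the producer 
might write the meta data out as a small JSON file alongside the CSV.  
The property names and the file naming convention below are invented for 
this sketch; the real ones would come from whatever schema is eventually 
standardised.

import json

# Invented metadata describing the extract above; property names are
# illustrative only, not from any published schema.
meta = {
    "url": "extract.csv",
    "encoding": "utf-8",
    "delimiter": ",",
    "tableKeyIndex": 0,
    "tables": {
        "CUSTOMER": [{"name": "id", "datatype": "integer"},
                     {"name": "name", "datatype": "string"}],
        "ORDER": [{"name": "id", "datatype": "integer"},
                  {"name": "customer_id", "datatype": "integer"},
                  {"name": "order_date", "datatype": "date"}],
    },
}

with open("extract-metadata.json", "w", encoding="utf-8") as f:
    json.dump(meta, f, indent=2)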

I would hazard a guess that this Use Case is the most common use of CSV 
files on the Web.  It will probably remain so (at least in terms of the 
number of organisations involved) for the foreseeable future.

It would be great if “CSV on the Web” could cover this Use Case.  It 
seems to me that it is almost there; it would only need to be slightly 
extended to cover a wider variety of text file formats.  While the 
“CSV on the Web” charter talks a lot about meta data describing the 
content of CSV files, it states that the primary focus is to associate 
Meta data with CSV files.  I would like to think that providing 
sufficient Meta data so that existing text files can be read (and 
written) in a format-independent way would provide the foundation of a 
schema and fall within the scope of the charter.

Another long post from me.  Hopefully you find it constructive.

Regards
Paul
