Re: CSV Use cases from Tim Robertson [GBIF] on 2014-05-07 (public-csv-wg@w3.org from May 2014)

From: Tim Robertson [GBIF] <trobertson@gbif.org>
Date: Wed, 7 May 2014 13:41:57 +0200
To: Eric Stephan <ericphb@gmail.com>
Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>, Jeremy Tandy <jeremy.tandy@metoffice.gov.uk>, "Ceolin, D." <d.ceolin@vu.nl>, Ivan Herman <ivan@w3.org>
Message-Id: <B697CC36-69A6-4BB5-B9BF-61621EC1676A@gbif.org>

Thanks Eric
Please can you consider a use case as described below?
Attached are an example and a separate meta.xml, which I propose using in the use case to illustrate the requirements (see below).

I can prepare this as a GitHub pull request in HTML if you prefer, or adjust any of this based on your feedback.

Many thanks,
Tim


Use Case #21 - The Darwin Core Archive standard (GBIF)
(Contributed by Tim Robertson, GBIF, trobertson@gbif.org)

The Darwin Core Archive (DwC-A) standard (http://rs.tdwg.org/dwc/terms/guides/text/index.htm) is the primary format in use for exchange of evidence-based biodiversity data on the Global Biodiversity Information Facility (GBIF http://www.gbif.org) network.  The GBIF network spans more than 600 institutions, and has mobilised more than 435 million records (http://www.gbif.org/occurrence).  The DwC-A format is embedded in many software platforms, including web-based tools that allow mapping of arbitrary database schemas.  An online validator exists to verify the format (http://tools.gbif.org/dwca-validator/).
The DwC-A format is effectively a collection of related CSV files accompanied by a metafile (meta.xml) that describes the structure and content of the CSVs along with their relationships.  Together these files are zipped to allow transfer in a single HTTP transaction.  
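
As a rough illustration of how a metafile can describe the CSVs without touching them, the sketch below reads a small meta.xml and lists each file with its row class, key column and field terms. The sample XML is a hypothetical, cut-down descriptor written for this example (it is not the attached meta.xml), and the function name describe_files is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by DwC-A text-guide metafiles.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Hypothetical, minimal descriptor: one core file and one extension file.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon" fieldsTerminatedBy="," ignoreHeaderLines="1">
    <files><location>taxon.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
  <extension rowType="http://rs.gbif.org/terms/1.0/VernacularName" fieldsTerminatedBy="," ignoreHeaderLines="1">
    <files><location>vernacular.csv</location></files>
    <coreid index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
  </extension>
</archive>"""


def describe_files(meta_xml):
    """Return one dict per CSV file described in the metafile."""
    root = ET.fromstring(meta_xml)
    tables = []
    for elem in root:
        role = elem.tag.split("}")[-1]  # strip the "{namespace}" prefix
        if role not in ("core", "extension"):
            continue
        # The core file declares its key with <id>; extensions point back with <coreid>.
        key_tag = "dwc:id" if role == "core" else "dwc:coreid"
        key = elem.find(key_tag, NS)
        tables.append({
            "role": role,
            "row_type": elem.get("rowType"),
            "location": elem.findtext("dwc:files/dwc:location", namespaces=NS),
            "key_index": int(key.get("index")),
            "fields": {int(f.get("index")): f.get("term")
                       for f in elem.findall("dwc:field", NS)},
        })
    return tables


for table in describe_files(SAMPLE_META):
    print(table["role"], table["location"], table["row_type"])
```

The point of the sketch is only that the class of each row, the column meanings and the key columns all live in the metafile, so the CSVs themselves can stay unmodified.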

The key characteristics of the DwC-A format are:
- The ability to define the class of content contained within a single row 
- The ability to declare a relationship between files (only many-to-one relationships in a star schema are currently supported)
- The ability to describe remote CSV files through a meta file, without modifying the source files
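
The many-to-one star relationship in the list above can be sketched as follows: each extension row points at exactly one core row through a key column, and a consumer groups the extension rows under their core record. The CSV content, the function name join_star and the decision to report unmatched extension rows as "orphans" are all assumptions made for illustration; the DwC-A guide does not prescribe this handling.

```python
import csv
import io
from collections import defaultdict

# Hypothetical data: a core file and one extension file, each with a header row.
CORE_CSV = "id,scientificName\n1,Puma concolor\n2,Panthera leo\n"
EXT_CSV = "coreid,vernacularName\n1,cougar\n1,mountain lion\n2,lion\n3,orphan\n"


def join_star(core_csv, ext_csv, core_key=0, ext_key=0):
    """Group extension rows under the core row they reference.

    Returns (joined, orphans): joined maps each core key to
    (core_row, [extension_rows]); orphans lists extension rows whose
    key matches no core row (a referential-integrity problem).
    """
    core_rows = list(csv.reader(io.StringIO(core_csv)))[1:]  # skip header
    ext_rows = list(csv.reader(io.StringIO(ext_csv)))[1:]    # skip header

    by_core = defaultdict(list)
    for row in ext_rows:
        by_core[row[ext_key]].append(row)

    known = {row[core_key] for row in core_rows}
    joined = {row[core_key]: (row, by_core.get(row[core_key], []))
              for row in core_rows}
    orphans = [row for row in ext_rows if row[ext_key] not in known]
    return joined, orphans


joined, orphans = join_star(CORE_CSV, EXT_CSV)
for key, (core_row, ext_rows) in joined.items():
    print(key, core_row[1], [r[1] for r in ext_rows])
print("orphans:", orphans)
```

How a consumer should treat the orphan rows (reject the archive, warn, or ignore) is exactly the kind of detail that a foreign key requirement would need to settle.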

The next evolution of the DwC-A needs to consider the following key uses:
- More complex arrangements of data relationships (e.g. arbitrary relational models)
- Stronger typing of data formats (only date formats are currently declared)
[It is the hope of the DwC-A standard authors that the results of the CSV WG will mean the DwC-A can be deprecated, and efforts can be spent on developing tooling that supports the W3C CSV standard/recommendations]

Example: I suggest using the attached meta.xml to illustrate the relationships.

Requires: HeadingColumns, CellValueMicroSyntax, NonStandardFieldDelimiter, ExternalDataDefinitionResource, AnnotationAndSupplementaryInfo, AssociationOfCodeValuesWithExternalDefinitions, SyntacticTypeDefinition, PrimaryKey, ForeignKeyReferences, MissingValueDefinition, MultipleHeadingRows




On 06 May 2014, at 19:41, Eric Stephan <ericphb@gmail.com> wrote:

> Tim,
> 
> I agree, I do think it makes sense to include this in the use case
> document.  Thank you for sharing, and yes could you please provide
> example(s) to illustrate the use case?    Either text or images
> showing snapshots of the examples would be great.  I am copying the
> csv working group distribution list as well.
> 
> 
> Thank you,
> 
> Eric
> 
> On Tue, May 6, 2014 at 2:47 AM, Tim Robertson [GBIF]
> <trobertson@gbif.org> wrote:
>> Hi Jeremy, Davide, Eric,
>> 
>> Are you still accepting use cases for the CSV WG [1] document you are
>> compiling?
>> 
>> If so, I am keen to submit one for the GBIF network [2] and would start
>> documenting one along the lines of the existing 20 cases.  It is unlikely to
>> bring significant new requirements, but would encapsulate pretty much all of
>> the existing ones, and the devil is always in the detail with this kind of
>> thing (e.g. null handling, micro syntax, default value policies etc) - our
>> use case may well bring in some sub requirements / ideas.  Our case is more
>> closely aligned with Google DSPL [3] than the others however (e.g. an XML
>> document that serves to define the content found in CSVs and their
>> relationships - I assume these to be considered "CSV annotations").   I am a
>> little surprised not to see a G-DSPL on the list of use cases - should it be
>> one?  I would be happy to produce an example for that as well if considered
>> useful.  My slight worry is that unless cases such as ours and G-DSPL are
>> considered, the foreign key / primary key requirements *may* not be
>> adequately addressed consistently (e.g. referential integrity with respect
>> to well-formedness, expected behaviour on NULLs etc).
>> 
>> Thanks for the consideration - please do help advise me if my ideas /
>> proposals are off topic.
>> I should mention that maintaining a standard for CSV handling is part of my
>> core job, and fundamental to our infrastructure - this is a group of real
>> importance to our work.  I’d be happy to help in any way I can.
>> 
>> Best wishes,
>> Tim
>> 
>> [1] http://w3c.github.io/csvw/use-cases-and-requirements/index.html
>> [2] http://www.gbif.org/
>> [3] https://developers.google.com/public-data/

Attachments

application/zip attachment: dwca.zip
application/xml attachment: meta.xml

Received on Wednesday, 7 May 2014 11:42:26 UTC