Re: CSV test cases

One problem with sampling existing CSV files is that they’re quite
likely to already be well structured, and limited in what they do by
the existing constraints of the CSV format.

What’s arguably more useful is to sample the range of Excel files that
have been published, to see if there's more that needs to be
supported. To start with, I’ve produced a list of URLs of Excel files
that have been published as supporting information for articles on
nature.com: https://gist.github.com/hubgit/8954821/

(Run `wget -i https://gist.github.com/hubgit/8954821/raw/nature-xls.txt`
to fetch them all).

These files show the wide range of structures that authors actually
add to tabular data, many of which are possible in HTML tables but not
in CSV files. Perhaps a JSON file accompanying a CSV file could cover
some of these features?

Examples of features found in Excel spreadsheets published as
supporting data for journal articles:

* Table description and comment rows (sometimes starting with #) at
the start of the sheet
* Multiple tables in the same sheet, with a title row for each table
* Merged cells, spanning multiple rows or columns
* Text formatting (bold, italic), e.g. for species names, or to show significance
* Cell formatting (background colours), to highlight grouping or patterns
* Caption (description), footer, footnotes
* Subheadings/subsections within a single table, often with indented headings
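
To make the JSON idea a little more concrete, here is a rough sketch in
Python of how a sidecar file might carry a few of the features listed
above (caption, comment rows, merged cells, text formatting, footnotes)
alongside a plain CSV file. Every key name in the JSON (`caption`,
`comment_rows`, `header_row`, `merged_cells`, `cell_styles`,
`footnotes`), the sample data and the `load_table` helper are made up
for illustration; none of them come from an existing specification.

```python
import csv
import io
import json

# Hypothetical sidecar metadata; the key names are assumptions, not a standard.
SIDECAR_JSON = json.dumps({
    "caption": "Table S1. Example counts per site",
    "comment_rows": [0],   # row indexes holding notes rather than data
    "header_row": 1,       # row index of the column headings
    "merged_cells": [{"row": 2, "col": 0, "rowspan": 2, "colspan": 1}],
    "cell_styles": [{"row": 2, "col": 1, "italic": True}],
    "footnotes": ["Counts are illustrative only."],
}, indent=2)

# Invented sample data: one comment row, one header row, two data rows.
CSV_TEXT = (
    "# illustrative data only\n"
    "site,species,count\n"
    "A,Homo sapiens,3\n"
    ",Mus musculus,5\n"
)


def load_table(csv_text, meta):
    """Return (header, data rows), applying the sidecar's structural hints."""
    rows = list(csv.reader(io.StringIO(csv_text)))

    # Copy the top-left value of each merged region into every cell it
    # spans, so each data row is self-contained after loading.
    for span in meta.get("merged_cells", []):
        value = rows[span["row"]][span["col"]]
        for r in range(span["row"], span["row"] + span["rowspan"]):
            for c in range(span["col"], span["col"] + span["colspan"]):
                rows[r][c] = value

    # Drop comment rows and the header row from the data.
    skip = set(meta.get("comment_rows", [])) | {meta["header_row"]}
    header = rows[meta["header_row"]]
    data = [row for i, row in enumerate(rows) if i not in skip]
    return header, data


header, data = load_table(CSV_TEXT, json.loads(SIDECAR_JSON))
print(header)
print(data)
```

Here the merged "A" cell is filled down into the second data row, and
the comment row is kept out of the data. Formatting such as italics is
only recorded in the sidecar, not applied, since CSV has nowhere to put
it.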

Alf

On 11 February 2014 16:21, Dan Brickley <danbri@google.com> wrote:
> On 11 February 2014 16:03, Jeni Tennison <jeni@jenitennison.com> wrote:
>> Of interest to this group, this work from Max Ogden on putting together a set of test cases for CSV parsers:
>>
>>   https://github.com/maxogden/csv-spectrum
>
> Oh, that's great. I went through the Open Office source tree last week
> looking for similar, but didn't find anything suitable.
>
> Dan
>
>> Jeni
>> --
>> Jeni Tennison
>> http://www.jenitennison.com/
>>
>

Received on Wednesday, 12 February 2014 12:54:55 UTC