RE: additional tool testing of CSV files in Syntax document

Agreed that we need a canonical reference. To verify what I was seeing in Excel, I had to view one of the files (e.g. test-utf8-bom.csv) in a competent text editor (taking account of the BOM).

I think the only way to guarantee that what you see is correct for each cell is to split each cell out into a separate file, as you recommend. We assume UTF-8, so let's be explicit and assert that our expected test results are encoded in UTF-8 and that none of the files includes a BOM.
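As a rough illustration of that assertion, here is a minimal Python sketch that rejects a reference file if it starts with a UTF-8 BOM or is not valid UTF-8. The function name assert_utf8_no_bom is hypothetical, not something we have agreed on.

```python
import codecs


def assert_utf8_no_bom(data: bytes) -> str:
    """Decode the bytes of one reference file and return the text.

    Hypothetical helper: raises ValueError if the data starts with a
    UTF-8 BOM or is not valid UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError("reference file must not start with a UTF-8 BOM")
    # bytes.decode raises UnicodeDecodeError, a subclass of ValueError,
    # when the input is not valid UTF-8.
    return data.decode("utf-8")
```

A test harness could run this over every expected file before comparing any cell values, so that encoding problems are reported separately from parsing problems.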

Also, I would note that we probably don't need to repeat this for _every_ test file; the existing test files used in the Syntax doc _should_ all render the same (i.e. we only need one set of "expected" files for each batch of tests).

File naming recommendations: I agree with the proposed ".txt" suffix. I suggest using the CSV fragment identifier syntax <http://tools.ietf.org/html/draft-hausenblas-csv-fragment-02#section-1.2> to name each file. Thus the "test[-{encoding}].csv" files would have reference files such as "test#cell={row},{col}.txt". See attached example.
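To show how the naming could work in practice, here is a minimal Python sketch that splits a CSV file into one reference file per cell. Several things in it are assumptions: the function names are hypothetical, rows and columns are numbered from 1, and Python's csv module stands in for whichever parser we treat as the reference.

```python
import csv
import io
from pathlib import Path


def reference_name(stem: str, row: int, col: int) -> str:
    """Build a name like 'test#cell=2,3.txt' (1-based numbering is assumed)."""
    return f"{stem}#cell={row},{col}.txt"


def write_expected_cells(csv_bytes: bytes, stem: str, out_dir: str) -> list:
    """Write one UTF-8 file (no BOM) per cell and return the file names."""
    # 'utf-8-sig' drops a leading BOM if present, so the BOM and non-BOM
    # variants of a test file produce identical reference files.
    text = csv_bytes.decode("utf-8-sig")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []
    reader = csv.reader(io.StringIO(text, newline=""))
    for r, row in enumerate(reader, start=1):
        for c, value in enumerate(row, start=1):
            name = reference_name(stem, r, c)
            # Write bytes directly so no newline translation or BOM is added.
            (out / name).write_bytes(value.encode("utf-8"))
            names.append(name)
    return names
```

Because the encoding suffix is left out of the stem, one set of files named "test#cell=..." can serve the whole batch of "test[-{encoding}].csv" files.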

Jeremy

-----Original Message-----
From: Dan Brickley [mailto:danbri@google.com] 
Sent: 19 February 2014 14:59
To: Tandy, Jeremy
Cc: public-csv-wg@w3.org
Subject: Re: additional tool testing of CSV files in Syntax document

On 19 February 2014 10:58, Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk> wrote:
> All – I’ve checked how Excel 2007 on Win 7 Enterprise renders the
> files referenced in the Syntax document. I’ve created a wiki page
> (linked from “Tools”) with the results:
> https://www.w3.org/2013/csvw/wiki/MS_Excel_compatibility_tests


This (and the tests in git,
https://github.com/w3c/csvw/tree/gh-pages/syntax ) are great.

Slight aside but, ...

I was thinking of ways in which we might express tests against CSV files. Would this kind of structure make sense?

For each CSVish file, e.g.
https://github.com/w3c/csvw/blob/gh-pages/syntax/test-utf8.csv


Have a corresponding tests directory, e.g. test-utf8-expected/

and then within that, one file per cell, so filenames and contents something like this:

cell_labels_2.txt: test number

cell_0_0.txt: Я могу есть стекло, оно мне не вредит.

cell_1_2.txt: 2014-02-11

cell_2_0.txt: ""never again""

we said

In other words, the bytes (assumed UTF-8?) in each text file would correspond to the value of the cell indicated by the filename. An alternative would be to have some other canonical, easy-to-parse tabular data notation (similar to how we used N-Triples in the old RDF WG).

We could then use different CSV parsers and check whether the expected contents match the parsed results.
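A minimal Python sketch of that check follows. It is only an illustration under stated assumptions: Python's csv module stands in for the parser under test, the function name is hypothetical, the files follow the zero-based cell_{row}_{col}.txt pattern from the examples above, and label files such as cell_labels_2.txt are not handled.

```python
import csv
import io
from pathlib import Path


def compare_with_expected(csv_text: str, expected_dir: str) -> list:
    """Return (row, col, parsed, expected) for every cell that differs.

    'expected' is None when no reference file exists for a parsed cell.
    """
    mismatches = []
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    for r, row in enumerate(reader):
        for c, parsed in enumerate(row):
            path = Path(expected_dir) / f"cell_{r}_{c}.txt"
            if not path.exists():
                mismatches.append((r, c, parsed, None))
                continue
            # Compare decoded bytes so newline handling cannot alter the value.
            expected = path.read_bytes().decode("utf-8")
            if parsed != expected:
                mismatches.append((r, c, parsed, expected))
    return mismatches
```

An empty result means every parsed cell matched its reference file. The sketch does not report reference files that have no corresponding parsed cell; a fuller harness would want that check too.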

</thinking out loud>

Dan

Received on Wednesday, 19 February 2014 16:20:35 UTC