CSV+ file lines with differing number of columns from Tim Finin on 2014-02-19 (public-csv-wg@w3.org from February 2014)

From: Tim Finin <finin@cs.umbc.edu>
Date: Tue, 18 Feb 2014 23:58:00 -0500
To: public-csv-wg@w3.org
Message-ID: <530439D8.7000309@cs.umbc.edu>

The current draft of Syntax for Tabular Data on the Web
stipulates (sec 3.3) that "Each line of a CSV+ file must contain
the same number of comma-separated values."  While this seems
reasonable, some existing use cases I'm familiar with allow for
CSV files with several types of lines that differ in their number
of columns.  Processing the CSV file requires detecting the line
type and also the presence of an optional terminal column.

Might we explore relaxing the constraint that the CSV file have
the same number of columns for each line?

In the 2013 NIST Cold Start Knowledge Base Population Task [1],
researchers submit output from their text information extraction
systems to NIST for evaluation as tab-separated files.  A line
consists of a triple (subj pred obj) and, for some predicates,
provenance information. Provenance includes a document ID and,
depending on the predicate, one or three pairs of string offsets
within the document.  Each line can also have an optional float as
a final column to represent a certainty measure.

The submission format does not require adding extra separators to
make all of the lines uniform or explicitly adding a default
value for the optional last column.

The following four lines show examples of a triple without any
annotations, an entity mention with provenance, an entity
relation with provenance, and a relation with both provenance and
confidence annotations.

   :e4 type         PER
   :e4 mention      "Bart" D00124 283-286
   :e4 per:siblings :e7    D00124 283-286 173-179 274-281
   :e4 per:age      "10"   D00124 180-181 173-179 182-191 0.9
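
To make the processing concrete, here is a minimal Python sketch of
how a reader might detect the line type and the optional final
column.  It is not the NIST reference parser.  The names Row and
parse_line are hypothetical, and the sketch assumes that fields are
separated by single tab characters (the examples above are shown
space-aligned for readability), that offsets are written as
start-end, and that a document ID never parses as a float.

```python
from typing import List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    subj: str
    pred: str
    obj: str
    doc_id: Optional[str]            # None when the line has no provenance
    offsets: List[Tuple[int, int]]   # zero, one, or three (start, end) pairs
    confidence: Optional[float]      # None when the final column is absent


def parse_line(line: str) -> Row:
    """Parse one tab-separated line with a variable number of columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least subj, pred and obj")
    subj, pred, obj = fields[:3]
    rest = fields[3:]

    # Assumption: the optional confidence is the only trailing field
    # that parses as a float ("283-286" and "D00124" do not).
    confidence = None
    if rest:
        try:
            confidence = float(rest[-1])
        except ValueError:
            pass
        else:
            rest = rest[:-1]

    # Whatever remains is provenance: a document ID, then offset pairs.
    doc_id = rest[0] if rest else None
    offsets = []
    for span in rest[1:]:
        start, end = span.split("-")
        offsets.append((int(start), int(end)))

    return Row(subj, pred, obj, doc_id, offsets, confidence)


row = parse_line(':e4\tper:age\t"10"\tD00124\t180-181\t173-179\t182-191\t0.9')
print(row.offsets, row.confidence)  # [(180, 181), (173, 179), (182, 191)] 0.9
```

The reader has to inspect each line's content to decide how many
columns it holds, which is the kind of processing a fixed column
count would otherwise make unnecessary.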

[1] http://www.nist.gov/tac/2013/KBP/ColdStart/guidelines/KBP2013_ColdStartTaskDescription_1.1.pdf

Received on Wednesday, 19 February 2014 04:58:23 UTC