- From: Tim Finin <finin@cs.umbc.edu>
- Date: Tue, 18 Feb 2014 23:58:00 -0500
- To: public-csv-wg@w3.org
The current draft of Syntax for Tabular Data on the Web stipulates (sec 3.3) that "Each line of a CSV+ file must contain the same number of comma-separated values." While this seems reasonable, some existing use cases I'm familiar with allow CSV files with several types of lines that differ in their number of columns. Processing such a file requires detecting the line type and also the presence of an optional terminal column. Might we explore relaxing the constraint that every line of the CSV file have the same number of columns?

In the 2013 NIST Cold Start Knowledge Base Population Task [1], researchers submit output from their text information extraction systems to NIST for evaluation as tab-separated files. A line consists of a triple (subj pred obj) and, for some predicates, provenance information. Provenance includes a document ID and, depending on the predicate, one or three pairs of string offsets within the document. Each line can also have an optional float as a final column to represent a certainty measure. The submission format does not require adding extra separators to make all of the lines uniform, or explicitly adding a default value for the optional last column.

The following four lines show examples of a triple without any annotations, an entity mention with provenance, an entity relation with provenance, and a relation with both provenance and confidence annotations.

:e4	type	PER
:e4	mention	"Bart"	D00124	283-286
:e4	per:siblings	:e7	D00124	283-286	173-179	274-281
:e4	per:age	"10"	D00124	180-181	173-179	182-191	0.9

[1] http://www.nist.gov/tac/2013/KBP/ColdStart/guidelines/KBP2013_ColdStartTaskDescription_1.1.pdf
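To make the processing burden concrete, here is a minimal Python sketch of how a consumer might tell these line types apart by field count alone. It is an illustration under assumptions drawn only from the four examples above, not the NIST reference parser: the names `Row` and `parse_line` are hypothetical, fields are assumed to be separated by single tabs, each offset pair is assumed to be written as `start-end`, and an even field count is taken to mean that the optional confidence column is present.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """One parsed line (hypothetical structure, for illustration only)."""
    subj: str
    pred: str
    obj: str
    doc_id: Optional[str]        # None when the line carries no provenance
    offsets: list                # list of (start, end) integer pairs
    confidence: Optional[float]  # None when the final column is absent


def parse_line(line: str) -> Row:
    fields = line.rstrip("\n").split("\t")

    # Assumption: without a confidence value a line has 3, 5, or 7 fields
    # (triple, plus document ID and one or three offset pairs), so an even
    # count signals that the optional trailing float is present.
    confidence = None
    if len(fields) % 2 == 0:
        confidence = float(fields.pop())

    if len(fields) not in (3, 5, 7):
        raise ValueError(f"unexpected number of fields: {len(fields)}")

    subj, pred, obj = fields[:3]
    doc_id = fields[3] if len(fields) > 3 else None
    # Each remaining field is an offset pair such as "283-286".
    offsets = [tuple(int(n) for n in span.split("-")) for span in fields[4:]]
    return Row(subj, pred, obj, doc_id, offsets, confidence)
```

For example, `parse_line(':e4\tmention\t"Bart"\tD00124\t283-286')` yields a row with one offset pair and no confidence value. The point is that the row type is recoverable only by inspecting the field count, which a fixed-width constraint would rule out unless producers pad every line.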
Received on Wednesday, 19 February 2014 04:58:23 UTC