RE: CSV+ file lines with differing number of columns

Hi Tim - 

Two thoughts: 

#1: this is a good example to include as a use case. I think there's enough text here already ... it would be great if you could move this across to the wiki <https://www.w3.org/2013/csvw/wiki/Use_Cases>.

#2: your example ...

<snip>
   :e4 type         PER
   :e4 mention      "Bart" D00124 283-286
   :e4 per:siblings :e7    D00124 283-286 173-179 274-281
   :e4 per:age      "10"   D00124 180-181 173-179 182-191 0.9
</snip> 

... seems to be quite regular; from your description, the column headings might be:

subject,predicate,object,document-id,string-offset-1,string-offset-2,string-offset-3,confidence

thus, in comma delimited form, your variable-length rows become:

:e4,type,PER,,,,,
:e4,mention,"Bart",D00124,283-286,,,
:e4,per:siblings,:e7,D00124,283-286,173-179,274-281,
:e4,per:age,"10",D00124,180-181,173-179,182-191,0.9

... or am I missing something?

(I note that the repetition of the string-offset column feels quite clumsy in my modified example :-) )
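
A minimal Python sketch of that padding step, offered as an illustration rather than anything from the draft or the NIST tooling. The names `HEADER`, `is_float` and `to_fixed_width` are hypothetical. It assumes that a trailing field which parses as a float is the confidence value, and that document IDs and offsets never do. That check matters because a line with one offset plus a confidence would otherwise land its confidence in the string-offset-2 column.

```python
# Column names as proposed in this message (8 columns).
HEADER = ["subject", "predicate", "object", "document-id",
          "string-offset-1", "string-offset-2", "string-offset-3",
          "confidence"]


def is_float(text):
    """Return True if text parses as a float (assumed to mean 'confidence')."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def to_fixed_width(fields):
    """Pad one variable-length row out to the eight columns in HEADER."""
    fields = list(fields)
    if len(fields) < 3:
        raise ValueError("a row needs at least subject, predicate, object")
    confidence = ""
    # Assumption: only the optional final column can look like a float.
    if len(fields) > 3 and is_float(fields[-1]):
        confidence = fields.pop()
    if len(fields) > len(HEADER) - 1:
        raise ValueError("too many provenance fields: %r" % (fields,))
    # Fill the unused provenance columns, then put confidence last.
    padding = [""] * (len(HEADER) - 1 - len(fields))
    return fields + padding + [confidence]


rows = [
    [":e4", "type", "PER"],
    [":e4", "mention", '"Bart"', "D00124", "283-286"],
    [":e4", "per:siblings", ":e7", "D00124", "283-286", "173-179", "274-281"],
    [":e4", "per:age", '"10"', "D00124", "180-181", "173-179", "182-191", "0.9"],
]
for row in rows:
    print(",".join(to_fixed_width(row)))
```

The plain `",".join` is only safe here because none of the sample values contains a comma; real data would need proper CSV quoting.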

Jeremy

-----Original Message-----
From: Tim Finin [mailto:finin@cs.umbc.edu] 
Sent: 19 February 2014 04:58
To: public-csv-wg@w3.org
Subject: CSV+ file lines with differing number of columns

The current draft of Syntax for Tabular Data on the Web stipulates (sec 3.3) that "Each line of a CSV+ file must contain the same number of comma-separated values."  While this seems reasonable, some existing use cases I'm familiar with allow for CSV files with several types of lines that differ in their number of columns.  Processing the CSV file requires detecting the line type and also the presence of an optional terminal column.

Might we explore relaxing the constraint that the CSV file have the same number of columns for each line?

In the 2013 NIST Cold Start Knowledge Base Population Task [1], researchers submit output from their text information extraction systems to NIST for evaluation as tab-separated files.  A line consists of a triple (subj pred obj) and, for some predicates, provenance information. Provenance includes a document ID and, depending on the predicate, one or three pairs of string offsets within the document.  Each line can also have an optional float as a final column to represent a certainty measure.

The submission format does not require adding extra separators to make all of the lines uniform or explicitly adding a default value for the optional last column.

The following four lines show examples of a triple without any annotations, an entity mention with provenance, an entity relation with provenance, and a relation with both provenance and confidence annotations.

   :e4 type         PER
   :e4 mention      "Bart" D00124 283-286
   :e4 per:siblings :e7    D00124 283-286 173-179 274-281
   :e4 per:age      "10"   D00124 180-181 173-179 182-191 0.9
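
To make the "detect the line type" step concrete, here is a rough Python sketch of a reader for such lines. It is not the NIST reference parser; the `Assertion` class and `parse_line` function are hypothetical names, and it assumes fields are separated by single tab characters, offsets look like "283-286", and only the optional final column parses as a float.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Assertion:
    subject: str
    predicate: str
    obj: str
    doc_id: Optional[str] = None
    offsets: Tuple[Tuple[int, ...], ...] = ()
    confidence: Optional[float] = None


def parse_line(line: str) -> Assertion:
    """Split one tab-separated line into triple, provenance and confidence."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least subj, pred, obj: %r" % line)
    subject, predicate, obj, *rest = fields

    # Optional terminal column: treat it as confidence if it is a float.
    confidence = None
    if rest:
        try:
            confidence = float(rest[-1])
            rest = rest[:-1]
        except ValueError:
            pass

    # Whatever remains is provenance: a document ID plus 1 or 3 offset pairs.
    doc_id = None
    offsets = ()
    if rest:
        doc_id, *spans = rest
        offsets = tuple(tuple(int(n) for n in s.split("-")) for s in spans)
        if len(offsets) not in (1, 3):
            raise ValueError("expected 1 or 3 offset pairs: %r" % line)

    return Assertion(subject, predicate, obj, doc_id, offsets, confidence)


example = ':e4\tper:age\t"10"\tD00124\t180-181\t173-179\t182-191\t0.9'
print(parse_line(example))
```

The line type then falls out of the parsed record: no `doc_id` means a bare triple, and the length of `offsets` separates the one-offset and three-offset forms.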

[1]
http://www.nist.gov/tac/2013/KBP/ColdStart/guidelines/KBP2013_ColdStartTaskDescription_1.1.pdf

Received on Wednesday, 19 February 2014 17:21:58 UTC