RE: Scoping: "Tabular Data" from Tandy, Jeremy on 2014-02-24 (public-csv-wg@w3.org from February 2014)

From: Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk>
Date: Mon, 24 Feb 2014 11:50:21 +0000
To: Jeni Tennison <jeni@jenitennison.com>, Dan Brickley <danbri@google.com>
CC: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <2624871D9A05174691BD59F8EFD68AE2B35534@EXXCMPD1DAG3.cmpd1.metoffice.gov.uk>
JeniT said: """ Each row is(*) a X and the columns are the A,B,C… of the X """

We've made the assumption that each row within a table describes _something_ ... is this still a safe assumption? I think/hope so.

The birthdate / deathdate is, I think, an example of (less than optimal) practice for CSV. 

  person1,birthdate,1912-04-23
  person1,deathdate,1993-03-30

I think that it is hard(er) for parsers to interpret a given row based on the value of a given field. I would prefer people to publish like so:

  Name,Birth date,Death date
  Freddie Frederickson,1912-04-23,1993-03-30

Then we can map the row to a specific entity (ex:person1) and see that each column has clearly defined & homogenous meaning; foaf:name, ex:birthdate, ex:deathdate

So our best practice would encourage homogenous meaning in a given column.
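
As a rough illustration of how the two shapes relate, here is a minimal Python sketch that folds entity/property/value rows into one record per entity. The function name pivot_statements is made up for this example, and it assumes every row has exactly three fields and that a property appears at most once per entity.

```python
import csv
import io

# Entity/property/value rows, as in the birthdate/deathdate example above.
LONG_CSV = """person1,birthdate,1912-04-23
person1,deathdate,1993-03-30
"""


def pivot_statements(text):
    """Collect (entity, property, value) rows into one record per entity.

    Assumes exactly three fields per row; a repeated property for the
    same entity would silently overwrite the earlier value.
    """
    records = {}
    for entity, prop, value in csv.reader(io.StringIO(text)):
        records.setdefault(entity, {})[prop] = value
    return records


wide = pivot_statements(LONG_CSV)
print(wide)
```

Note that the parser only knows what a value means after it has read the property field on the same row, which is the extra interpretation step I would rather consumers did not need.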

Finally, Tim's example suggests the use of sub-structure within a given cell:

  "Location"
  D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210

Personally, the idea of "having to parse twice" is a bit problematic ... why should consumers have to maintain two tool sets ... and how do they know what structure is used? OK; contrary examples of embedded structure are xsd:dateTime and geosparql:wktLiteral - but these follow a well-defined structure AND the semantics are given by the attribute (predicate) that refers to that type. But in each of these cases, there's still only _one_ thing being described, whereas Tim's "Location" seems to be providing multiple assertions.

Perhaps we could unpack the assertions as follows:

  "Location"
  D00124 180-181
  D00124 381-380
  D00124 399-406
  D00101 220-225
  D00101 230-233
  D00101 201-210
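
A minimal Python sketch of that unpacking, to show the "second parse" a consumer would need. It rests on assumptions of mine rather than anything stated in the use case: that tokens are whitespace-separated, that a document ID looks like "D" followed by digits, and that every other token is an offset range belonging to the most recent document ID. The name unpack_location is hypothetical.

```python
import re

# Assumption: document IDs look like "D00124"; anything else is an offset range.
DOC_ID = re.compile(r"^D\d+$")


def unpack_location(cell):
    """Split a packed Location cell into (document ID, offset range) pairs."""
    pairs = []
    current_doc = None
    for token in cell.split():
        if DOC_ID.match(token):
            current_doc = token
        elif current_doc is None:
            raise ValueError("offset range %r appears before any document ID" % token)
        else:
            pairs.append((current_doc, token))
    return pairs


cell = "D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210"
for doc, offsets in unpack_location(cell):
    print(doc, offsets)
```

None of those rules can be read off the CSV itself, which is the "how do they know what structure is used?" problem.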

Of course, this means we now have multiple "cells" providing the Location assertion for the same entity. So we can either have (up to) 6 Location columns (yuk) or implement some mechanism for repetition. In Jeni's Linked CSV proposal <http://jenit.github.io/linked-csv/> (section 2.1 Identifiers) she says:

  "a single entity may be described by multiple records within the linked CSV file"

In which case we need a mechanism to relate multiple rows within a CSV file to a given entity (see requirement R-PrimaryKey <http://w3c.github.io/csvw/use-cases-and-requirements/#R-PrimaryKey>).
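
To make the repetition option concrete, here is a small Python sketch that relates several rows to one entity through a key column. It is not the Linked CSV mechanism itself, only an illustration: the "id" column, the value "s6" and the function name collect_by_key are all invented for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout: the "id" column and the value "s6" are invented here
# to stand for whatever identifies the entity; they are not from Tim's data.
CSV_TEXT = """id,Location
s6,D00124 180-181
s6,D00124 381-380
s6,D00124 399-406
s6,D00101 220-225
s6,D00101 230-233
s6,D00101 201-210
"""


def collect_by_key(text, key, column):
    """Gather every value of `column` under the entity named in `key`."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row[key]].append(row[column])
    return dict(grouped)


locations = collect_by_key(CSV_TEXT, "id", "Location")
print(locations["s6"])
```

Each cell still holds a single Location assertion, and the key column is what tells a consumer that the six rows belong together.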

Again - this would form part of our best practices ...

My tuppence (or 2¢ as said across the pond).

Jeremy

-----Original Message-----
From: Jeni Tennison [mailto:jeni@jenitennison.com] 
Sent: 24 February 2014 01:03
To: Dan Brickley
Cc: public-csv-wg@w3.org
Subject: Re: Scoping: "Tabular Data"

Dan,

But in the case where you have:

  person1,birthdate,1912-04-23
  person1,deathdate,1993-03-30
  ...

you can still label the columns in a regular way (entity, property, value). You can fill in a statement that says:

  "Each row is(*) a X and the columns are the A,B,C… of the X”

ie

  “Each row is a *statement* and the columns are the *entity*, *property* and *value* of the statement.”

In your bitmap case, you can say:

  “Each row is a *row of a bitmap* and the columns are the *first pixel*, *second pixel*, *third pixel*... of the row.”

Conversely, in Tim’s case, you can say “Each row is a statement”, but you can’t name the columns in a regular way in terms of being a property of each statement.

Cheers,

Jeni

(*) or “represents” or “contains information about” or whatever you want to say to be more semantically accurate

------------------------------------------------------
From: Dan Brickley danbri@google.com
Reply: Dan Brickley danbri@google.com
Date: 23 February 2014 at 16:09:18
To: Jeni Tennison jeni@theodi.org
Subject:  Re: Scoping: "Tabular Data"

>  
> On 23 February 2014 15:19, Jeni Tennison wrote:
> > Hi,
> >
> > Another scoping question, brought up from Tim Finin’s example from:
> >
> > https://www.w3.org/2013/csvw/wiki/Use_Cases#Representing_entitles_and_facts_extracted_from_text
> >
> > 1> :e4 type PER
> > 2> :e4 mention "Bart" D00124 283-286
> > 3> :e4 mention "JoJo" D00124 145-149 0.9
> > 4> :e4 per:siblings :e7 D00124 283-286 173-179 274-281
> > 5> :e4 per:age "10" D00124 180-181 173-179 182-191 0.9
> > 6> :e4 per:parent :e9 D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210
> > ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
> > 1 2 3 4 5 6 7 8 9 10 11
> >
> > (I’ve added numbers for the implied columns.)
> >
> > To me, this looks like a text-based format in which each line has a
> > defined format, but where there isn’t the commonality between values in
> > a single column that I would normally expect in what I would consider a
> > tabular format.
> >
> > So for example, column 6 contains a certainty value on line 3 and an
> > offset range in lines 4-6, while column 8 contains a certainty value on
> > line 5 and a document ID on line 6.
> >
> > If the data looked like (comma separators added for clarity):  
> >
> > :e4, type, PER, ,
> > :e4, mention, "Bart", D00124 283-286,
> > :e4, mention, "JoJo", D00124 145-149, 0.9
> > :e4, per:siblings, :e7, D00124 283-286 173-179 274-281,
> > :e4, per:age, "10", D00124 180-181 173-179 182-191, 0.9
> > :e4, per:parent, :e9, D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210,
> > ^ ^ ^ ^ ^
> > 1 2 3 4 5
> >
> > then I would consider it tabular data and could add headers:  
> >
> > 1: subject
> > 2: predicate
> > 3: object
> > 4: location
> > 5: certainty
> >
> > Can/should we define tabular data as data where all values in a given
> > column have a common meaning?
>  
> In this last form, you might argue that when relationship typing is 
> pushed down into cell values, i.e. potentially a different predicate 
> in each row, then that column does not really have a "common meaning".
> Or you might say the column does have a broader fixed meaning: it
> carries information about how values from other columns relate to each
> other.
>  
> For the sake of thought experiment I find it useful to come back to 
> pixel-style representation. Consider a 640x480 grid in which red-ness, 
> green-ness and blue-ness values are packed into each cell. Perhaps 
> with a sub-notation using ':', on a 0-1 scale for now:
>  
> So,
>  
> 0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0:1.0, 0.4:1.0:0.0
> 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0
>  
> ... might give us a fragment of such a grid, with neon, black, white etc
> cells.
>  
> Q: Do these columns have regular meaning?
> A: Yes; they stand for a column of pixels in a bitmap
> A: No; each row-column combination stands for a distinct entity (pixel 
> value)
>  
> Q: Is it useful to use W3C CSVW's work to describe this?
> A: Sure. It can help us get the syntax details right (whitespace, 
> quotes, newlines) between tools; and it can provide arbitrary per-file 
> metadata. For example the metadata might tell us that the grid of 
> colours comes from dan's security camera photo at such-and-so a date.
>  
> Q: Isn't this iffy, since there are much better binary representations 
> for such data? (e.g. digital image formats)
> A: Yes, but that can be true for more obviously factual data too.  
>  
> Maybe what I'm getting at here is that I'm not sure what "a common 
> meaning" for columns might mean. On the last call I tried to talk 
> about columns being "homogenous" but that was more in terms of low 
> level data-typing. For example, a column might always contain 
> ISO-8601-style dates, i.e. YYYY-MM-DD. But what they *mean* 
> (birthdate, deathdate, date hired, favourite date, ...) could be fixed 
> by the meaning of a different column. So the column could be 
> datatype-homogenous but the nature of its per-cell meaning could vary
> per cell.
>  
> Dan
>  
>  
>  

--
Jeni Tennison
http://www.jenitennison.com/
Received on Monday, 24 February 2014 11:50:50 UTC