Re: Scoping: "Tabular Data" from Dan Brickley on 2014-02-24 (public-csv-wg@w3.org from February 2014)

From: Dan Brickley <danbri@google.com>
Date: Sun, 23 Feb 2014 16:06:55 -0800
To: Jeni Tennison <jeni@theodi.org>
Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <CAK-qy=7dLnOxefa75-XYKreMACkj5ww3aPHLR4K_JmOQ0_fj-g@mail.gmail.com>
On 23 February 2014 15:19, Jeni Tennison <jeni@theodi.org> wrote:
> Hi,
>
> Another scoping question, brought up from Tim Finin’s example from:
>
>   https://www.w3.org/2013/csvw/wiki/Use_Cases#Representing_entitles_and_facts_extracted_from_text
>
> 1> :e4 type         PER
> 2> :e4 mention      "Bart"  D00124 283-286
> 3> :e4 mention      "JoJo"  D00124 145-149 0.9
> 4> :e4 per:siblings :e7     D00124 283-286 173-179 274-281
> 5> :e4 per:age      "10"    D00124 180-181 173-179 182-191 0.9
> 6> :e4 per:parent   :e9     D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210
>    ^   ^            ^       ^      ^       ^       ^       ^      ^       ^       ^
>    1   2            3       4      5       6       7       8      9       10      11
>
> (I’ve added numbers for the implied columns.)
>
> To me, this looks like a text-based format in which each line has a defined format, but where there isn’t the commonality between values in a single column that I would normally expect in what I would consider a tabular format.
>
> So for example, column 6 contains a certainty value on line 3 and an offset range in lines 4-6, while column 8 contains a certainty value on line 5 and a document ID on line 6.
>
> If the data looked like (comma separators added for clarity):
>
>   :e4, type,         PER,    ,
>   :e4, mention,      "Bart", D00124 283-286,
>   :e4, mention,      "JoJo", D00124 145-149,                                               0.9
>   :e4, per:siblings, :e7,    D00124 283-286 173-179 274-281,
>   :e4, per:age,      "10",   D00124 180-181 173-179 182-191,                               0.9
>   :e4, per:parent,   :e9,    D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210,
>   ^    ^             ^       ^                                                             ^
>   1    2             3       4                                                             5
>
> then I would consider it tabular data and could add headers:
>
>   1: subject
>   2: predicate
>   3: object
>   4: location
>   5: certainty
>
> Can/should we define tabular data as data where all values in a given column have a common meaning?

In this last form, you might argue that when relationship typing is
pushed down into cell values, i.e. potentially a different predicate
in each row, then that column does not really have a "common meaning".
Or you might say the column does have a broader fixed meaning: it
carries information about how values from other columns relate to each
other.

For the sake of a thought experiment I find it useful to come back to
a pixel-style representation. Consider a 640x480 grid in which red-ness,
green-ness and blue-ness values are packed into each cell. Perhaps
with a sub-notation using ':', on a 0-1 scale for now:

So,

0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0:1.0, 0.4:1.0:0.0
1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0
...

might give us a fragment of such a grid, with neon, black, white etc.
cells.
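
As a rough sketch of how a fragment like that could be read (Python,
purely illustrative: the helper name parse_pixel_grid is made up here,
and the comma and ':' handling is just my reading of the ad hoc
notation above, not anything CSVW defines):

```python
import csv
import io

def parse_pixel_grid(text):
    """Parse rows of comma-separated 'r:g:b' cells into tuples of floats."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    grid = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        grid.append(
            [tuple(float(v) for v in cell.strip().split(":")) for cell in row]
        )
    return grid

# Two rows in the sub-notation sketched above (values on a 0-1 scale).
sample = (
    "0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0:1.0, 0.4:1.0:0.0\n"
    "1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0\n"
)
grid = parse_pixel_grid(sample)

# A "column" here is just the pixels that share an x position.
column_0 = [row[0] for row in grid]
```

Nothing in the parsing cares whether a column "means" anything; each
cell is decoded the same way regardless of where it sits.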

Q: Do these columns have regular meaning?
A: Yes; they stand for a column of pixels in a bitmap
A: No; each row-column combination stands for a distinct entity (pixel value)

Q: Is it useful to use W3C CSVW's work to describe this?
A: Sure. It can help us get the syntax details right (whitespace,
quotes, newlines) between tools; and it can provide arbitrary per-file
metadata. For example the metadata might tell us that the grid of
colours comes from dan's security camera photo at such-and-so a date.

Q: Isn't this iffy, since there are much better binary representations
for such data? (e.g. digital image formats)
A: Yes, but that can be true for more obviously factual data too.

Maybe what I'm getting at here is that I'm not sure what "a common
meaning" for columns might mean. On the last call I tried to talk
about columns being "homogeneous" but that was more in terms of
low-level data-typing. For example, a column might always contain
ISO-8601-style dates, i.e. YYYY-MM-DD. But what they *mean*
(birthdate, deathdate, date hired, favourite date, ...) could be fixed
by the meaning of a different column. So the column could be
datatype-homogeneous but the nature of its per-cell meaning could vary
per cell.
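
To make that concrete, here is a small sketch (Python; the rows, the
names and the column labels 'kind' and 'date' are all invented for
illustration) of a column that is datatype-homogeneous while what each
cell means is set by a neighbouring column:

```python
import csv
import io
from datetime import date

# Hypothetical data: the 'date' column always holds YYYY-MM-DD values,
# but the 'kind' column decides what each date means.
data = """name,kind,date
alice,birthdate,1970-01-31
alice,date hired,1999-06-01
bob,birthdate,1980-12-25
"""

rows = list(csv.DictReader(io.StringIO(data)))

# Datatype check: every cell in the column parses the same way.
dates = [date.fromisoformat(r["date"]) for r in rows]

# Meaning: only recoverable by pairing each date with its 'kind' cell.
facts = {(r["name"], r["kind"]): d for r, d in zip(rows, dates)}
```

The type check needs only the one column; the facts need two.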

Dan
Received on Monday, 24 February 2014 00:07:25 UTC