- From: Dan Brickley <danbri@google.com>
- Date: Sun, 23 Feb 2014 16:06:55 -0800
- To: Jeni Tennison <jeni@theodi.org>
- Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
On 23 February 2014 15:19, Jeni Tennison <jeni@theodi.org> wrote:
> Hi,
>
> Another scoping question, brought up from Tim Finin’s example from:
>
> https://www.w3.org/2013/csvw/wiki/Use_Cases#Representing_entitles_and_facts_extracted_from_text
>
> 1> :e4 type PER
> 2> :e4 mention "Bart" D00124 283-286
> 3> :e4 mention "JoJo" D00124 145-149 0.9
> 4> :e4 per:siblings :e7 D00124 283-286 173-179 274-281
> 5> :e4 per:age "10" D00124 180-181 173-179 182-191 0.9
> 6> :e4 per:parent :e9 D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210
> ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
> 1 2 3 4 5 6 7 8 9 10 11
>
> (I’ve added numbers for the implied columns.)
>
> To me, this looks like a text-based format in which each line has a defined format, but where there isn’t the commonality between values in a single column that I would normally expect in what I would consider a tabular format.
>
> So for example, column 6 contains a certainty value on line 3 and an offset range in lines 4-6, while column 8 contains a certainty value on line 5 and a document ID on line 6.
>
> If the data looked like (comma separators added for clarity):
>
> :e4, type, PER, ,
> :e4, mention, ”Bart”, D00124 283-286,
> :e4, mention, ”JoJo”, D00124 145-149, 0.9
> :e4, per:siblings, :e7, D00124 283-286 173-179 274-281,
> :e4, per:age, "10" D00124 180-181 173-179 182-191, 0.9
> :e4, per:parent, :e9 D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210,
> ^ ^ ^ ^ ^
> 1 2 3 4 5
>
> then I would consider it tabular data and could add headers:
>
> 1: subject
> 2: predicate
> 3: object
> 4: location
> 5: certainty
>
> Can/should we define tabular data as data where all values in a given column have a common meaning?

In this last form, you might argue that when relationship typing is pushed down into cell values, i.e. potentially a different predicate in each row, then that column does not really have a "common meaning".
Or you might say the column does have a broader fixed meaning: it carries information about how values from other columns relate to each other.

For the sake of a thought experiment I find it useful to come back to a pixel-style representation. Consider a 640x480 grid in which red-ness, green-ness and blue-ness values are packed into each cell, perhaps with a sub-notation using ':', on a 0-1 scale for now. So,

0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0:1.0, 0.4:1.0:0.0
1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0
...

might give us a fragment of such a grid, with neon, black, white etc. cells.

Q: Do these columns have regular meaning?
A: Yes; they stand for a column of pixels in a bitmap.
A: No; each row-column combination stands for a distinct entity (pixel value).

Q: Is it useful to use W3C CSVW's work to describe this?
A: Sure. It can help us get the syntax details right (whitespace, quotes, newlines) between tools; and it can provide arbitrary per-file metadata. For example, the metadata might tell us that the grid of colours comes from Dan's security camera photo at such-and-so a date.

Q: Isn't this iffy, since there are much better binary representations for such data (e.g. digital image formats)?
A: Yes, but that can be true for more obviously factual data too.

Maybe what I'm getting at here is that I'm not sure what "a common meaning" for columns might mean. On the last call I tried to talk about columns being "homogeneous", but that was more in terms of low-level data-typing. For example, a column might always contain ISO-8601-style dates, i.e. YYYY-MM-DD. But what they *mean* (birthdate, deathdate, date hired, favourite date, ...) could be fixed by the value in a different column of the same row. So the column could be datatype-homogeneous while the nature of its meaning varies from cell to cell.

Dan
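
P.S. To make the pixel-grid thought experiment concrete, here is a rough sketch in Python. It is only an illustration, not anything the group has defined: the names GRID, parse_pixel and parse_grid are made up for this example, and the ':' sub-notation is just the one suggested above. The point is that ordinary CSV tooling handles the outer syntax (commas, whitespace, newlines), while the per-cell packing needs its own convention.

```python
import csv
import io

# Hypothetical fragment of the grid: each cell packs red:green:blue
# values on a 0-1 scale, using ':' as an ad hoc sub-notation.
GRID = """\
0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0:1.0, 0.4:1.0:0.0
1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0
"""


def parse_pixel(cell):
    """Split one 'r:g:b' cell into a tuple of three floats."""
    r, g, b = (float(part) for part in cell.strip().split(":"))
    return (r, g, b)


def parse_grid(text):
    """Parse CSV text into a list of rows of (r, g, b) tuples."""
    # The csv module deals with the outer syntax; skipinitialspace
    # drops the space that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [[parse_pixel(cell) for cell in row] for row in reader if row]


pixels = parse_grid(GRID)
```

Every column here is datatype-homogeneous (each cell is an r:g:b triple), yet each row-column position still stands for a different pixel.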
Received on Monday, 24 February 2014 00:07:25 UTC