- From: Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk>
- Date: Mon, 24 Feb 2014 11:50:21 +0000
- To: Jeni Tennison <jeni@jenitennison.com>, Dan Brickley <danbri@google.com>
- CC: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
JeniT said:

"""
Each row is(*) a X and the columns are the A,B,C… of the X
"""

We've made the assumption that each row within a table describes _something_ ... is this still a safe assumption? I think/hope so.

The birthdate / deathdate is, I think, an example of (less than optimal) practice for CSV.

person1,birthdate,1912-04-23
person1,deathdate,1993-03-30

I think that it is hard(er) for parsers to interpret a given row based on the value of a given field. I would prefer people to publish like so:

Name,Birth date,Death date
Freddie Frederickson,1912-04-23,1993-03-30

Then we can map the row to a specific entity (ex:person1) and see that each column has clearly defined & homogenous meaning; foaf:name, ex:birthdate, ex:deathdate

So our best practice would encourage homogenous meaning in a given column.

Finally, Tim's example suggests the use of sub-structure within a given cell:

"Location"
D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210

Personally, the idea of "having to parse twice" is a bit problematic ... why should consumers have to maintain two tool sets ... and how do they know what structure is used?

OK; contrary examples of embedded structure are xsd:dateTime and geosparql:wktLiteral - but these follow a well defined structure AND the semantics are given by the attribute (predicate) that refers to that type. But in each of these cases, there's still only _one_ thing being described, whereas Tim's "Location" seems to be providing multiple assertions.

Perhaps we could unpack the assertions as follows:

"Location"
D00124 180-181
D00124 381-380
D00124 399-406
D00101 220-225
D00101 230-233
D00101 201-210

Of course, this means we now have multiple "cells" providing the Location assertion for the same entity. So we can either have (up to) 6 Location columns (yuk) or implement some mechanism for repetition.
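
As an illustration of the unpacking described above, here is a minimal Python sketch. It rests on assumptions that the thread does not confirm: that an offset range always looks like digits, a hyphen, then digits (e.g. 180-181), that any other token is a document ID, and that each range belongs to the most recent document ID before it. The name unpack_locations is hypothetical.

```python
import re

# Assumption: an offset range looks like "180-181"; any other token is a document ID.
OFFSET_RANGE = re.compile(r"^\d+-\d+$")


def unpack_locations(cell):
    """Split a packed Location cell into (document_id, offset_range) pairs.

    Each range is paired with the most recent document ID seen before it,
    which is an assumed reading of the packed format.
    """
    pairs = []
    document_id = None
    for token in cell.split():
        if OFFSET_RANGE.match(token):
            if document_id is None:
                raise ValueError(f"offset range {token!r} appears before any document ID")
            pairs.append((document_id, token))
        else:
            # Not a range, so treat it as the document ID for the ranges that follow.
            document_id = token
    return pairs


packed = "D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210"
for document_id, offset_range in unpack_locations(packed):
    print(document_id, offset_range)
```

Emitting one pair per assertion gives every value in the Location column the same shape, at the cost of needing some way to tie the resulting rows back to a single entity.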
In Jeni's Linked CSV proposal <http://jenit.github.io/linked-csv/> (section 2.1 Identifiers) she says:

"a single entity may be described by multiple records within the linked CSV file"

In which case we need a mechanism to relate multiple rows within a CSV file to a given entity (see requirement R-PrimaryKey <http://w3c.github.io/csvw/use-cases-and-requirements/#R-PrimaryKey>).

Again - this would form part of our best practices ...

My tuppence (or 2¢ as said across the pond).

Jeremy

-----Original Message-----
From: Jeni Tennison [mailto:jeni@jenitennison.com]
Sent: 24 February 2014 01:03
To: Dan Brickley
Cc: public-csv-wg@w3.org
Subject: Re: Scoping: "Tabular Data"

Dan,

But in the case where you have:

person1,birthdate,1912-04-23
person1,deathdate,1993-03-30

... you can still label the columns in a regular way (entity, property, value). You can fill in a statement that says:

“Each row is(*) a X and the columns are the A,B,C… of the X”

i.e.

“Each row is a *statement* and the columns are the *entity*, *property* and *value* of the statement.”

In your bitmap case, you can say:

“Each row is a *row of a bitmap* and the columns are the *first pixel*, *second pixel*, *third pixel*... of the row.”

Conversely, in Tim’s case, you can say “Each row is a statement”, but you can’t name the columns in a regular way in terms of being a property of each statement.

Cheers,

Jeni

(*) or “represents” or “contains information about” or whatever you want to say to be more semantically accurate

------------------------------------------------------
From: Dan Brickley danbri@google.com
Reply: Dan Brickley danbri@google.com
Date: 23 February 2014 at 16:09:18
To: Jeni Tennison jeni@theodi.org
Subject: Re: Scoping: "Tabular Data"

> On 23 February 2014 15:19, Jeni Tennison wrote:
> > Hi,
> >
> > Another scoping question, brought up from Tim Finin’s example from:
> >
> > https://www.w3.org/2013/csvw/wiki/Use_Cases#Representing_entitles_and_facts_extracted_from_text
> >
> > 1> :e4 type PER
> > 2> :e4 mention "Bart" D00124 283-286
> > 3> :e4 mention "JoJo" D00124 145-149 0.9
> > 4> :e4 per:siblings :e7 D00124 283-286 173-179 274-281
> > 5> :e4 per:age "10" D00124 180-181 173-179 182-191 0.9
> > 6> :e4 per:parent :e9 D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210
> >    ^   ^          ^   ^      ^       ^       ^       ^      ^       ^       ^
> >    1   2          3   4      5       6       7       8      9       10      11
> >
> > (I’ve added numbers for the implied columns.)
> >
> > To me, this looks like a text-based format in which each line has a defined format, but where there isn’t the commonality between values in a single column that I would normally expect in what I would consider a tabular format.
> >
> > So for example, column 6 contains a certainty value on line 3 and an offset range in lines 4-6, while column 8 contains a certainty value on line 5 and a document ID on line 6.
> >
> > If the data looked like (comma separators added for clarity):
> >
> > :e4, type, PER, ,
> > :e4, mention, "Bart", D00124 283-286,
> > :e4, mention, "JoJo", D00124 145-149, 0.9
> > :e4, per:siblings, :e7, D00124 283-286 173-179 274-281,
> > :e4, per:age, "10", D00124 180-181 173-179 182-191, 0.9
> > :e4, per:parent, :e9, D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210,
> > ^    ^           ^    ^                                                              ^
> > 1    2           3    4                                                              5
> >
> > then I would consider it tabular data and could add headers:
> >
> > 1: subject
> > 2: predicate
> > 3: object
> > 4: location
> > 5: certainty
> >
> > Can/should we define tabular data as data where all values in a given column have a common meaning?
>
> In this last form, you might argue that when relationship typing is pushed down into cell values, i.e. potentially a different predicate in each row, then that column does not really have a "common meaning". Or you might say the column does have a broader fixed meaning: it carries information about how values from other columns relate to each other.
>
> For the sake of thought experiment I find it useful to come back to pixel-style representation. Consider a 640x480 grid in which red-ness, green-ness and blue-ness values are packed into each cell. Perhaps with a sub-notation using ':', on a 0-1 scale for now:
>
> So,
>
> 0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0:1.0, 0.4:1.0:0.0
> 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0
>
> ... might give us a fragment of such a grid, with neon, black, white etc cells.
>
> Q: Do these columns have regular meaning?
> A: Yes; they stand for a column of pixels in a bitmap
> A: No; each row-column combination stands for a distinct entity (pixel value)
>
> Q: Is it useful to use W3C CSVW's work to describe this?
> A: Sure. It can help us get the syntax details right (whitespace, quotes, newlines) between tools; and it can provide arbitrary per-file metadata.
> For example the metadata might tell us that the grid of colours comes from dan's security camera photo at such-and-so a date.
>
> Q: Isn't this iffy, since there are much better binary representations for such data? (e.g. digital image formats)
> A: Yes, but that can be true for more obviously factual data too.
>
> Maybe what I'm getting at here is that I'm not sure what "a common meaning" for columns might mean. On the last call I tried to talk about columns being "homogenous" but that was more in terms of low level data-typing. For example, a column might always contain ISO-8601-style dates, i.e. YYYY-MM-DD. But what they *mean* (birthdate, deathdate, date hired, favourite date, ...) could be fixed by the meaning of a different column. So the column could be datatype-homogenous but the nature of its per-cell meaning could vary per cell.
>
> Dan

--
Jeni Tennison
http://www.jenitennison.com/
Received on Monday, 24 February 2014 11:50:50 UTC