- From: Jeni Tennison <jeni@jenitennison.com>
- Date: Sun, 2 Mar 2014 21:05:29 +0000
- To: Dan Brickley <danbri@google.com>
- Cc: public-csv-wg@w3.org, "Ceolin, D." <d.ceolin@vu.nl>
It might not be worth getting into, but I think of an observation as an entity, à la Data Cube Observations [1].

Jeni

[1] http://www.w3.org/TR/vocab-data-cube/#reference-observations

------------------------------------------------------
From: Dan Brickley danbri@google.com
Reply: Dan Brickley danbri@google.com
Date: 2 March 2014 at 20:42:32
To: Jeni Tennison jeni@jenitennison.com
Subject: Re: Scoping: "Tabular Data"

> On 2 Mar 2014 11:18, "Jeni Tennison" wrote:
> >
> > Davide,
> >
> > I’ve updated the spec here:
> >
> > http://w3c.github.io/csvw/syntax/
> >
> > with the definition that I think we agreed to (though I’m happy to
> > continue wordsmithing it), namely that each row contains information
> > about some (one) thing.
>
> We should allow two common cases:
>
> Each row is about a different entity (or set of).
>
> And
>
> Each row is an observation on the state of such an entity/entities.
>
> If we want we can say one entity per row is somehow the primary focus,
> though describing that one thing is often achieved by mentioning
> properties of others.
>
> The latter allows for log-like and time series data, the former for more
> entity-relationship structures.
>
> Dan
>
> > Jeni
> >
> > ------------------------------------------------------
> > From: Ceolin, D. d.ceolin@vu.nl
> > Reply: Ceolin, D. d.ceolin@vu.nl
> > Date: 2 March 2014 at 17:52:00
> > To: Jeni Tennison jeni@jenitennison.com
> > Subject: Re: Scoping: "Tabular Data"
> >
> > > Hi Jeni,
> > >
> > > that's clear, thanks. What about the meaning of each row? (sorry
> > > for being pedantic...)
> > > Best,
> > >
> > > Davide
> > >
> > > On 01 Mar 2014, at 23:31, Jeni Tennison wrote:
> > >
> > > > Davide,
> > > >
> > > > I think the upshot of the discussion was that we came to an agreement
> > > > that in *tabular* data, each column has a consistent meaning
> > > > across all rows.
> > > >
> > > > I’m not sure that conclusion addresses your query.
> > > >
> > > > Jeni
> > > >
> > > > ------------------------------------------------------
> > > > From: Ceolin, D. d.ceolin@vu.nl
> > > > Reply: Ceolin, D. d.ceolin@vu.nl
> > > > Date: 28 February 2014 at 11:42:11
> > > > To: Jeni Tennison jeni@jenitennison.com
> > > > Subject: Re: Scoping: "Tabular Data"
> > > >
> > > >> Hi all,
> > > >>
> > > >> I'm adding Tim's use case to the "use case and requirements doc",
> > > >> and I was wondering what conclusion we drew from this discussion,
> > > >> if any.
> > > >> In particular, I'd say that not only in Tim's case “Each row is
> > > >> a statement”, but also "Each row is a statement and possibly one
> > > >> or more annotations about that statement".
> > > >> This may add some ambiguity (e.g. is the confidence related only
> > > >> to the triple or to the triple and its provenance?), but offers
> > > >> also an easy way to annotate statements (and, BTW, how would that
> > > >> be translated into RDF? By means of reification or else? I'm very
> > > >> interested in trust value representations and related).
> > > >> Also, I'm not sure if these issues are fully covered by the PrimaryKey
> > > >> and SemanticTypeDefinition requirements.
> > > >> Cheers,
> > > >>
> > > >> Davide
> > > >>
> > > >>> In your bitmap case, you can say:
> > > >>>
> > > >>> “Each row is a *row of a bitmap* and the columns are the *first
> > > >>> pixel*, *second pixel*, *third pixel*... of the row.”
> > > >>>
> > > >>> Conversely, in Tim’s case, you can say “Each row is a statement”,
> > > >>> but you can’t name the columns in a regular way in terms of being
> > > >>> a property of each statement.
> > > >>>
> > > >>> Cheers,
> > > >>>
> > > >>> Jeni
> > > >>>
> > > >>> (*) or “represents” or “contains information about” or whatever
> > > >>> you want to say to be more semantically accurate
> > > >>>
> > > >>> ------------------------------------------------------
> > > >>> From: Dan Brickley danbri@google.com
> > > >>> Reply: Dan Brickley danbri@google.com
> > > >>> Date: 23 February 2014 at 16:09:18
> > > >>> To: Jeni Tennison jeni@theodi.org
> > > >>> Subject: Re: Scoping: "Tabular Data"
> > > >>>
> > > >>>> On 23 February 2014 15:19, Jeni Tennison wrote:
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> Another scoping question, brought up from Tim Finin’s example from:
> > > >>>>>
> > > >>>>> https://www.w3.org/2013/csvw/wiki/Use_Cases#Representing_entitles_and_facts_extracted_from_text
> > > >>>>>
> > > >>>>> 1> :e4 type         PER
> > > >>>>> 2> :e4 mention      "Bart" D00124 283-286
> > > >>>>> 3> :e4 mention      "JoJo" D00124 145-149 0.9
> > > >>>>> 4> :e4 per:siblings :e7    D00124 283-286 173-179 274-281
> > > >>>>> 5> :e4 per:age      "10"   D00124 180-181 173-179 182-191 0.9
> > > >>>>> 6> :e4 per:parent   :e9    D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210
> > > >>>>>    ^   ^            ^      ^      ^       ^       ^       ^      ^       ^       ^
> > > >>>>>    1   2            3      4      5       6       7       8      9       10      11
> > > >>>>>
> > > >>>>> (I’ve added numbers for the implied columns.)
> > > >>>>>
> > > >>>>> To me, this looks like a text-based format in which each line
> > > >>>>> has a defined format, but where there isn’t the commonality between
> > > >>>>> values in a single column that I would normally expect in what
> > > >>>>> I would consider a tabular format.
> > > >>>>>
> > > >>>>> So for example, column 6 contains a certainty value on line 3
> > > >>>>> and an offset range in lines 4-6, while column 8 contains a certainty
> > > >>>>> value on line 5 and a document ID on line 6.
> > > >>>>>
> > > >>>>> If the data looked like (comma separators added for clarity):
> > > >>>>>
> > > >>>>> :e4, type, PER, ,
> > > >>>>> :e4, mention, "Bart", D00124 283-286,
> > > >>>>> :e4, mention, "JoJo", D00124 145-149, 0.9
> > > >>>>> :e4, per:siblings, :e7, D00124 283-286 173-179 274-281,
> > > >>>>> :e4, per:age, "10", D00124 180-181 173-179 182-191, 0.9
> > > >>>>> :e4, per:parent, :e9, D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210,
> > > >>>>> ^ ^ ^ ^ ^
> > > >>>>> 1 2 3 4 5
> > > >>>>>
> > > >>>>> then I would consider it tabular data and could add headers:
> > > >>>>>
> > > >>>>> 1: subject
> > > >>>>> 2: predicate
> > > >>>>> 3: object
> > > >>>>> 4: location
> > > >>>>> 5: certainty
> > > >>>>>
> > > >>>>> Can/should we define tabular data as data where all values in
> > > >>>>> a given column have a common meaning?
> > > >>>>
> > > >>>> In this last form, you might argue that when relationship typing is
> > > >>>> pushed down into cell values, i.e. potentially a different predicate
> > > >>>> in each row, then that column does not really have a "common meaning".
> > > >>>> Or you might say the column does have a broader fixed meaning: it
> > > >>>> carries information about how values from other columns relate to each
> > > >>>> other.
> > > >>>>
> > > >>>> For the sake of thought experiment I find it useful to come back to
> > > >>>> pixel-style representation. Consider a 640x480 grid in which red-ness,
> > > >>>> green-ness and blue-ness values are packed into each cell. Perhaps
> > > >>>> with a sub-notation using ':', on a 0-1 scale for now:
> > > >>>>
> > > >>>> So,
> > > >>>>
> > > >>>> 0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0:1.0, 0.4:1.0:0.0
> > > >>>> 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0
> > > >>>> ...
> > > >>>> might give us a fragment of such a grid, with neon, black, white
> > > >>>> etc cells.
> > > >>>>
> > > >>>> Q: Do these columns have regular meaning?
> > > >>>> A: Yes; they stand for a column of pixels in a bitmap
> > > >>>> A: No; each row-column combination stands for a distinct entity
> > > >>>> (pixel value)
> > > >>>>
> > > >>>> Q: Is it useful to use W3C CSVW's work to describe this?
> > > >>>> A: Sure. It can help us get the syntax details right (whitespace,
> > > >>>> quotes, newlines) between tools; and it can provide arbitrary per-file
> > > >>>> metadata. For example the metadata might tell us that the grid of
> > > >>>> colours comes from dan's security camera photo at such-and-so a date.
> > > >>>>
> > > >>>> Q: Isn't this iffy, since there are much better binary representations
> > > >>>> for such data? (e.g. digital image formats)
> > > >>>> A: Yes, but that can be true for more obviously factual data too.
> > > >>>>
> > > >>>> Maybe what I'm getting at here is that I'm not sure what "a common
> > > >>>> meaning" for columns might mean. On the last call I tried to talk
> > > >>>> about columns being "homogeneous" but that was more in terms of low
> > > >>>> level data-typing. For example, a column might always contain
> > > >>>> ISO-8601-style dates, i.e. YYYY-MM-DD. But what they *mean*
> > > >>>> (birthdate, deathdate, date hired, favourite date, ...) could be fixed
> > > >>>> by the meaning of a different column. So the column could be
> > > >>>> datatype-homogeneous but the nature of its per-cell meaning could vary
> > > >>>> per cell.
> > > >>>>
> > > >>>> Dan
> > > >>>>
> > > >>> --
> > > >>> Jeni Tennison
> > > >>> http://www.jenitennison.com/
> > > >>>
> > > > --
> > > > Jeni Tennison
> > > > http://www.jenitennison.com/
> > > >
> > --
> > Jeni Tennison
> > http://www.jenitennison.com/
>
--
Jeni Tennison
http://www.jenitennison.com/
Received on Sunday, 2 March 2014 21:06:00 UTC