Re: Scoping: "Tabular Data" from Jeni Tennison on 2014-03-02 (public-csv-wg@w3.org from March 2014)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Sun, 2 Mar 2014 21:05:29 +0000
To: Dan Brickley <danbri@google.com>
Cc: public-csv-wg@w3.org, "Ceolin, D." <d.ceolin@vu.nl>
Message-ID: <etPan.53139d19.3dc240fb.137@jenit.local>
It might not be worth getting into, but I think of an observation as an entity, à la Data Cube Observations [1].

Jeni

[1] http://www.w3.org/TR/vocab-data-cube/#reference-observations
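
As a rough illustration of the two row interpretations quoted below (one row per entity, or one row per observation of an entity), here is a minimal Python sketch. The station and temperature data, the `Observation` class and its field names are all hypothetical and not taken from the thread or from the Data Cube spec. The only point it tries to make is that an observation row can be treated as an entity in its own right, identified by its dimension values rather than by the thing observed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One row of an observation table, treated as an entity in its own right."""
    station: str         # dimension: which entity was observed (hypothetical)
    date: str            # dimension: when it was observed
    temperature: float   # measure: the observed value

    @property
    def key(self):
        # The observation's identity comes from its dimension values.
        return (self.station, self.date)


# Entity-style table: each row is about a different station.
stations = [
    {"station": "S1", "city": "Leeds"},
    {"station": "S2", "city": "Bristol"},
]

# Observation-style table: several rows describe the state of the same station.
observations = [
    Observation("S1", "2014-03-01", 7.5),
    Observation("S1", "2014-03-02", 9.0),
    Observation("S2", "2014-03-01", 8.0),
]

station_ids = [row["station"] for row in stations]
observation_keys = [obs.key for obs in observations]

# A station id identifies an entity row, but not an observation row.
print(len(set(station_ids)) == len(stations))
print(len({obs.station for obs in observations}) < len(observations))
# The combination of dimensions does identify each observation row.
print(len(set(observation_keys)) == len(observations))
```

Under this reading both kinds of table satisfy "each row contains information about one thing"; they differ only in what that thing is.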

------------------------------------------------------
From: Dan Brickley danbri@google.com
Reply: Dan Brickley danbri@google.com
Date: 2 March 2014 at 20:42:32
To: Jeni Tennison jeni@jenitennison.com
Subject:  Re: Scoping: "Tabular Data"

>  
> On 2 Mar 2014 11:18, "Jeni Tennison"  
> wrote:
> >
> > Davide,
> >
> > I’ve updated the spec here:
> >
> > http://w3c.github.io/csvw/syntax/
> >
> > with the definition that I think we agreed to (though I’m happy to
> > continue wordsmithing it), namely that each row contains information
> > about some (one) thing.
>  
> We should allow two common cases:
>  
> Each row is about a different entity (or set of).
>  
> And
>  
> Each row is an observation on the state of such an entity/entities.  
>  
> If we want we can say one entity per row is somehow the primary focus,
> though describing that one thing is often achieved by mentioning
> properties of others.
>  
> The latter allows for log-like and time series data, the former for more
> entity-relationship structures.
>  
> Dan
>  
> > Jeni
> >
> > ------------------------------------------------------  
> > From: Ceolin, D. d.ceolin@vu.nl
> > Reply: Ceolin, D. d.ceolin@vu.nl
> > Date: 2 March 2014 at 17:52:00
> > To: Jeni Tennison jeni@jenitennison.com
> > Subject: Re: Scoping: "Tabular Data"
> >
> > >
> > > Hi Jeni,
> > >
> > > that's clear, thanks. What about the meaning of each row? (sorry  
> > > for being pedantic...)
> > > Best,
> > >
> > > Davide
> > >
> > > On 1 Mar 2014, at 23:31, Jeni Tennison wrote:
> > >
> > > > Davide,
> > > >
> > > > I think the upshot of the discussion was that we came to an agreement
> > > > that in *tabular* data, each column has a consistent meaning
> > > > across all rows.
> > > >
> > > > I’m not sure that conclusion addresses your query.
> > > >
> > > > Jeni
> > > >
> > > > ------------------------------------------------------  
> > > > From: Ceolin, D. d.ceolin@vu.nl
> > > > Reply: Ceolin, D. d.ceolin@vu.nl
> > > > Date: 28 February 2014 at 11:42:11
> > > > To: Jeni Tennison jeni@jenitennison.com
> > > > Subject: Re: Scoping: "Tabular Data"
> > > >
> > > >>
> > > >> Hi all,
> > > >>
> > > >> I'm adding Tim's use case to the "use case and requirements doc",
> > > >> and I was wondering what conclusion we drew from this discussion,
> > > >> if any.
> > > >> In particular, I'd say that in Tim's case we have not only “Each row
> > > >> is a statement”, but also "Each row is a statement and possibly one
> > > >> or more annotations about that statement".
> > > >> This may add some ambiguity (e.g. is the confidence related only
> > > >> to the triple or to the triple and its provenance?), but it also
> > > >> offers an easy way to annotate statements (and, BTW, how would that
> > > >> be translated into RDF? By means of reification or something else?
> > > >> I'm very interested in trust value representations and related topics).
> > > >> Also, I'm not sure if these issues are fully covered by the PrimaryKey
> > > >> and SemanticTypeDefinition requirements.
> > > >> Cheers,
> > > >>
> > > >> Davide
> > > >>
> > > >>
> > > >>> In your bitmap case, you can say:
> > > >>>
> > > >>> “Each row is a *row of a bitmap* and the columns are the *first
> > > >>> pixel*, *second pixel*, *third pixel*... of the row.”
> > > >>>
> > > >>> Conversely, in Tim’s case, you can say “Each row is a statement”,
> > > >>> but you can’t name the columns in a regular way in terms of being
> > > >>> a property of each statement.
> > > >>>
> > > >>> Cheers,
> > > >>>
> > > >>> Jeni
> > > >>>
> > > >>> (*) or “represents” or “contains information about” or whatever
> > > >>> you want to say to be more semantically accurate
> > > >>>
> > > >>> ------------------------------------------------------  
> > > >>> From: Dan Brickley danbri@google.com
> > > >>> Reply: Dan Brickley danbri@google.com
> > > >>> Date: 23 February 2014 at 16:09:18
> > > >>> To: Jeni Tennison jeni@theodi.org
> > > >>> Subject: Re: Scoping: "Tabular Data"
> > > >>>
> > > >>>>
> > > >>>> On 23 February 2014 15:19, Jeni Tennison
> > > >>>> wrote:
> > > >>>>> Hi,
> > > >>>>>
> > > > >>>>> Another scoping question, brought up by Tim Finin’s example
> > > > >>>>> from:
> > > > >>>>>
> > > > >>>>> https://www.w3.org/2013/csvw/wiki/Use_Cases#Representing_entitles_and_facts_extracted_from_text
> > > >>>>>
> > > >>>>> 1> :e4 type PER
> > > >>>>> 2> :e4 mention "Bart" D00124 283-286
> > > >>>>> 3> :e4 mention "JoJo" D00124 145-149 0.9
> > > >>>>> 4> :e4 per:siblings :e7 D00124 283-286 173-179 274-281  
> > > >>>>> 5> :e4 per:age "10" D00124 180-181 173-179 182-191 0.9  
> > > > >>>>> 6> :e4 per:parent :e9 D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210
> > > >>>>> ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
> > > >>>>> 1 2 3 4 5 6 7 8 9 10 11
> > > >>>>>
> > > >>>>> (I’ve added numbers for the implied columns.)
> > > >>>>>
> > > > >>>>> To me, this looks like a text-based format in which each line
> > > > >>>>> has a defined format, but where there isn’t the commonality between
> > > > >>>>> values in a single column that I would normally expect in what
> > > > >>>>> I would consider a tabular format.
> > > >>>>>
> > > > >>>>> So for example, column 6 contains a certainty value on line 3
> > > > >>>>> and an offset range in lines 4-6, while column 8 contains a certainty
> > > > >>>>> value on line 5 and a document ID on line 6.
> > > >>>>>
> > > >>>>> If the data looked like (comma separators added for clarity):  
> > > >>>>>
> > > > >>>>> :e4, type, PER, ,
> > > > >>>>> :e4, mention, "Bart", D00124 283-286,
> > > > >>>>> :e4, mention, "JoJo", D00124 145-149, 0.9
> > > > >>>>> :e4, per:siblings, :e7, D00124 283-286 173-179 274-281,
> > > > >>>>> :e4, per:age, "10", D00124 180-181 173-179 182-191, 0.9
> > > > >>>>> :e4, per:parent, :e9, D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210,
> > > >>>>> ^ ^ ^ ^ ^
> > > >>>>> 1 2 3 4 5
> > > >>>>>
> > > >>>>> then I would consider it tabular data and could add headers:  
> > > >>>>>
> > > >>>>> 1: subject
> > > >>>>> 2: predicate
> > > >>>>> 3: object
> > > >>>>> 4: location
> > > >>>>> 5: certainty
> > > >>>>>
> > > > >>>>> Can/should we define tabular data as data where all values in
> > > > >>>>> a given column have a common meaning?
> > > >>>>
> > > > >>>> In this last form, you might argue that when relationship typing is
> > > > >>>> pushed down into cell values, i.e. potentially a different predicate
> > > > >>>> in each row, then that column does not really have a "common meaning".
> > > > >>>> Or you might say the column does have a broader fixed meaning: it
> > > > >>>> carries information about how values from other columns relate to
> > > > >>>> each other.
> > > >>>>
> > > > >>>> For the sake of thought experiment I find it useful to come back to
> > > > >>>> pixel-style representation. Consider a 640x480 grid in which red-ness,
> > > > >>>> green-ness and blue-ness values are packed into each cell. Perhaps
> > > > >>>> with a sub-notation using ':', on a 0-1 scale for now:
> > > >>>>
> > > >>>> So,
> > > >>>>
> > > >>>> 0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0;1.0, 0.4:1.0:0.0  
> > > >>>> 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0  
> > > ...
> > > >> might
> > > >>>> give us a
> > > >>>> fragment of such a grid, with neon, black, white etc cells.  
> > > >>>>
> > > >>>> Q: Do these columns have regular meaning?
> > > >>>> A: Yes; they stand for a column of pixels in a bitmap
> > > > >>>> A: No; each row-column combination stands for a distinct entity
> > > > >>>> (pixel value)
> > > >>>>
> > > > >>>> Q: Is it useful to use W3C CSVW's work to describe this?
> > > > >>>> A: Sure. It can help us get the syntax details right (whitespace,
> > > > >>>> quotes, newlines) between tools; and it can provide arbitrary per-file
> > > > >>>> metadata. For example the metadata might tell us that the grid of
> > > > >>>> colours comes from dan's security camera photo at such-and-so a date.
> > > >>>>
> > > > >>>> Q: Isn't this iffy, since there are much better binary representations
> > > > >>>> for such data? (e.g. digital image formats)
> > > > >>>> A: Yes, but that can be true for more obviously factual data too.
> > > >>>>
> > > > >>>> Maybe what I'm getting at here is that I'm not sure what "a common
> > > > >>>> meaning" for columns might mean. On the last call I tried to talk
> > > > >>>> about columns being "homogeneous" but that was more in terms of low
> > > > >>>> level data-typing. For example, a column might always contain
> > > > >>>> ISO-8601-style dates, i.e. YYYY-MM-DD. But what they *mean*
> > > > >>>> (birthdate, deathdate, date hired, favourite date, ...) could be fixed
> > > > >>>> by the meaning of a different column. So the column could be
> > > > >>>> datatype-homogeneous but the nature of its per-cell meaning could vary
> > > > >>>> per cell.
> > > >>>>
> > > >>>> Dan
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>> --
> > > >>> Jeni Tennison
> > > >>> http://www.jenitennison.com/
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > > > --
> > > > Jeni Tennison
> > > > http://www.jenitennison.com/
> > >
> > >
> > >
> > >
> >
> > --
> > Jeni Tennison
> > http://www.jenitennison.com/
>  

--  
Jeni Tennison
http://www.jenitennison.com/
Received on Sunday, 2 March 2014 21:06:00 UTC
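
The column-consistency question raised in the quoted thread (whether every value in a given column has a common meaning) can be probed mechanically, at least at the shallow "shape of the value" level. The following is a minimal Python sketch under stated assumptions: it splits Tim Finin's six example lines on whitespace and records which kinds of value turn up in each implied column. The `kind` categories and their regular expressions are illustrative guesses, not part of any CSVW draft, and, as Dan points out in the thread, a column that passes a datatype check like this can still vary in meaning from row to row.

```python
import re

# Tim Finin's example lines as quoted in the thread (whitespace-separated).
LINES = [
    ':e4 type PER',
    ':e4 mention "Bart" D00124 283-286',
    ':e4 mention "JoJo" D00124 145-149 0.9',
    ':e4 per:siblings :e7 D00124 283-286 173-179 274-281',
    ':e4 per:age "10" D00124 180-181 173-179 182-191 0.9',
    ':e4 per:parent :e9 D00124 180-181 381-380 399-406 '
    'D00101 220-225 230-233 201-210',
]


def kind(value):
    """Classify a cell by its shape; the categories are illustrative assumptions."""
    if re.fullmatch(r"\d+-\d+", value):
        return "offset-range"
    if re.fullmatch(r"[01]\.\d+", value):
        return "certainty"
    if re.fullmatch(r"D\d+", value):
        return "document-id"
    if value.startswith('"'):
        return "string"
    if value.startswith(":"):
        return "entity-ref"
    return "token"


def column_kinds(lines):
    """Return, for each implied column, the set of value kinds found in it."""
    rows = [line.split() for line in lines]
    width = max(len(row) for row in rows)
    return [
        {kind(row[i]) for row in rows if i < len(row)}
        for i in range(width)
    ]


kinds = column_kinds(LINES)
for number, found in enumerate(kinds, start=1):
    # A column with more than one kind is not homogeneous even at this level.
    print(number, sorted(found))
```

On this input, columns 6 and 8 each mix two kinds of value, which matches the observation in the thread that column 6 holds a certainty on line 3 but offset ranges on lines 4-6, and column 8 holds a certainty on line 5 but a document ID on line 6.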