Re: Scoping: "Tabular Data" from Jeni Tennison on 2014-03-02 (public-csv-wg@w3.org from March 2014)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Sun, 2 Mar 2014 19:18:18 +0000
To: "Ceolin, D." <d.ceolin@vu.nl>
Cc: Dan Brickley <danbri@google.com>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <etPan.531383fb.579478fe.137@jenit.local>

Davide,

I’ve updated the spec here:

  http://w3c.github.io/csvw/syntax/

with the definition that I think we agreed to (though I’m happy to continue wordsmithing it), namely that each row contains information about some (one) thing.

Jeni

------------------------------------------------------
From: Ceolin, D. d.ceolin@vu.nl
Reply: Ceolin, D. d.ceolin@vu.nl
Date: 2 March 2014 at 17:52:00
To: Jeni Tennison jeni@jenitennison.com
Subject:  Re: Scoping: "Tabular Data"

>  
> Hi Jeni,
>  
> that's clear, thanks. What about the meaning of each row? (sorry  
> for being pedantic...)
> Best,
>  
> Davide
>  
> On 1 Mar 2014, at 23:31, Jeni Tennison wrote:
>  
> > Davide,
> >
> > I think the upshot of the discussion was that we came to an agreement
> > that in *tabular* data, each column has a consistent meaning
> > across all rows.
> >
> > I’m not sure that conclusion addresses your query.
> >
> > Jeni
> >
> > ------------------------------------------------------  
> > From: Ceolin, D. d.ceolin@vu.nl
> > Reply: Ceolin, D. d.ceolin@vu.nl
> > Date: 28 February 2014 at 11:42:11
> > To: Jeni Tennison jeni@jenitennison.com
> > Subject: Re: Scoping: "Tabular Data"
> >
> >>
> >> Hi all,
> >>
> >> I'm adding Tim's use case to the "use case and requirements doc",
> >> and I was wondering what conclusion we drew from this discussion,
> >> if any.
> >> In particular, I'd say that in Tim's case not only “Each row is
> >> a statement”, but also "Each row is a statement and possibly one
> >> or more annotations about that statement".
> >> This may add some ambiguity (e.g. is the confidence related only
> >> to the triple or to the triple and its provenance?), but it also
> >> offers an easy way to annotate statements (and, BTW, how would that
> >> be translated into RDF? By means of reification or something else?
> >> I'm very interested in trust value representations and related topics).
> >> Also, I'm not sure if these issues are fully covered by the PrimaryKey
> >> and SemanticTypeDefinition requirements.
> >> Cheers,
> >>
> >> Davide
> >>
> >>
> >>> In your bitmap case, you can say:
> >>>
> >>> “Each row is a *row of a bitmap* and the columns are the *first
> >>> pixel*, *second pixel*, *third pixel*... of the row.”
> >>>
> >>> Conversely, in Tim’s case, you can say “Each row is a statement”,
> >>> but you can’t name the columns in a regular way in terms of being
> >>> a property of each statement.
> >>>
> >>> Cheers,
> >>>
> >>> Jeni
> >>>
> >>> (*) or “represents” or “contains information about” or whatever
> >>> you want to say to be more semantically accurate
> >>>
> >>> ------------------------------------------------------  
> >>> From: Dan Brickley danbri@google.com
> >>> Reply: Dan Brickley danbri@google.com
> >>> Date: 23 February 2014 at 16:09:18
> >>> To: Jeni Tennison jeni@theodi.org
> >>> Subject: Re: Scoping: "Tabular Data"
> >>>
> >>>>
> >>>> On 23 February 2014 15:19, Jeni Tennison wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Another scoping question, brought up by Tim Finin’s example
> >>>>> from:
> >>>>>
> >>>>> https://www.w3.org/2013/csvw/wiki/Use_Cases#Representing_entitles_and_facts_extracted_from_text  
> >>>>>
> >>>>> 1> :e4 type         PER
> >>>>> 2> :e4 mention      "Bart" D00124 283-286
> >>>>> 3> :e4 mention      "JoJo" D00124 145-149 0.9
> >>>>> 4> :e4 per:siblings :e7    D00124 283-286 173-179 274-281
> >>>>> 5> :e4 per:age      "10"   D00124 180-181 173-179 182-191 0.9
> >>>>> 6> :e4 per:parent   :e9    D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210
> >>>>>    ^   ^            ^      ^      ^       ^       ^       ^      ^       ^       ^
> >>>>>    1   2            3      4      5       6       7       8      9       10      11
> >>>>>
> >>>>> (I’ve added numbers for the implied columns.)
> >>>>>
> >>>>> To me, this looks like a text-based format in which each line
> >>>>> has a defined format, but where there isn’t the commonality between
> >>>>> values in a single column that I would normally expect in what
> >>>>> I would consider a tabular format.
> >>>>>
> >>>>> So for example, column 6 contains a certainty value on line 3
> >>>>> and an offset range in lines 4-6, while column 8 contains a certainty
> >>>>> value on line 5 and a document ID on line 6.
> >>>>>
> >>>>> If the data looked like (comma separators added for clarity):  
> >>>>>
> >>>>> :e4, type, PER, ,
> >>>>> :e4, mention, "Bart", D00124 283-286,
> >>>>> :e4, mention, "JoJo", D00124 145-149, 0.9
> >>>>> :e4, per:siblings, :e7, D00124 283-286 173-179 274-281,
> >>>>> :e4, per:age, "10", D00124 180-181 173-179 182-191, 0.9
> >>>>> :e4, per:parent, :e9, D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210,
> >>>>> ^ ^ ^ ^ ^
> >>>>> 1 2 3 4 5
> >>>>>
> >>>>> then I would consider it tabular data and could add headers:  
> >>>>>
> >>>>> 1: subject
> >>>>> 2: predicate
> >>>>> 3: object
> >>>>> 4: location
> >>>>> 5: certainty
> >>>>>
> >>>>> Can/should we define tabular data as data where all values in
> >>>>> a given column have a common meaning?
> >>>>
> >>>> In this last form, you might argue that when relationship typing is
> >>>> pushed down into cell values, i.e. potentially a different predicate
> >>>> in each row, then that column does not really have a "common meaning".
> >>>> Or you might say the column does have a broader fixed meaning: it
> >>>> carries information about how values from other columns relate to each
> >>>> other.
> >>>>
> >>>> For the sake of thought experiment I find it useful to come back to
> >>>> pixel-style representation. Consider a 640x480 grid in which red-ness,
> >>>> green-ness and blue-ness values are packed into each cell. Perhaps
> >>>> with a sub-notation using ':', on a 0-1 scale for now:
> >>>>
> >>>> So,
> >>>>
> >>>> 0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0:1.0, 0.4:1.0:0.0
> >>>> 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0
> >>>> ...
> >>>>
> >>>> might give us a fragment of such a grid, with neon, black, white etc.
> >>>> cells.
> >>>>
> >>>> Q: Do these columns have regular meaning?
> >>>> A: Yes; they stand for a column of pixels in a bitmap
> >>>> A: No; each row-column combination stands for a distinct entity
> >>>> (pixel value)
> >>>>
> >>>> Q: Is it useful to use W3C CSVW's work to describe this?
> >>>> A: Sure. It can help us get the syntax details right (whitespace,
> >>>> quotes, newlines) between tools; and it can provide arbitrary per-file
> >>>> metadata. For example the metadata might tell us that the grid of
> >>>> colours comes from Dan's security camera photo at such-and-so a date.
> >>>>
> >>>> Q: Isn't this iffy, since there are much better binary representations
> >>>> for such data? (e.g. digital image formats)
> >>>> A: Yes, but that can be true for more obviously factual data too.
> >>>>
> >>>> Maybe what I'm getting at here is that I'm not sure what "a common
> >>>> meaning" for columns might mean. On the last call I tried to talk
> >>>> about columns being "homogeneous" but that was more in terms of low
> >>>> level data-typing. For example, a column might always contain
> >>>> ISO-8601-style dates, i.e. YYYY-MM-DD. But what they *mean*
> >>>> (birthdate, deathdate, date hired, favourite date, ...) could be fixed
> >>>> by the meaning of a different column. So the column could be
> >>>> datatype-homogeneous but the nature of its per-cell meaning could vary
> >>>> per cell.
> >>>>
> >>>> Dan
> >>>>
> >>>>
> >>>>
> >>>
> >>> --
> >>> Jeni Tennison
> >>> http://www.jenitennison.com/
> >>>
> >>
> >>
> >>
> >>
> >
> > --
> > Jeni Tennison
> > http://www.jenitennison.com/
>  
>  
>  
>  

--  
Jeni Tennison
http://www.jenitennison.com/
Received on Sunday, 2 March 2014 19:18:42 UTC
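
The column-consistency point made in the thread can be checked mechanically. What follows is only a rough sketch in Python, not anything proposed by the working group: the function names (classify, kinds_per_column, to_five_columns) are hypothetical, and the patterns used to recognise a document ID, an offset range and a certainty value are assumptions read off the six example lines. It tokenises the original whitespace-separated lines, reports which column positions hold more than one kind of value, and then regroups each line into the five columns suggested in the thread (subject, predicate, object, location, certainty).

```python
import re
import shlex

# The six example lines quoted in the thread, without the "N>" line labels.
RAW_LINES = [
    ':e4 type PER',
    ':e4 mention "Bart" D00124 283-286',
    ':e4 mention "JoJo" D00124 145-149 0.9',
    ':e4 per:siblings :e7 D00124 283-286 173-179 274-281',
    ':e4 per:age "10" D00124 180-181 173-179 182-191 0.9',
    ':e4 per:parent :e9 D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210',
]


def classify(token):
    """Guess the kind of value a token holds (assumed patterns, not a spec)."""
    if re.fullmatch(r"\d+-\d+", token):
        return "offset"
    if re.fullmatch(r"D\d+", token):
        return "doc"
    if re.fullmatch(r"\d+\.\d+", token):
        return "certainty"
    return "other"


def kinds_per_column(rows):
    """Map each 1-based column position to the set of value kinds seen there."""
    kinds = {}
    for row in rows:
        for index, token in enumerate(row, start=1):
            kinds.setdefault(index, set()).add(classify(token))
    return kinds


def to_five_columns(tokens):
    """Regroup one line as [subject, predicate, object, location, certainty]."""
    subject, predicate, obj = tokens[:3]
    rest = list(tokens[3:])
    certainty = ""
    # A trailing decimal number is taken to be the certainty value.
    if rest and classify(rest[-1]) == "certainty":
        certainty = rest.pop()
    # Whatever remains (document IDs and offset ranges) becomes the location.
    return [subject, predicate, obj, " ".join(rest), certainty]


# shlex.split keeps quoted strings such as "Bart" together as one token.
rows = [shlex.split(line) for line in RAW_LINES]

# Column positions whose values are not all of one kind.
mixed = {col: sorted(k) for col, k in kinds_per_column(rows).items() if len(k) > 1}
print(mixed)

# The regrouped, five-column view of the same data.
for row in rows:
    print(to_five_columns(row))
```

In the regrouped form every row has exactly five cells, with empty strings where a line carries no location or no certainty, so each column position can be given a single header.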