Re: Scoping: "Tabular Data"

Hi all,

I'm adding Tim's use case to the "use case and requirements doc", and I was wondering what conclusion, if any, we drew from this discussion.
In particular, I'd say that in Tim's case it is not only true that "each row is a statement", but also that "each row is a statement plus, possibly, one or more annotations about that statement".
This may add some ambiguity (e.g. does the confidence value apply only to the triple, or to the triple together with its provenance?), but it also offers an easy way to annotate statements. (BTW, how would that be translated into RDF? By means of reification, or something else? I'm very interested in representations of trust values and related issues.)
Also, I'm not sure whether these issues are fully covered by the PrimaryKey and SemanticTypeDefinition requirements.
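
To make the RDF question concrete, here is a rough Python sketch of one possible reading: the row is split into a statement plus annotations, and the annotations are attached to a reified rdf:Statement node. The rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are the standard reification vocabulary; the "ex:" properties, the blank node label and the column names are my own assumptions for illustration, not something taken from Tim's format or from any draft.

```python
def reify(stmt_id, subject, predicate, obj, annotations):
    """Return triples describing one statement and its annotations,
    using the standard RDF reification vocabulary."""
    triples = [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", subject),
        (stmt_id, "rdf:predicate", predicate),
        (stmt_id, "rdf:object", obj),
    ]
    # Each annotation hangs off the statement node, not off the subject.
    for prop, value in annotations.items():
        triples.append((stmt_id, prop, value))
    return triples


# One row in the five-column form discussed in this thread
# (subject, predicate, object, location, certainty).
row = {
    "subject": ":e4",
    "predicate": "mention",
    "object": '"JoJo"',
    "location": '"D00124 145-149"',
    "certainty": '"0.9"',
}

# "ex:location" and "ex:certainty" are hypothetical property names.
triples = reify(
    "_:s1",
    row["subject"],
    row["predicate"],
    row["object"],
    {"ex:location": row["location"], "ex:certainty": row["certainty"]},
)

for s, p, o in triples:
    print(s, p, o, ".")
```

Note that this sketch sidesteps the ambiguity rather than resolving it: both the location and the certainty are attached to the same statement node, so it does not say whether the certainty also covers the provenance.
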
Cheers,

Davide


> In your bitmap case, you can say:
> 
>   “Each row is a *row of a bitmap* and the columns are the *first pixel*, *second pixel*, *third pixel*... of the row.”
> 
> Conversely, in Tim’s case, you can say “Each row is a statement”, but you can’t name the columns in a regular way in terms of being a property of each statement.
> 
> Cheers,
> 
> Jeni
> 
> (*) or “represents” or “contains information about” or whatever you want to say to be more semantically accurate
> 
> ------------------------------------------------------
> From: Dan Brickley danbri@google.com
> Reply: Dan Brickley danbri@google.com
> Date: 23 February 2014 at 16:09:18
> To: Jeni Tennison jeni@theodi.org
> Subject:  Re: Scoping: "Tabular Data"
> 
>> 
>> On 23 February 2014 15:19, Jeni Tennison  
>> wrote:
>>> Hi,
>>> 
>>> Another scoping question, brought up from Tim Finin’s example from:
>>> 
>>> https://www.w3.org/2013/csvw/wiki/Use_Cases#Representing_entitles_and_facts_extracted_from_text  
>>> 
>>> 1> :e4 type PER
>>> 2> :e4 mention "Bart" D00124 283-286
>>> 3> :e4 mention "JoJo" D00124 145-149 0.9
>>> 4> :e4 per:siblings :e7 D00124 283-286 173-179 274-281
>>> 5> :e4 per:age "10" D00124 180-181 173-179 182-191 0.9
>>> 6> :e4 per:parent :e9 D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210
>>> ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
>>> 1 2 3 4 5 6 7 8 9 10 11
>>> 
>>> (I’ve added numbers for the implied columns.)
>>> 
>>> To me, this looks like a text-based format in which each line has a
>>> defined format, but where there isn’t the commonality between values
>>> in a single column that I would normally expect in what I would
>>> consider a tabular format.
>>> 
>>> So for example, column 6 contains a certainty value on line 3 and an
>>> offset range in lines 4-6, while column 8 contains a certainty value
>>> on line 5 and a document ID on line 6.
>>> 
>>> If the data looked like (comma separators added for clarity):  
>>> 
>>> :e4, type, PER, ,
>>> :e4, mention, "Bart", D00124 283-286,
>>> :e4, mention, "JoJo", D00124 145-149, 0.9
>>> :e4, per:siblings, :e7, D00124 283-286 173-179 274-281,
>>> :e4, per:age, "10", D00124 180-181 173-179 182-191, 0.9
>>> :e4, per:parent, :e9, D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210,
>>> ^ ^ ^ ^ ^
>>> 1 2 3 4 5
>>> 
>>> then I would consider it tabular data and could add headers:  
>>> 
>>> 1: subject
>>> 2: predicate
>>> 3: object
>>> 4: location
>>> 5: certainty
>>> 
>>> Can/should we define tabular data as data where all values in a
>>> given column have a common meaning?
>> 
>> In this last form, you might argue that when relationship typing is
>> pushed down into cell values, i.e. potentially a different predicate
>> in each row, then that column does not really have a "common
>> meaning". Or you might say the column does have a broader fixed
>> meaning: it carries information about how values from other columns
>> relate to each other.
>> 
>> For the sake of thought experiment I find it useful to come back to
>> pixel-style representation. Consider a 640x480 grid in which
>> red-ness, green-ness and blue-ness values are packed into each cell.
>> Perhaps with a sub-notation using ':', on a 0-1 scale for now:
>> 
>> So,
>> 
>> 0.4:1.0:0.0, 0.0:0.0:0.0, 1.0:1.0:1.0, 0.4:1.0:0.0
>> 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0, 1.0:1.0:1.0
>> 
>> ... might give us a fragment of such a grid, with neon, black, white
>> etc. cells.
>> 
>> Q: Do these columns have regular meaning?
>> A: Yes; they stand for a column of pixels in a bitmap
>> A: No; each row-column combination stands for a distinct entity  
>> (pixel value)
>> 
>> Q: Is it useful to use W3C CSVW's work to describe this?
>> A: Sure. It can help us get the syntax details right (whitespace,
>> quotes, newlines) between tools; and it can provide arbitrary
>> per-file metadata. For example the metadata might tell us that the
>> grid of colours comes from dan's security camera photo at such-and-so
>> a date.
>> 
>> Q: Isn't this iffy, since there are much better binary representations  
>> for such data? (e.g. digital image formats)
>> A: Yes, but that can be true for more obviously factual data too.  
>> 
>> Maybe what I'm getting at here is that I'm not sure what "a common
>> meaning" for columns might mean. On the last call I tried to talk
>> about columns being "homogeneous" but that was more in terms of
>> low-level data-typing. For example, a column might always contain
>> ISO-8601-style dates, i.e. YYYY-MM-DD. But what they *mean*
>> (birthdate, deathdate, date hired, favourite date, ...) could be
>> fixed by the meaning of a different column. So the column could be
>> datatype-homogeneous but the nature of its per-cell meaning could
>> vary per cell.
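
A small Python sketch of the situation Dan describes here, as I understand it: one column is homogeneous at the datatype level (every value parses as an ISO 8601 calendar date), while what each date means is only known from another column in the same row. The rows and the column layout are invented for illustration.

```python
from datetime import date

# Hypothetical rows: (person, kind of date, value).
rows = [
    ("alice", "birthdate", "1970-01-31"),
    ("alice", "date hired", "1999-06-01"),
    ("bob", "deathdate", "2001-12-25"),
]

# Every value in the third column parses as the same datatype ...
parsed = [date.fromisoformat(value) for _, _, value in rows]

# ... but what each date means is only known from the second column
# of the same row.
meanings = {
    (person, kind): parsed_date
    for (person, kind, _), parsed_date in zip(rows, parsed)
}

for (person, kind), parsed_date in meanings.items():
    print(person, kind, parsed_date.isoformat())
```

So a datatype check on the third column alone would pass for every row, yet it would say nothing about which property each cell expresses.
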
>> 
>> Dan
>> 
>> 
>> 
> 
> --  
> Jeni Tennison
> http://www.jenitennison.com/
> 

Received on Friday, 28 February 2014 11:40:05 UTC