Re: Use machine-readable standardized data formats / Use non-proprietary data formats

The notion of context is what I was trying to get at by talking about a set of things that are not intended to be meaningful on their own. As you say, 31 degrees C is not meaningful out of context. In a dataset with other temperature readings and metadata, it has meaning. So I think we have some overlap there. But you are thinking about it a little differently and end up with more stuff in scope than I do. I don’t think that legislation is something this group should be considering as data. (That gets into structured documents, which I view as a slippery slope into everything-land.) If we ruled out everything but collections whose pieces *lack* meaning when used alone, I think legislation would be ruled out, as would the rest of the things that make me worry about boiling the ocean. A section of a law conveys a lot of meaning on its own. What I don’t quite understand from your message is what you think would be ruled out of scope by a rule that says anything put into some sensible perspective by having context is in scope. Doesn’t context put everything into perspective?

After thinking about graph data a bit, I’m liking the tabular notion more. Since graph data can be written as matrices (an adjacency matrix, for instance), and matrices are really a form of table, that’s not so difficult to rule in after all. If you define JSON as tabular, we are probably in agreement. I’m just not sure that most readers would interpret the word tabular as including key-value stores, but we could clarify that in a sentence. Or we could just rule in a short list of forms, like tabular and key-value. Legislation is clearly ruled out if we go with tabular data. To be clear, I’m thinking of stuff that *is in* a “tabular” form, not just anything that could be represented that way, because anything can. The same goes for graph data, as Erik points out. I don’t think we should rule in everything that *could be* expressed by a graph representation, but I would rule in anything that *is*. So, if you want to make a matrix of your relationships with your cousins and publish it on the web, we have some guidance for you, but as for the photos you took at the family picnic, you’re on your own.
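
To make the graph-as-table point concrete, here is a minimal sketch in Python. It assumes an undirected graph given as a list of edges; the names and the adjacency_table helper are made up for illustration and are not from any tool this group uses. It turns the edge list into an adjacency matrix with a header row, which is just a table.

```python
# Hypothetical example data: undirected "is related to" edges between people.
edges = [("Ann", "Bob"), ("Ann", "Cy"), ("Bob", "Dee")]


def adjacency_table(edges):
    """Return (header, rows) for an undirected graph given as an edge list.

    Each row is [node_name, flag, flag, ...], where a flag is 1 if the row's
    node is connected to the node named in that column, else 0.
    """
    # Fix a stable node order so rows and columns line up.
    nodes = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(nodes)}

    # Start with an all-zero square matrix, then mark each edge both ways.
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1

    header = [""] + nodes
    rows = [[name] + matrix[index[name]] for name in nodes]
    return header, rows


header, rows = adjacency_table(edges)

# Print the table as comma-separated lines.
for line in [header] + rows:
    print(",".join(str(cell) for cell in line))
```

The result is one row and one column per person, so something published in this form would fall under "tabular" in the sense above. Whether a given graph format counts would still depend on how it is actually published, not on whether it could be converted this way.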

-Annette



--
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
510-495-2935

On Aug 14, 2015, at 11:55 AM, Makx Dekkers <mail@makxdekkers.com> wrote:

> 
> Erik wrote:
> 
>> one person's model/reality is another person's data. trying to understand
>> where to draw the line is a futile attempt with a long history of trying
>> and failing.
> 
> So maybe the reason we have never managed to decide what we mean by 'data'
> is because it is not possible to define it and therefore our attempts have
> been futile. Good point.
> 
> Maybe we need to look at it from a different angle. Here is what I think
> could maybe be a way forward.
> 
> Someone mentioned the word 'context' in another thread, and maybe that is
> what we need to look at.
> 
> One way of looking at context is how DCAT defines 'dataset': "A collection
> of data, published or curated by a single agent, and available for access or
> download in one or more formats". So not individual observations, sentences,
> numbers, but data items that belong together in some sort of 'collection'.
> 
> My proposal would be not to try to define limits related to what the data
> *is* or how it can be used but just to consider the context in which the
> data exists or is embedded. If the context puts the data in some sensible
> perspective, it's in scope; if it is just bits and pieces without a clear
> context, it's out of scope.
> 
> Here are two examples that I imagined:
> 
> 1. meteorological information
> 
> * 31 degrees Celsius is just a temperature;
> * The fact that 31 degrees Celsius was the maximum temperature today in the
> village where I am is a piece of information.
> My assumption is that this level is not what we want to be concerned with in
> this group.
> 
> I think that we start getting interested if there is a collection of those
> pieces of information, for example a list of today's maximum temperatures
> across the whole province or country, or in a bigger context, when this is
> part of the list of all maximum temperatures across the country for all days
> of the year. As far as I understand, such lists are what DCAT would call
> 'datasets'.
> 
> 2. legal information
> 
> * A single sentence is just that;
> * A legal article with some sentences is a piece of information.
> Again, not the kinds of things that we're concerned with.
> 
> As soon as the articles are embedded in a complete legal act with
> definitions and references, then it becomes again "a collection of data,
> published or curated by a single agent, and available for access or download
> in one or more formats" (a dataset) and therefore of interest to us. 
> 
> 
> Happy to hear people's views on this.
> 
> Makx.

Received on Friday, 14 August 2015 22:55:10 UTC