RE: Use machine-readable standardized data formats / Use non-proprietary data formats

Erik wrote:

> one person's model/reality is another person's data. trying to understand
> where to draw the line is a futile attempt with a long history of trying
> failing.

So maybe the reason we have never managed to decide what we mean by 'data'
is that it is not possible to define it, and therefore our attempts have
been futile. Good point.

Maybe we need to look at it from a different angle. Here is what I think
could be a way forward.

Someone mentioned the word 'context' in another thread, and maybe that is
what we need to look at.

One way of looking at context is how DCAT defines 'dataset': "A collection
of data, published or curated by a single agent, and available for access or
download in one or more formats". So not individual observations, sentences,
numbers, but data items that belong together in some sort of 'collection'.

My proposal would be not to try to define limits related to what the data
*is* or how it can be used but just to consider the context in which the
data exists or is embedded. If the context puts the data in some sensible
perspective, it's in scope; if it is just bits and pieces without a clear
context, it's out of scope.

Here are two examples that I imagined:

1. meteorological information

* 31 degrees Celsius is just a temperature;
* The fact that 31 degrees Celsius was the maximum temperature today in the
village where I am is a piece of information.

My assumption is that this level is not what we want to be concerned with in
this group.

I think that we start getting interested if there is a collection of those
pieces of information, for example a list of today's maximum temperatures
across the whole province or country, or in a bigger context, when this is
part of the list of all maximum temperatures across the country for all days
of the year. As far as I understand, such lists are what DCAT would call
datasets.

2. legal information

* A single sentence is just that;
* A legal article with some sentences is a piece of information.

Again, not the kinds of things that we're concerned with.

As soon as the articles are embedded in a complete legal act with
definitions and references, they again become "a collection of data,
published or curated by a single agent, and available for access or download
in one or more formats" (a dataset) and are therefore of interest to us.
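
To make the rule a bit more concrete, here is a rough Python sketch of how
the scope test could be read. It is only an illustration: the Collection
class, its field names, the in_scope function and the sample values are all
hypothetical, and the three conditions are my assumed reading of the DCAT
wording, not anything DCAT itself specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Hypothetical stand-in for the DCAT notion of a dataset."""
    items: list                                   # data items that belong together
    agent: str = ""                               # who publishes or curates them
    formats: list = field(default_factory=list)   # formats offered for access or download


def in_scope(c: Collection) -> bool:
    # Assumed reading of the proposal: more than one item, a responsible
    # agent, and at least one format in which the collection is available.
    return len(c.items) > 1 and bool(c.agent) and len(c.formats) > 0


# A lone temperature reading has no surrounding context (illustrative values).
single_reading = Collection(items=[31.0])

# A list of maximum temperatures published by one agent in one format.
daily_maxima = Collection(
    items=[31.0, 29.5, 33.2],
    agent="national weather service",
    formats=["text/csv"],
)

print(in_scope(single_reading))  # False
print(in_scope(daily_maxima))    # True
```

The same test would apply to the legal example: a single article on its own
fails it, while a complete act published by one agent in some format passes.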

Happy to hear people's views on this.


Received on Friday, 14 August 2015 18:56:11 UTC