RE: Use machine-readable standardized data formats / Use non-proprietary data formats


Data is a hard problem and this is aiming quite high:

  "... the web as an electronic delivery mechanism for structured data in open formats ..."

Other groups address visualisation, etc.

We are the millers' group, with the objective of producing standardised flour: not over-glamorous, but necessary. Other groups are for bakery, pastry, etc. :-)


From: Mark Harrison []
Sent: 13 August 2015 07:51
To: Annette Greiner; CARRASCO BENITEZ Manuel (DGT)
Cc: Mark Harrison
Subject: Re: Use machine-readable standardized data formats / Use non-proprietary data formats

Hi Annette,

I completely agree with you that the discussion should be about how to encourage people to move beyond publishing static, immutable documents and towards publishing live resources (data + models + interactive visualisations) on the web that are open, interactive and collaborative. The aim is to make it as easy for people and machines to retrieve, combine, compare, re-analyse and re-visualise data from multiple sources as it is today for people to collaborate on open source software using web technology.

If our focus appears to be primarily on the web as an electronic delivery mechanism for structured data in open formats, we're probably aiming far too low and not giving people enough of a bold vision about what live, interactive, collaborative, mashable data on the web could be like in the future.

There are already some sites such as that are making good progress in that direction.  There are also toolkits and frameworks such as d3.js that make this vision easier to achieve.  We can probably find and critique other examples and comment on the aspects that they do well, as well as aspects where they could improve further.  In this way, we can explain the big vision for what 'data on the web' really could be, if done well.

As Erik says, it needs to be webby.  That could mean that the raw data and the data transformations and visualisation are all fully interlinked on the web in the finest detail, potentially down to the granularity of each individual datapoint.  Furthermore, if we want to find related datasets for comparison, we should be able to easily retrieve those and overlay them within the same live visualisation - or even try modelling or visualising the data in different ways, all interactively and collaboratively on the web.

Even with 5-star linked open data, we can link to existing data but cannot immediately link to future data that has not yet been generated - so instead we also need to provide rich metadata that describes the scope, coverage and granularity of the data well. In future, we might expect that web search engines can not only help us to retrieve datasets and their metadata, but also allow us to tweak any of the metadata parameters in order to search for related datasets: similar economic data about a different country or a different organisation, related scientific data for a related material, or the same material studied using a different but related experimental technique. We could then compare the data easily, without having to spend so much effort tracking down the data, reverse-engineering charts and graphs to extract it, etc.
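To make the "tweak a metadata parameter" idea concrete, here is a minimal sketch in Python. The field names are loosely DCAT-inspired but are my own assumptions, not a published vocabulary, and the `tweak` helper is purely illustrative:

```python
# Illustrative only: a minimal metadata record describing the scope,
# coverage and granularity of a dataset. Field names are assumptions,
# loosely inspired by DCAT, not a real published vocabulary.
dataset_metadata = {
    "title": "Quarterly trade statistics",
    "spatial_coverage": "FR",          # country the data describes
    "temporal_coverage": "2010/2014",  # ISO 8601 interval
    "granularity": "quarterly",
    "theme": "economy",
}

def tweak(metadata, **changes):
    """Return a copy of a metadata record with some parameters changed,
    e.g. to search for the same kind of data about a different country."""
    return {**metadata, **changes}

# Hypothetical search for related data: same shape, different country.
related_query = tweak(dataset_metadata, spatial_coverage="DE")
```

A search engine that understood such records could treat `related_query` as a query template, which is the kind of parameterised dataset discovery described above.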

To some extent, web technology already exists to make the whole Data Model, View and Controller entirely web-based, resulting in a live, interactive, collaborative space for data sharing and analysis, which has so many advantages over static published documents. My reference to D3.js was one example of such technology. I think it's a good thing to point people to multiple toolkits and frameworks that they can already use to implement the bold vision of truly collaborative, interactive data on the web.

I think we would miss a great opportunity if this group cannot clearly explain to everyone (including any member of the public) what that bold vision for 'data on the web' could be like.  It could go far beyond providing datasets via the web.

Some people may take the time to read rather dry documents of best practices and might even understand some of them.  Others may understand the vision better if we can point to existing real examples of 'data on the web done very well' and explain which aspects they currently do very well - and what they could do even better.  The 'gold standard' is probably a blend of the best aspects of several existing examples.

When everyone can understand how data that is truly live on the web has the potential to greatly increase the efficiency of research, data analysis and the generation of new insights in so many different fields, the best practices documents from this group become a highly relevant and practical step-by-step instruction manual to help everyone achieve that vision.

Best wishes,

- Mark

From: Annette Greiner <>
Sent: 12 August 2015 18:31
Cc: Mark Harrison
Subject: Re: Use machine-readable standardized data formats / Use non-proprietary data formats

You're not seriously suggesting people should make data available in WordPerfect format, are you?
This discussion seems to be wandering into the realm of publishing documents.

Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory

On Aug 12, 2015, at 7:28 AM, wrote:

> One should have at least the following variants of the resource:
> - Original     : foo.wp  - WordPerfect 3.0 ~1982, perhaps still processable
> - Content      : foo.txt - textual, hopefully processable in 100 years
> - Presentation : foo.tif - TIFF ~1986, perhaps still viewable, might be
> So:
>  -     - negotiate and give me the best
>  -  - I can still process WP
>  - - I want to process the text, no presentation
>  - - I really want to see how the doc looks
> Regards
> Tomas
>> Perhaps the way we can formulate this is to say that some document
>> formats (such as PDF, .doc / .docx and even .xls / .xlsx ) are
>> concerned with presentation of information in a particular format or
>> layout and therefore carry a significant amount of typesetting /
>> formatting information overhead in addition to the underlying data.
>> Furthermore, at the time those document-centric formats were
>> developed, ease of access to the underlying data and the unambiguous
>> meaning of specific data fields might not have been the main priority
>> in their design.
>> When the main priority is to ensure that the underlying data is
>> available on the web so that others can re-use it, we recommend using
>> simpler data formats such as CSV, TSV, JSON (or better still JSON-LD),
>> RDF or XML.
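The multi-variant scheme Tomas sketches above is essentially HTTP content negotiation. Here is a toy server-side sketch in Python of how the Accept header might select among the stored variants of "foo". The MIME type used for WordPerfect and the whole mapping are my assumptions for illustration, not a definitive implementation:

```python
# Illustrative sketch of server-side content negotiation over the
# variants of "foo" described above. The WordPerfect MIME type and
# this mapping are assumptions, not a real server configuration.
VARIANTS = {
    "application/vnd.wordperfect": "foo.wp",  # original, for clients that can still process WP
    "text/plain": "foo.txt",                  # content only, hopefully processable in 100 years
    "image/tiff": "foo.tif",                  # fixed presentation, how the doc looks
}

def negotiate(accept_header):
    """Return the first stored variant matching the client's Accept
    header, falling back to plain text as the long-lived default."""
    for item in accept_header.split(","):
        mime = item.split(";")[0].strip()  # drop any ;q= quality parameters
        if mime in VARIANTS:
            return VARIANTS[mime]
    return VARIANTS["text/plain"]
```

For example, a client sending `Accept: image/tiff, text/plain` would receive `foo.tif`, while a generic `Accept: */*` falls back to the plain-text content in this sketch.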


Received on Thursday, 13 August 2015 09:19:12 UTC