Re: Use machine-readable standardized data formats / Use non-proprietary data formats from Mark Harrison on 2015-08-13 (public-dwbp-wg@w3.org from August 2015)

From: Mark Harrison <mark.harrison@gs1.org>
Date: Thu, 13 Aug 2015 20:19:18 +0000
To: Annette Greiner <amgreiner@lbl.gov>, Makx Dekkers <mail@makxdekkers.com>
CC: "public-dwbp-wg@w3.org" <public-dwbp-wg@w3.org>
Message-ID: <1439497159118.22640@gs1.org>

Hi Annette,

Regarding “original data, in whatever format they have it”, I would interpret that as the carefully selected processed data that was used to generate the diagrams, graphs, tables and charts that appear within a final published document that has undergone some form of peer review or careful editorial control.  We're not talking about publishing every scribble in a lab notebook.  I think we can agree on that point at least.

I also don't think we need to go into too much specific detail for every conceivable media type.  From a provenance perspective, it is probably very helpful to link the extracted 'original data' to the DOI or URI of the published document and perhaps its media type.  In many situations, the data that generated graphs or charts will usually be tabular anyway - or can be expressed in a tabular format.  Even the lyrics of a song or the subtitles of a video can be treated as tabular data, indexed against a timestamp relative to the start of an audio/video file.  
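
To make the subtitle/lyrics case concrete, here is a minimal sketch in Python of timed text expressed as a two-column table. The cue values and the column names (start_seconds, text) are invented for illustration and are not taken from any real file or standard.

```python
import csv
import io

# Hypothetical cues: (seconds from the start of the media file, text).
cues = [
    (0.0, "First line of the song"),
    (4.5, "Second line of the song"),
    (9.25, "Third line of the song"),
]


def cues_to_csv(rows):
    """Serialise (start_seconds, text) pairs as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["start_seconds", "text"])
    for start, text in rows:
        # Fixed two-decimal formatting keeps the timestamp column uniform.
        writer.writerow([f"{start:.2f}", text])
    return buf.getvalue()


lyrics_csv = cues_to_csv(cues)
print(lyrics_csv)
```

Each row is one cue, so the same tabular tooling that handles chart data could, in principle, handle this too.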

I also agree with the comments that we should not broaden our scope to include best practices on data visualisation - especially if that is already being handled elsewhere.  We might reference those efforts.

However, the point I was trying to make (in response to your original remark about publishing documents) is that if someone has received a document and wants to obtain the underlying data, it is not a trivial matter.  They could contact the authors and request the data.  They might use software (such as DataThief http://www.datathief.org/ ) to reverse engineer a graph or chart, to attempt to extract the original data.  They might copy and paste data from a table and do some clean-up work.  All of this is slow and laborious and may also result in some loss of data fidelity.

If we can give authors practical guidance on publishing the data underlying the tables, graphs and charts in the final document in a tabular format, with appropriate metadata that includes a link to the published document in which it appears and a reference to the table or figure within that document, then I would not be surprised if that approach covers at least 80% of the use cases, without having to do anything specific for different media types, current or archaic.
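
As a rough sketch of what such guidance might look like in practice (Python; every name here is an assumption made for illustration: the sample values, the example.com URL, the placeholder DOI and the metadata keys are not from any agreed vocabulary), the data behind one figure could be written as CSV alongside a small metadata record that points back to the published document and the figure:

```python
import csv
import io
import json

# Hypothetical data points behind one figure in a published document.
figure_rows = [
    {"temperature_K": 280, "conductivity_S_per_m": 1.2},
    {"temperature_K": 300, "conductivity_S_per_m": 1.5},
]


def to_csv(rows):
    """Write a list of dicts as CSV, using the first row's keys as header."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def describe(data_url, document_uri, figure_label, columns):
    """Build an illustrative metadata record linking data to its document."""
    return {
        "data_url": data_url,
        "media_type": "text/csv",
        "source_document": document_uri,  # DOI or URI of the publication
        "source_figure": figure_label,    # table or figure within it
        "columns": columns,
    }


data_csv = to_csv(figure_rows)
metadata = describe(
    data_url="http://example.com/paper/figure-3.csv",   # hypothetical
    document_uri="https://doi.org/10.0000/example",     # placeholder DOI
    figure_label="Figure 3",
    columns=list(figure_rows[0].keys()),
)
print(data_csv)
print(json.dumps(metadata, indent=2))
```

The point of the sketch is only that the link from data back to document and figure is cheap to record at publication time; the actual metadata vocabulary would be for the group to recommend.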

If we do that well, then we can help to get the refined, selected, processed data out of documents and make it available via the web, so that anyone can then re-use that data for whatever purpose.  The bold vision I alluded to may seem too futuristic to some - though openspending.org and some other sites seem to be heading in that direction.  We don't need to focus on that for now - but our current work will also provide a solid foundation for others to implement wholly web-based collaborative interactive data mashups in future if they wish to do so - because the original data and corresponding metadata would be on the web in a non-proprietary machine-readable format and web-based tools are already available to make such data-driven visualisations easier.

In that future vision, for some kinds of data the current reliance on proprietary application software may diminish (if the data analysis and visualisation is instead done using web-based tools) - and if/when that shift occurs, it may also be more natural for people to think about publishing data rather than primarily publishing documents that encapsulate that data in a rather inaccessible format.   That is the other reason why I mentioned that future vision - in response to your concern about the discussion drifting toward publishing documents.  I was trying to provide some motivation for why people should make the effort to publish the original data on the web - what kinds of things could it enable?  It may seem like additional effort for the data publisher but it is the data consumer community that benefits from that effort - and for any given dataset or document we should hope that the number of data consumers outweighs the number of data publishers, so that altruistic extra effort by the data publisher is justifiable.

Best wishes,

- Mark


________________________________________
From: Annette Greiner <amgreiner@lbl.gov>
Sent: 13 August 2015 19:29
To: Makx Dekkers
Cc: public-dwbp-wg@w3.org
Subject: Re: Use machine-readable standardized data formats / Use    non-proprietary data formats

No, let’s throw it out entirely. I strongly disagree with the idea that this group should concern itself with the publication of all kinds of digital resources on the Web. At TPAC we defined the scope to include only best practices that are unique to publishing data on the web. Yes, other kinds of media can be turned into data, but that doesn't mean that our scope must embrace every media type posted on the web. In the end, we end up trying to write best practices for publishing anything on the web, which is clearly beyond our charter. The more we try to cover everything that could in any way be conceived as data, the less specific and helpful our guidance about publishing data becomes. I already worry that we are publishing a BP document with very little that is helpful to people who think of themselves as publishing data on the web. If we can’t agree to even that, then I think I am in the wrong working group.

Speaking practically, I have no idea what is meant by “original data, in whatever format they have it.” In my world, researchers create a variety of data-containing documents, the vast majority of which they would never dream of making public because they know it to be messy, incomplete, preliminary, and useful only to those in their own research group. Scientific data goes through a series of evolutions to make it usable for others, and publishing every incarnation of it would not only preclude publication in a peer-reviewed journal, but it would also litter the web with useless material. Not even the archivists around here want that dross.
-Annette
--
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
510-495-2935

On Aug 13, 2015, at 4:15 AM, Makx Dekkers <mail@makxdekkers.com> wrote:

> All,
>
> I very much agree with Tomas here.
>
> I think this group is supposed to give advice to people who have today's
> data and want to know how to best publish it on the Web, not paint a picture
> of how the world of data may look in ten or twenty years' time. I think
> today's data is mostly not ready for that quantum jump. Not catering for
> today's needs means this group will be writing science fiction. That can be
> entertaining but is maybe not so useful.
>
> I also agree with his WordPerfect argument. Publishers should be encouraged
> to publish original data in whatever format they have it. In addition, the
> advice should be to provide the data also in additional and higher-starred
> formats to make it more useful.
>
> Annette seems to suggest that "documents" are out of scope. I think the
> outcome of earlier discussions was that the definition of "data" is very
> broad and includes all kinds of digital resources on the Web. As an example,
> all the stuff on http://www.legislation.gov.uk/ is text and all of it is on
> the Web; it has the whole range of issues: formats (PDF, HTML, XML),
> identification, versioning, archiving, metadata, multilingualism,
> granularity etc. etc. Let's not throw that out.
>
> Makx.
>
>
>> -----Original Message-----
>> From: Manuel.CARRASCO-BENITEZ@ec.europa.eu
>> [mailto:Manuel.CARRASCO-BENITEZ@ec.europa.eu]
>> Sent: 13 August 2015 11:19
>> To: mark.harrison@gs1.org; amgreiner@lbl.gov
>> Cc: phila@w3.org; mark.harrison@cantab.net; public-dwbp-wg@w3.org
>> Subject: RE: Use machine-readable standardized data formats / Use non-
>> proprietary data formats
>>
>> Mark,
>>
>> Data is a hard problem and this is aiming quite high:
>>
>>  "... the web as an electronic delivery mechanism for structured data in
>> open formats ..."
>>
>> Other groups address visualisation, etc.
>>
>> We are the miller group with the objective to produce standardised flour:
>> not over-glamorous, but necessary. Other groups are for bakery, pastry,
>> etc. :-)
>>
>> Regards
>> Tomas
>>
>> ________________________________________
>> From: Mark Harrison [mark.harrison@gs1.org]
>> Sent: 13 August 2015 07:51
>> To: Annette Greiner; CARRASCO BENITEZ Manuel (DGT)
>> Cc: phila@w3.org; Mark Harrison; public-dwbp-wg@w3.org
>> Subject: Re: Use machine-readable standardized data formats / Use   non-
>> proprietary data formats
>>
>> Hi Annette,
>>
>> I completely agree with you that the discussion should be about how to
>> encourage people to move beyond / away from publishing static immutable
>> documents and towards publishing live (data + models + interactive
>> visualisations) on the web that are open, interactive and collaborative
>> and make it as easy as possible for people and machines to retrieve,
>> combine, compare, re-analyse and re-visualise data from multiple sources
>> just as easily as people can use web technology to collaborate on open
>> source software today.
>>
>> If our focus appears to be primarily on the web as an electronic delivery
>> mechanism for structured data in open formats, we're probably aiming far
>> too low and not giving people enough of a bold vision about what live,
>> interactive, collaborative, mashable data on the web could be like in the
>> future.
>>
>> There are already some sites such as openspending.org that are making good
>> progress in that direction.  There are also toolkits and frameworks such
>> as d3.js that make this vision easier to achieve.  We can probably find
>> and critique other examples and comment on the aspects that they do well,
>> as well as aspects where they could improve further.  In this way, we can
>> explain the big vision for what 'data on the web' really could be, if done
>> well.
>>
>> As Erik says, it needs to be webby.  That could mean that the raw data and
>> the data transformations and visualisation are all fully interlinked on
>> the web in the finest detail, potentially down to the granularity of each
>> individual datapoint.  Furthermore, if we want to find related datasets
>> for comparison, we should be able to easily retrieve those and overlay
>> them within the same live visualisation - or even try modelling or
>> visualising the data in different ways, all interactively and
>> collaboratively on the web.
>>
>> Even with 5-star linked open data, we can link to existing data but cannot
>> immediately link to future data that has not yet been generated - so
>> instead we also need to provide rich metadata that describes the scope,
>> coverage and granularity of the data well.  In future, we might expect
>> that web search engines can not only help us to retrieve datasets and
>> their metadata - but allow us to tweak any of the metadata parameters in
>> order to search for related datasets, e.g. to find similar economic data
>> about a different country or different organisation - or to find related
>> scientific data for a related material - or for the same material studied
>> using a different but related experimental technique, so that we can
>> compare the data easily, without having to spend so much effort tracking
>> down the data, reverse-engineering charts and graphs to extract data, etc.
>>
>> To some extent, web technology already exists to enable the whole Data
>> Model, View and Controller to all be entirely web-based, resulting in a
>> live, interactive, collaborative space for data sharing and analysis,
>> which has so many advantages over static published documents.  My
>> reference to D3.js was one example of such technology.  I think it's a
>> good thing to point people to multiple toolkits and frameworks that they
>> can already use to implement the bold vision of truly collaborative,
>> interactive data on the web.
>>
>> I think we would miss a great opportunity if this group cannot clearly
>> explain to everyone (including any member of the public) what that bold
>> vision for 'data on the web' could be like.  It could go far beyond
>> providing datasets via the web.
>>
>> Some people may take the time to read rather dry documents of best
>> practices and might even understand some of them.  Others may understand
>> the vision better if we can point to existing real examples of 'data on
>> the web done very well' and explain which aspects they currently do very
>> well - and what they could do even better.  The 'gold standard' is
>> probably a blend of the best aspects of several existing examples.
>>
>> When everyone can understand how data that is truly live on the web has
>> the potential to greatly increase the efficiency of research and data
>> analysis and generation of new insights in so many different fields, then
>> the best practices documents from this group become a highly relevant and
>> practical step-by-step instruction manual to help everyone achieve that
>> vision.
>>
>> Best wishes,
>>
>> - Mark
>>
>> ________________________________________
>> From: Annette Greiner <amgreiner@lbl.gov>
>> Sent: 12 August 2015 18:31
>> To: Manuel.CARRASCO-BENITEZ@ec.europa.eu
>> Cc: phila@w3.org; Mark Harrison; public-dwbp-wg@w3.org
>> Subject: Re: Use machine-readable standardized data formats / Use   non-
>> proprietary data formats
>>
>> You're not seriously suggesting people should make data available in
>> WordPerfect format, are you?
>> This discussion seems to be wandering into the realm of publishing
>> documents.
>>
>> --
>> Annette Greiner
>> NERSC Data and Analytics Services
>> Lawrence Berkeley National Laboratory
>> 510-495-2935
>>
>> On Aug 12, 2015, at 7:28 AM, Manuel.CARRASCO-BENITEZ@ec.europa.eu
>> wrote:
>>
>>> One should have at least the following variants of the resource:
>>>
>>> - Original     : foo.wp  - WordPerfect 3.0 ~1982, perhaps still processable
>>> - Content      : foo.txt - textual, hopefully processable in 100 years
>>> - Presentation : foo.tif - TIFF ~1986, perhaps still viewable, might be foo.ps
>>>
>>> So:
>>> - http://example.com/foo     - negotiate and give me the best
>>> - http://example.com/foo.wp  - I can still process WP
>>> - http://example.com/foo.txt - I want to process the text, no presentation
>>> - http://example.com/foo.tif - I really want to see how the doc looks
>>> Regards
>>> Tomas
>>>
>>>> Perhaps the way we can formulate this is to say that some document
>>>> formats (such as PDF, .doc / .docx and even .xls / .xlsx ) are
>>>> concerned with presentation of information in a particular format or
>>>> layout and therefore carry a significant amount of typesetting /
>>>> formatting information overhead in addition to the underlying data.
>>>> Furthermore, at the time those document-centric formats were
>>>> developed, ease of access to the underlying data and the unambiguous
>>>> meaning of specific data fields might not have been the main priority
>>>> in their design.
>>>>
>>>> When the main priority is to ensure that the underlying data is
>>>> available on the web so that others can re-use it, we recommend using
>>>> simpler data formats such as CSV, TSV, JSON (or better still
>>>> JSON-LD), RDF or XML.
>>>
>>
>>
>>
>>
>> CONFIDENTIALITY / DISCLAIMER: The contents of this e-mail are
> confidential
>> and are not to be regarded as a contractual offer or acceptance from GS1
>> (registered in Belgium).
>> If you are not the addressee, or if this has been copied or sent to you in
> error,
>> you must not use data herein for any purpose, you must delete it, and
>> should inform the sender.
>> GS1 disclaims liability for accuracy or completeness, and opinions
> expressed
>> are those of the author alone.
>> GS1 may monitor communications.
>> Third party rights acknowledged.
>> (c) 2012.
>
>
>



Received on Thursday, 13 August 2015 20:19:53 UTC