Re: Use machine-readable standardized data formats / Use non-proprietary data formats from Annette Greiner on 2015-08-13 (public-dwbp-wg@w3.org from August 2015)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Thu, 13 Aug 2015 12:47:55 -0700
To: Makx Dekkers <mail@makxdekkers.com>
Cc: public-dwbp-wg@w3.org
Message-Id: <A6EA2F4E-CB47-40FE-AD6E-E09247DAE19F@lbl.gov>
I think we do need to scope it, but limiting it to tabular data is too restrictive. Even CSV and JSON wouldn’t qualify. If you meant structured data, I think that could work.
- Annette

--
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
510-495-2935

On Aug 13, 2015, at 12:11 PM, Makx Dekkers <mail@makxdekkers.com> wrote:

> There was a thread back in March 2015 (subject Meaning of publishing Data on
> the Web) where I proposed to narrow the definition of data, for the scope of
> this group, to tabular data only.
> 
> As far as I remember, that narrowing of scope was rejected.
> 
> The problem that we still haven't solved is that different members of this
> group may have very different opinions on what 'data' is. People from a
> scientific background may think about observations of natural phenomena,
> humanists think oral histories, legal people see their legislation and
> court decisions, financial people think budgets and spending, government
> people think base registers with information about buildings and people,
> geo-people think maps, museum people think images and 3D models of art works
> etc. etc. The use cases at https://www.w3.org/2013/dwbp/wiki/Use_Cases
> contain many different types of 'data'.
> 
> Annette writes "The more we try to cover everything that could in any way be
> conceived as data, the less specific and helpful our guidance about
> publishing data becomes." That was exactly the point I was making here
> https://lists.w3.org/Archives/Public/public-dwbp-wg/2015Mar/0036.html. Oh,
> and even further back:
> https://lists.w3.org/Archives/Public/public-dwbp-wg/2014Feb/0029.html.
> 
> Makx.
> 
> 
>> -----Original Message-----
>> From: Annette Greiner [mailto:amgreiner@lbl.gov]
>> Sent: 13 August 2015 20:30
>> To: Makx Dekkers <mail@makxdekkers.com>
>> Cc: public-dwbp-wg@w3.org
>> Subject: Re: Use machine-readable standardized data formats / Use
>> non-proprietary data formats
>> 
>> No, let's throw it out entirely. I strongly disagree with the idea that
>> this group should concern itself with the publication of all kinds of
>> digital resources on the Web. At TPAC we defined the scope to include
>> only best practices that are unique to publishing data on the web. Yes,
>> other kinds of media can be turned into data, but that doesn't mean that
>> our scope must embrace every media type posted on the web. In the end, we
>> end up trying to write best practices for publishing anything on the web,
>> which is clearly beyond our charter. The more we try to cover everything
>> that could in any way be conceived as data, the less specific and helpful
>> our guidance about publishing data becomes. I already worry that we are
>> publishing a BP document with very little that is helpful to people who
>> think of themselves as publishing data on the web. If we can't agree to
>> even that, then I think I am in the wrong working group.
>> 
>> Speaking practically, I have no idea what is meant by "original data, in
>> whatever format they have it." In my world, researchers create a variety
>> of data-containing documents, the vast majority of which they would never
>> dream of making public because they know them to be messy, incomplete,
>> preliminary, and useful only to those in their own research group.
>> Scientific data goes through a series of evolutions to make it usable for
>> others, and publishing every incarnation of it would not only preclude
>> publication in a peer-reviewed journal, but it would also litter the web
>> with useless material. Not even the archivists around here want that
>> dross.
>> -Annette
>> --
>> Annette Greiner
>> NERSC Data and Analytics Services
>> Lawrence Berkeley National Laboratory
>> 510-495-2935
>> 
>> On Aug 13, 2015, at 4:15 AM, Makx Dekkers <mail@makxdekkers.com>
>> wrote:
>> 
>>> All,
>>> 
>>> I very much agree with Tomas here.
>>> 
>>> I think this group is supposed to give advice to people who have
>>> today's data and want to know how to best publish it on the Web, not
>>> paint a picture of how the world of data may look in ten or twenty
>>> years' time. I think today's data is mostly not ready for that quantum
>>> jump. Not catering for today's needs means this group will be writing
>>> science fiction. That can be entertaining but is maybe not so useful.
>>> 
>>> I also agree with his WordPerfect argument. Publishers should be
>>> encouraged to publish original data in whatever format they have it.
>>> In addition, the advice should be to provide the data also in
>>> additional and higher-starred formats to make it more useful.
>>> 
>>> Annette seems to suggest that "documents" are out of scope. I think
>>> the outcome of earlier discussions was that the definition of "data"
>>> is very broad and includes all kinds of digital resources on the Web.
>>> As an example, all the stuff on http://www.legislation.gov.uk/ is text
>>> and all of it is on the Web; it has the whole range of issues: formats
>>> (PDF, HTML, XML), identification, versioning, archiving, metadata,
>>> multilingualism, granularity etc. etc. Let's not throw that out.
>>> 
>>> Makx.
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Manuel.CARRASCO-BENITEZ@ec.europa.eu
>>>> [mailto:Manuel.CARRASCO-BENITEZ@ec.europa.eu]
>>>> Sent: 13 August 2015 11:19
>>>> To: mark.harrison@gs1.org; amgreiner@lbl.gov
>>>> Cc: phila@w3.org; mark.harrison@cantab.net; public-dwbp-wg@w3.org
>>>> Subject: RE: Use machine-readable standardized data formats / Use
>>>> non-proprietary data formats
>>>> 
>>>> Mark,
>>>> 
>>>> Data is a hard problem and this is aiming quite high:
>>>>
>>>> "... the web as an electronic delivery mechanism for structured data
>>>> in open formats ..."
>>>>
>>>> Other groups address visualisation, etc.
>>>>
>>>> We are the miller group with the objective to produce standardised
>>>> flour: not over-glamorous, but necessary. Other groups are for bakery,
>>>> pastry, etc. :-)
>>>> 
>>>> Regards
>>>> Tomas
>>>> 
>>>> ________________________________________
>>>> From: Mark Harrison [mark.harrison@gs1.org]
>>>> Sent: 13 August 2015 07:51
>>>> To: Annette Greiner; CARRASCO BENITEZ Manuel (DGT)
>>>> Cc: phila@w3.org; Mark Harrison; public-dwbp-wg@w3.org
>>>> Subject: Re: Use machine-readable standardized data formats / Use
>>>> non-proprietary data formats
>>>> 
>>>> Hi Annette,
>>>> 
>>>> I completely agree with you that the discussion should be about how
>>>> to encourage people to move beyond / away from publishing static
>>>> immutable documents and towards publishing live (data + models +
>>>> interactive visualisations) on the web that are open, interactive and
>>>> collaborative and make it as easy as possible for people and machines
>>>> to retrieve, combine, compare, re-analyse and re-visualise data from
>>>> multiple sources just as easily as people can use web technology to
>>>> collaborate on open source software today.
>>>> 
>>>> If our focus appears to be primarily on the web as an electronic
>>>> delivery mechanism for structured data in open formats, we're
>>>> probably aiming far too low and not giving people enough of a bold
>>>> vision about what live, interactive, collaborative, mashable data on
>>>> the web could be like in the future.
>>>> 
>>>> There are already some sites such as openspending.org that are making
>>>> good progress in that direction.  There are also toolkits and
>>>> frameworks such as d3.js that make this vision easier to achieve.  We
>>>> can probably find and critique other examples and comment on the
>>>> aspects that they do well, as well as aspects where they could improve
>>>> further.  In this way, we can explain the big vision for what 'data on
>>>> the web' really could be, if done well.
>>>> 
>>>> As Erik says, it needs to be webby.  That could mean that the raw
>>>> data and the data transformations and visualisation are all fully
>>>> interlinked on the web in the finest detail, potentially down to the
>>>> granularity of each individual datapoint.  Furthermore, if we want to
>>>> find related datasets for comparison, we should be able to easily
>>>> retrieve those and overlay them within the same live visualisation - or
>>>> even try modelling or visualising the data in different ways, all
>>>> interactively and collaboratively on the web.
>>>> 
>>>> Even with 5-star linked open data, we can link to existing data but
>>>> cannot immediately link to future data that has not yet been
>>>> generated - so instead we also need to provide rich metadata that
>>>> describes the scope, coverage and granularity of the data well.  In
>>>> future, we might expect that web search engines can not only help us to
>>>> retrieve datasets and their metadata - but allow us to tweak any of the
>>>> metadata parameters in order to search for related datasets, e.g. to
>>>> find similar economic data about a different country or different
>>>> organisation - or to find related scientific data for a related
>>>> material - or for the same material studied using a different but
>>>> related experimental technique, so that we can compare the data
>>>> easily, without having to spend so much effort tracking down the
>>>> data, reverse-engineering charts and graphs to extract data, etc.
>>>> 
>>>> To some extent, web technology already exists to enable the whole
>>>> Data Model, View and Controller to all be entirely web-based,
>>>> resulting in a live, interactive, collaborative space for data sharing
>>>> and analysis, which has so many advantages over static published
>>>> documents.  My reference to D3.js was one example of such technology.
>>>> I think it's a good thing to point people to multiple toolkits and
>>>> frameworks that they can already use to implement the bold vision of
>>>> truly collaborative, interactive data on the web.
>>>> 
>>>> I think we would miss a great opportunity if this group cannot
>>>> clearly explain to everyone (including any member of the public) what
>>>> that bold vision for 'data on the web' could be like.  It could go far
>>>> beyond providing datasets via the web.
>>>> 
>>>> Some people may take the time to read rather dry documents of best
>>>> practices and might even understand some of them.  Others may
>>>> understand the vision better if we can point to existing real
>>>> examples of 'data on the web done very well' and explain which aspects
>>>> they currently do very well - and what they could do even better.  The
>>>> 'gold standard' is probably a blend of the best aspects of several
>>>> existing examples.
>>>> 
>>>> When everyone can understand how data that is truly live on the web
>>>> has the potential to greatly increase the efficiency of research and
>>>> data analysis and generation of new insights in so many different
>>>> fields, then the best practices documents from this group become a
>>>> highly relevant and practical step-by-step instruction manual to help
>>>> everyone achieve that vision.
>>>> 
>>>> Best wishes,
>>>> 
>>>> - Mark
>>>> 
>>>> ________________________________________
>>>> From: Annette Greiner <amgreiner@lbl.gov>
>>>> Sent: 12 August 2015 18:31
>>>> To: Manuel.CARRASCO-BENITEZ@ec.europa.eu
>>>> Cc: phila@w3.org; Mark Harrison; public-dwbp-wg@w3.org
>>>> Subject: Re: Use machine-readable standardized data formats / Use
>>>> non-proprietary data formats
>>>> 
>>>> You're not seriously suggesting people should make data available in
>>>> WordPerfect format, are you?
>>>> This discussion seems to be wandering into the realm of publishing
>>>> documents.
>>>> 
>>>> --
>>>> Annette Greiner
>>>> NERSC Data and Analytics Services
>>>> Lawrence Berkeley National Laboratory
>>>> 510-495-2935
>>>> 
>>>> On Aug 12, 2015, at 7:28 AM, Manuel.CARRASCO-BENITEZ@ec.europa.eu
>>>> wrote:
>>>> 
>>>>> One should have at least the following variants of the resource:
>>>>> 
>>>>> - Original     : foo.wp  - WordPerfect 3.0 ~1982, perhaps still processable
>>>>> - Content      : foo.txt - textual, hopefully processable in 100 years
>>>>> - Presentation : foo.tif - TIFF ~1986, perhaps still viewable, might be foo.ps
>>>>>
>>>>> So:
>>>>> - http://example.com/foo     - negotiate and give me the best
>>>>> - http://example.com/foo.wp  - I can still process WP
>>>>> - http://example.com/foo.txt - I want to process the text, no presentation
>>>>> - http://example.com/foo.tif - I really want to see how the doc looks
>>>>> 
>>>>> Regards
>>>>> Tomas
>>>>> 
>>>>>> Perhaps the way we can formulate this is to say that some document
>>>>>> formats (such as PDF, .doc / .docx and even .xls / .xlsx ) are
>>>>>> concerned with presentation of information in a particular format
>>>>>> or layout and therefore carry a significant amount of typesetting /
>>>>>> formatting information overhead in addition to the underlying data.
>>>>>> Furthermore, at the time those document-centric formats were
>>>>>> developed, ease of access to the underlying data and the
>>>>>> unambiguous meaning of specific data fields might not have been the
>>>>>> main priority in their design.
>>>>>> 
>>>>>> When the main priority is to ensure that the underlying data is
>>>>>> available on the web so that others can re-use it, we recommend
>>>>>> using simpler data formats such as CSV, TSV, JSON (or better still
>>>>>> JSON-LD), RDF or XML.
>>>>> 
Received on Thursday, 13 August 2015 20:13:04 UTC