RE: [BP - PRE] Data preservation already taken care of ? from Manuel.CARRASCO-BENITEZ@ec.europa.eu on 2014-05-23 (public-dwbp-wg@w3.org from May 2014)

From: <Manuel.CARRASCO-BENITEZ@ec.europa.eu>
Date: Fri, 23 May 2014 10:00:07 +0000
To: <christophe.gueret@dans.knaw.nl>
CC: <public-dwbp-wg@w3.org>, <phila@w3.org>
Message-ID: <39DB516E46C0E842A2CFFF1BBB7412F15F7B6D64@S-DC-ESTF03-B.net1.cec.eu.int>
Dear Christophe,


-        Data preservation should be out of scope

-        URI preservation and the points below should be in scope

-        Data off the web; i.e., standalone data that should be web friendly. For example, using the file scheme https://www.w3.org/2013/dwbp/wiki/Data_on_the_Web_URI_Best_Practices#Network_and_local_URI


-        Exporting/importing data from the web

-        Data packing. For example, http://joinup.ec.europa.eu/site/med


-        First choice: preserve everything in archives, and redirect to those archives. Archives, like data preservation, are out of scope.

-        Second best: preserve the original raw data and the programs that generate the other representations. Hence the packing techniques above.

-        Third: at least preserve the dataset's existence, even if the data has been deleted. So, one needs a register for data off the web. For example, all the URIs from the FP7 project below could be redirected to one record with the details of the data.
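
The third option could be sketched as a small register that maps every URI under a retired project's namespace to a single record describing the data. This is only an illustration: the namespace, the record URL and the function name `resolve_retired` are hypothetical, and answering with a 301 status is an assumption, not something agreed in this thread.

```python
from typing import Optional, Tuple

# Hypothetical register: retired URI namespace -> single record describing the data.
REGISTER = {
    "http://example.org/fp7-project/": "http://example.org/register/fp7-project",
}


def resolve_retired(uri: str) -> Optional[Tuple[int, str]]:
    """Return (status, location) for a URI under a retired namespace, else None."""
    for namespace, record in REGISTER.items():
        if uri.startswith(namespace):
            # Assumption: a permanent redirect to the one record for the dataset.
            return 301, record
    # The URI is not covered by the register.
    return None
```

With such a register, any former resource URI of the project, whatever its path, leads to the same record stating that the dataset existed and what happened to it.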

Regards
Tomas


From: Christophe Guéret [mailto:christophe.gueret@dans.knaw.nl]
Sent: Friday, May 23, 2014 9:53 AM
To: CARRASCO BENITEZ Manuel (DGT)
Cc: Christophe Gueret; public-dwbp-wg@w3.org; phila@w3.org
Subject: Re: [BP - PRE] Data preservation already taken care of ?

Dear Tomas,

I agree with you that data preservation should be out of scope for this working group. We should not aim, IMHO, at providing guidance on preventing bit rot in files, on whether or not to apply the LOCKSS principle, or on re-serialising data when some formats become deprecated. These considerations apply to many more types of data than structured data on the Web, and there is also a large body of experts and expertise on the topic, as you rightly point out.
Nonetheless, we have to provide guidance for putting data on the Web and thus have to tell users what to do when they want to take this data off-line. This could happen for several reasons, ranging from the end of a project's funding period to a new release of a data set. Saying that every URI should be cool and stay alive for eternity will not hold in real life. URIs will change and go 404, so it's best if we can advise people on how to handle that.

To take a more concrete example I was involved in: what happens at the end of an FP7 project that published RDF converted from third parties? There will be no funding, and thus no time, to keep that RDF data in sync with the original source any more. Some money could be put aside for keeping up the web site of the project, and possibly a triple store + de-referencing interface too, for some time but not for, say, 50 years. The best chance for this data is if either the original owner takes it back and publishes and updates it themselves, or a follow-up project needs it too (and doesn't want to redo the conversion). Otherwise, at best and in most cases, you end up with an abandoned data set served for a couple of years from an end point nobody maintains.

If we produced a BP that says, at least, "MUST: don't let data on the Web die a slow death in some lost corner of the Web, send a dump to an archive and take your stuff off-line" - and if people follow it, of course - we would already have achieved a lot for a better Web of data. Then, an archive receiving such a dump will have to monitor bit rot, make duplicates, assign UUIDs and monitor file formats, but we both agree this is a different matter outside the scope of the WG.
Actually, preserving a dump may not be the optimal solution we want to recommend, and I'm looking forward to discussing that in more detail with everyone. There are plenty of interesting questions popping up from this "on the Web" aspect of the data:
* There is the data (RDF, Graph, matrices) and the representation of the data (HTML+CSS, JSON, JSON-LD, ...). What do we want to preserve? One of the two or both of them?
* Do we need to create a http://webdata.archive.org/ to preserve Web data like http://web.archive.org preserves Web documents? E.g. if a URI goes 404, go get a preserved, timestamped snapshot from that service. Such a service would spare data publishers from having to send dumps anywhere.
* But if, for preserving the Web of Data, we focus on preserving the Web representation in HTML, how do we deal with HTML that does not contain RDFa/microdata that could give us the data back? Is it a big deal? Also, what happens to all the Web data that is used only from within a SPARQL end point and doesn't have dereferenceable URIs?
* If there was some reasoning involved, what do you preserve? The source dataset(s), only the inferred triples, all the triples (source + inferred), the source dataset(s) + the reasoner software?
* If the Web data has been generated from "legacy data", what should be preserved? The data generated, the script that generated it, the source data, everything?
etc
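
The "if a URI goes 404, go get a snapshot" idea from the list above could look roughly like the sketch below. Everything here is an assumption made for illustration: the archive host `webdata.archive.example` stands in for the service imagined above (which does not exist), the `<prefix>/<timestamp>/<original URI>` layout is invented, and `fetch` is a caller-supplied function so that no particular HTTP library is implied.

```python
from typing import Callable, Optional, Tuple

# Hypothetical archive endpoint standing in for the imagined Web data archive.
ARCHIVE_PREFIX = "http://webdata.archive.example/"

# A fetch function takes a URI and returns (HTTP status, body or None).
Fetch = Callable[[str], Tuple[int, Optional[bytes]]]


def snapshot_uri(uri: str, timestamp: str) -> str:
    """Build the (assumed) address of a timestamped snapshot of `uri`."""
    return f"{ARCHIVE_PREFIX}{timestamp}/{uri}"


def get_with_fallback(uri: str, timestamp: str, fetch: Fetch) -> Tuple[int, Optional[bytes]]:
    """Try the live URI first; on a 404, ask the archive for a snapshot."""
    status, body = fetch(uri)
    if status == 404:
        # The live resource is gone: fall back to the preserved copy.
        status, body = fetch(snapshot_uri(uri, timestamp))
    return status, body
```

A client built this way only consults the archive when the original publisher no longer answers, so publishers would not need to send dumps anywhere themselves.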

That's all that comes to my mind now in terms of questions, but we can surely find more. These are, again IMHO, the kinds of questions this BP document on data preservation should tackle.

Regards,
Christophe


On 22 May 2014 15:42, <Manuel.CARRASCO-BENITEZ@ec.europa.eu> wrote:
Dear Mr. Guéret

The WG must state its scope. If this WG wants to address data preservation, a first step must be to review previous work, in particular LTANS: this is not a trivial task.

I wrote the wiki page on data preservation as an assignment following a discussion during a meeting:
http://www.w3.org/2013/dwbp/wiki/Data_preservation


This only addresses data (resource) preservation. URI preservation is a related but distinct topic:
  http://www.w3.org/Provider/Style/URI.html


Regards
Tomas


From: Christophe Guéret [mailto:christophe.gueret@dans.knaw.nl]
Sent: Wednesday, May 21, 2014 2:46 PM
To: public-dwbp-wg@w3.org
Cc: Phil Archer
Subject: [BP - PRE] Data preservation already taken care of ?

Hi,

It does not seem "PRE" has been claimed by any BP document yet, so let's use it for the "Data Preservation" one, if that is ok with everyone ;-)

I was just about to create the wiki page to put some content in, but found out this page already existed: http://www.w3.org/2013/dwbp/wiki/Data_preservation

It reads "Data preservation should be out of scope for the Data on the Web Best Practices Working Group (DWBP WG) as data preservation is a large complex field. DWBP should be about how to access data and not how to preserve it. The DWBP WG should use the work of other groups such as Long-Term Archive and Notary Services (ltans). The Long-Term Archive Service Requirements should illustrate the complexity of this field. A lighter read might be A System for Long-Term Document Preservation"
Not that I fundamentally disagree with that, but I cannot remember such a final decision having been taken by the group (I did miss many calls). Are we still looking at producing a BP for data preservation or not?

BTW, to give you an update on that, we are busy at DANS implementing a system tailored for the preservation of Linked Data - Open or not - and our first prototype distinguishes between what belongs to the data preservation part (and is thus shared with PNGs, PDFs, etc.) and what is more specific to Linked Data. It seems to me that, though the former is (clearly?) out of scope for our group, we could still provide some interesting guidance for the latter.
Cheers,
Christophe
--
Researcher
+31(0)6 14576494
christophe.gueret@dans.knaw.nl

Data Archiving and Networked Services (DANS)

DANS promotes sustainable access to digital research data. See www.dans.knaw.nl for more information. DANS is an institute of KNAW and NWO.

Please note: as of 1 January we have a new address:

DANS | Anna van Saksenlaan 51 | 2593 HW Den Haag | Postbus 93067 | 2509 AB Den Haag | +31 70 349 44 50 | info@dans.knaw.nl | www.dans.knaw.nl

Let's build a World Wide Semantic Web!
http://worldwidesemanticweb.org/

e-Humanities Group (KNAW)
http://www.ehumanities.nl/



Received on Friday, 23 May 2014 10:00:45 UTC