Re: [BP - PRE] Data preservation already taken care of ?

Dear Tomas,

I agree with you that data preservation should be out of scope for this
working group. We should not aim, IMHO, at providing guidance on monitoring
bit rot in files, deciding whether or not to apply the LOCKSS principle, or
re-serialising data when a format becomes deprecated. These considerations
apply to many more types of data than structured data on the Web and, as
you rightly point out, there is already a large body of experts and
expertise on the topic.

Nonetheless, we have to provide guidance for putting data on the Web and
thus have to tell users what to do when they want to take that data
off-line. This could happen for several reasons, ranging from the end of a
project's funding period to a new release of a data set. Saying that every
URI should be cool and stay alive for eternity will not hold in real life.
URIs will change and go 404; it is best if we can advise people on how to
handle that.

To take a concrete example I was involved in: what happens at the end of
an FP7 project that published RDF converted from third-party sources?
There will be no funding, and thus no time, to keep that RDF data in sync
with the original source any more. Some money could be set aside to keep
the project's web site up, and eventually a triple store and a
de-referencing interface too, for some time, but not for, say, 50 years.
The best chance for this data is that either the original owner takes it
back and publishes and updates it themselves, or a follow-up project needs
it (and does not want to redo the conversion). Otherwise, in most cases
you end up, at best, with an abandoned data set served for a couple of
years from an endpoint nobody maintains.

If we produced a BP that says, at least, "MUST: don't let data on the Web
die a slow death in some lost corner of the Web; send a dump to an archive
and take your stuff off-line" - and if people followed it, of course - we
would already have achieved a lot for a better Web of Data. An archive
receiving such a dump would then have to monitor bit rot, make duplicates,
assign UUIDs and track file formats, but we both agree this is a different
matter, outside the scope of the WG.

Actually, preserving a dump may not be the optimal solution to recommend,
and I am looking forward to discussing that in more detail with everyone.
There are plenty of interesting questions popping up from this "on the
Web" aspect of the data:
* There is the data itself (RDF, graphs, matrices) and the representation
of the data (HTML+CSS, JSON, JSON-LD, ...). What do we want to preserve?
One of the two, or both?
* Do we need to create a http://webdata.archive.org/ to preserve Web data
the way http://web.archive.org preserves Web documents? E.g. if a URI goes
404, fetch a preserved, timestamped snapshot from that service. Such a
service would save data publishers from having to send dumps anywhere.
* If, to preserve the Web of Data, we focus on preserving the Web
representation in HTML, how do we deal with HTML that contains no
RDFa/microdata from which the data could be recovered? Is that a big deal?
Also, what happens to all the Web data that is used only from within a
SPARQL endpoint and has no de-referenceable URIs?
* If some reasoning was involved, what do you preserve? The source
dataset(s), only the inferred triples, all the triples (source +
inferred), or the source dataset(s) plus the reasoner software?
* If the Web data was generated from "legacy data", what should be
preserved? The generated data, the script that generated it, the source
data, or everything?
* etc.
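To make the snapshot idea above concrete, here is a minimal sketch of how a
client could ask such an archive for the snapshot of a dead URI closest to a
given date. The http://webdata.archive.org/ endpoint is hypothetical (it
does not exist); the request shape follows the Memento pattern (a TimeGate
URL plus an Accept-Datetime header), not any API such a service actually
exposes:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate endpoint -- webdata.archive.org does not exist;
# the pattern mimics the Memento protocol used by web.archive.org.
TIMEGATE = "http://webdata.archive.org/timegate/"

def snapshot_request(uri, when):
    """Build the URL and headers that ask a Memento-style TimeGate
    for the snapshot of `uri` closest to the datetime `when`."""
    url = TIMEGATE + uri
    # Accept-Datetime must be an RFC 1123 date in GMT.
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    return url, headers

url, headers = snapshot_request(
    "http://example.org/dataset/42",
    datetime(2014, 5, 22, tzinfo=timezone.utc),
)
print(url)
print(headers["Accept-Datetime"])  # Thu, 22 May 2014 00:00:00 GMT
```

A Memento-capable archive would answer such a request with a redirect to
the timestamped snapshot, which is exactly the fallback a client needs
when the original URI has gone 404.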

That is all that comes to my mind for now in terms of questions, but we
can surely find more. These are, again IMHO, the kind of questions this BP
document on data preservation should tackle.

Regards,
Christophe


On 22 May 2014 15:42, Manuel.CARRASCO-BENITEZ@ec.europa.eu <
Manuel.CARRASCO-BENITEZ@ec.europa.eu> wrote:

>  Dear Mr. Guéret
>
>
>
> The WG must state its scope. If this WG wants to address data
> preservation, a first step must be to review previous work, in
> particular LTANS: it is not a trivial task.
>
>
>
> I wrote the wiki page of *Data preservation* as an assignment following a
> discussion during a meeting
>
> http://www.w3.org/2013/dwbp/wiki/Data_preservation
>
>
>
> This only addresses data (resource) preservation. URI preservation is a
> related but distinct topic
>
>   http://www.w3.org/Provider/Style/URI.html
>
>
>
> Regards
>
> Tomas
>
>
>
>
>
> *From:* Christophe Guéret [mailto:christophe.gueret@dans.knaw.nl]
> *Sent:* Wednesday, May 21, 2014 2:46 PM
> *To:* public-dwbp-wg@w3.org
> *Cc:* Phil Archer
> *Subject:* [BP - PRE] Data preservation already taken care of ?
>
>
>
> Hoi,
>
> It does not seem "PRE" has been claimed by any BP document yet, so let's
> use it for the "Data Preservation" one, if that is ok with everyone ;-)
>
> I was just about to create the wiki page to put some content in, but
> found that the page already exists:
> http://www.w3.org/2013/dwbp/wiki/Data_preservation
>
> It reads "Data preservation should be out of scope for the Data on the Web
> Best Practices Working Group (DWBP WG) as data preservation is a large
> complex field. DWBP should be about how to access data and not how to
> preserve it. The DWBP WG should use the work of other groups such as
> Long-Term Archive and Notary Services (ltans). The Long-Term Archive
> Service Requirements should illustrate the complexity of this field. A
> lighter read might be A System for Long-Term Document Preservation"
>
> Not that I fundamentally disagree with that, but I cannot remember such
> a final decision being taken by the group (I did miss many calls). Are we
> still looking at producing a BP for data preservation or not?
>
>
>
> BTW, to give you an update on that: at DANS we are busy implementing a
> system tailored to the preservation of Linked Data - Open or not - and
> our first prototype distinguishes between what belongs to generic data
> preservation (and is thus shared with PNGs, PDFs, etc.) and what is
> specific to Linked Data. It seems to me that although the former is
> (clearly?) out of scope for our group, we could still provide some
> interesting guidance on the latter.
>
> Cheers,
> Christophe
>
> --
>
> Researcher
> +31(0)6 14576494
> christophe.gueret@dans.knaw.nl
>
>
>
> *Data Archiving and Networked Services (DANS)*
>
> DANS promotes sustained access to digital research data. See
> www.dans.knaw.nl for more information. DANS is an institute of KNAW and
> NWO.
>
>
>
> Please note: as of 1 January we have a new address:
>
> DANS | Anna van Saksenlaan 51 | 2593 HW Den Haag | Postbus 93067 | 2509 AB
> Den Haag | +31 70 349 44 50 | info@dans.knaw.nl |
> www.dans.knaw.nl
>
>
>
> *Let's build a World Wide Semantic Web!*
> http://worldwidesemanticweb.org/
>
>
> *e-Humanities Group (KNAW)*
> [image: eHumanities] <http://www.ehumanities.nl/>
>



-- 
Researcher
+31(0)6 14576494
christophe.gueret@dans.knaw.nl

*Data Archiving and Networked Services (DANS)*

DANS promotes sustained access to digital research data. See
www.dans.knaw.nl for more information. DANS is an institute of KNAW and
NWO.


Please note: as of 1 January we have a new address:

DANS | Anna van Saksenlaan 51 | 2593 HW Den Haag | Postbus 93067 | 2509 AB
Den Haag | +31 70 349 44 50 | info@dans.knaw.nl |
www.dans.knaw.nl


*Let's build a World Wide Semantic Web!*
http://worldwidesemanticweb.org/

*e-Humanities Group (KNAW)*
[image: eHumanities] <http://www.ehumanities.nl/>

Received on Friday, 23 May 2014 07:53:59 UTC