ISSUE-143: Is Data Preservation in the scope of the DWBP document? (Was "Re: My review of the DWBP 21st Jan editor's draft") from Christophe Guéret on 2015-02-05 (public-dwbp-wg@w3.org from February 2015)

From: Christophe Guéret <christophe.gueret@dans.knaw.nl>
Date: Thu, 5 Feb 2015 08:18:13 +0100
To: "contact@carlosiglesias.es" <contact@carlosiglesias.es>
CC: Christophe Gueret <christophe.gueret@dans.knaw.nl>, Public DWBP WG <public-dwbp-wg@w3.org>
Message-ID: <CABP9CAHuBwkBHpb90hFBGgOz4SsvUZV2xbos=7jxR613CTvmiw@mail.gmail.com>
Dear Carlos,

Thanks again for your comments. I've now changed the subject of the thread so
that the tracker can pick it up and append it to the related issue.

DATA PRESERVATION
>>> I feel quite uncomfortable with this section in general. I have some
>>> problems trying to understand the underlying principles for these BPs, but
>>> overall it looks to be about data archiving generally speaking instead
>>> of data persistence, which is indeed the best practice IMO and also
>>> coherent with other BPs in the document (such as versioning). In fact data
>>> archiving looks more like a bad practice to me than a best one.
>>>
>>
>> Thanks for having looked at these BPs. I think data persistence and data
>> preservation are two different issues and I can't agree that data archiving
>> is a bad practice. The bad practice is that people that want to take data
>> off-line for some reason, say the end of a project funding, just leave the
>> server running until it dies out or trash the data. In these cases sending
>> the data to an archive is a good and better practice. It is even
>> increasingly backed by funding agencies that ask funded projects to come up
>> with a data management plan that includes a section about what will happen
>> to the data at the end of the project. Web data should be no exception to
>> this (IMHO).
>>
>
> I'm sorry to keep disagreeing here, but (1) if that's the scenario we
> would like to cover I think it is not properly described in the document as
> it currently stands and (2) this still looks like a sort of least bad option,
> not a best practice, no?
>
I understand and agree with (1), and I could go along with the wording you use
for (2). We have to reach a consensus and produce a coherent document. If
this section on preservation does not fit into the rest of the story and/or
is not to the liking of the majority of the editors and contributors then we
should drop it.



>> So these BPs are here to help people decide how best to ship their
>> data to an archive when taking it off the Web. That's not to say these BPs
>> are the good ones, nor that this list is exhaustive, but I would very much
>> like us to keep a section about data preservation in the document and have
>> a discussion about its content.
>>
>
>
> Good, now I have a better understanding of the purpose, but still I think
> that's not a best practice. The best practice for data on the web should be
> just not letting data die and not moving it around IMO (especially if it
> will be offline or in a package where all links and references will be
> broken from then on).
>
That's true: because all the data on the web is interconnected and
the meaning of everything depends on the meaning of everything else,
nobody should delete or update anything, to avoid side effects. The problem
is that this will happen anyway, and this is why something is needed. For Web
documents we have the Web Archive, which proves to be useful at times. Some are
proposing that we should have a similar system for Web Data, having digital
preservation institutes go out on the Web and store everything. I don't know
whether we see a need for having access to historical descriptions of
entities, but if we do we should offer some BPs related to it.

This would then have to cover two aspects:
* How can consumers get access to past descriptions of entities which
are no longer associated with the identifier?
* How can publishers prepare their data in order to make it easier for
consumers to achieve this first goal?

Versioning is one solution to this. Offering dumps is another. Keeping
resources alive and linking them to historical dumps is another one. See
for instance what DBpedia is doing with Memento. URIs are cool and
preserved descriptions are accessible via the Memento gateway. They also
have dumps in recognized serialisation formats on their site. These are good
practices. They could also have gone for versioning but chose not to
put the DBpedia version number in the resource names, and that's fine too.
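
To illustrate the Memento approach mentioned above, here is a minimal sketch
of how a consumer could ask for a past description of an entity using
datetime negotiation as defined in RFC 7089 (the Accept-Datetime request
header). The TimeGate address and the example resource URI are assumptions
made for illustration only, and the request is built but not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate prefix, for illustration only; the address of the
# real gateway is not given in this thread.
TIMEGATE_PREFIX = "http://timegate.example.org/timegate/"


def build_memento_request(resource_uri: str, when: datetime) -> Request:
    """Build (but do not send) a Memento datetime-negotiation request."""
    # Accept-Datetime carries an HTTP date expressed in GMT.
    # Note: a naive datetime would be interpreted as local time here.
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return Request(TIMEGATE_PREFIX + resource_uri, headers=headers)


# Illustrative resource URI; any entity identifier could be used instead.
request = build_memento_request(
    "http://dbpedia.org/resource/Amsterdam",
    datetime(2014, 6, 1, tzinfo=timezone.utc),
)
print(request.full_url)
# urllib normalises header names with str.capitalize(), hence this spelling.
print(request.get_header("Accept-datetime"))
```

A TimeGate that supports the protocol would normally answer with a redirect
to the archived description closest to the requested date, so the publisher
can keep the resource name itself free of version numbers.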

Now, the whole part about monitoring for file format obsolescence,
preventing bitrot, keeping copies of dumps on different storage systems,
monitoring for quality at ingest time, making sure the storage of the data
is OAIS compliant, etc.: all of that is the job of a trusted digital
repository, not that of the data owner or the data consumer. So, as Tomas
argued several times, these aspects are surely *out* of scope.

I hope this helps us discuss the matter further. I can also propose that we
spend a significant amount of time on this specific issue at one of the
upcoming meetings to take a final decision about it.

Cheers,
Christophe






-- 
Researcher
+31(0)6 14576494
christophe.gueret@dans.knaw.nl

*Data Archiving and Networked Services (DANS)*

DANS promotes sustained access to digital research data. See
www.dans.knaw.nl for more information. DANS is an institute of KNAW and
NWO.


Please note: as of 1 January we have a new address:

DANS | Anna van Saksenlaan 51 | 2593 HW Den Haag | P.O. Box 93067 | 2509 AB
Den Haag | +31 70 349 44 50 | info@dans.knaw.nl |
www.dans.knaw.nl


*Let's build a World Wide Semantic Web!*
http://worldwidesemanticweb.org/

*e-Humanities Group (KNAW)*
http://www.ehumanities.nl/
Received on Thursday, 5 February 2015 07:19:04 UTC