General feedback on the document

From: Christophe Guéret <christophe.gueret@dans.knaw.nl>
Date: Fri, 6 Mar 2015 14:22:40 +0100
Message-ID: <CABP9CAHZ9e3OgdQQT-X5EZR_6z6=ZFA1w7X7UMksQRAzb3LUoQ@mail.gmail.com>
To: <public-dwbp-comments@w3.org>
CC: <jtennis@uw.edu>, <smiragli@uwm.edu>, Aida Slavic <aida.slavic@udcc.org>, Almila Akdag Salah <alelma@gmail.com>, Albert Meroño Peñuela <albert.meronyo@gmail.com>, <toby.burrows@uwa.edu.au>, <valentine.charles@europeana.eu>, Henk van den Berg <henk.van.den.berg@dans.knaw.nl>, <kzervanou@yahoo.co.uk>, Rob Koopman <Rob.Koopman@oclc.org>, Windhouwer Menzo <menzo@windhouwer.nl>, Shenghui Wang <shenghui.wang@gmail.com>, Andrea Scharnhorst <andrea.scharnhorst@dans.knaw.nl>, <cristina.bucur@student.vu.nl>
Dear DWBP group,

Yesterday and the day before I was sitting next to KOS experts for this

We used my speaking slot to have a look at
http://www.w3.org/TR/2015/WD-dwbp-20150224/ and provide some comments.
These comments are, hopefully faithfully, reproduced in what follows.
Everyone who attended the event is also CCed on this mail and may jump in
to correct things when needed, or to comment further.

# Overall points
The document concerns data publishers more than it concerns consumers. This
also seems to be reflected in the composition of editors/contributors;
there should be more data consumers jumping in and adding BPs that matter
to them.
"Data must be available in machine readable" -> this should only be a
"should"; "must" is way too strong. Some data consumers may want access to
data that is not machine readable (e.g. a scanned old document) rather than
being restricted to its machine-translated counterpart (e.g. an OCRed old
document).

# Data vocabularies
Issue 9 : we should stick to using "vocabularies"
Issue 10 : we should aim at being generic
BP 19: there is a problem in advocating for simplicity, as this can prevent
publishers from having rich vocabularies. It could instead be suggested that
publishers may provide vocabularies as rich as needed but strive to base
them on "simpler" ones (e.g. core ontologies / upper ontologies / ... ) to
ensure there is always a minimum level of understanding. See, e.g.
http://arxiv.org/abs/1304.5743 for a discussion about this.

# Preservation
There are existing guidelines about the process of preservation itself.
Those could be cited to guide people on how to do preservation. There are
also many repositories that exist to preserve data at different levels
(institution, national, ...).
There should be something there! In terms of BPs, the following points
should be addressed:
* As a data publisher, do you want to, or have to, preserve your data ?
* If yes, what to preserve ?
* Who to give it to ? Only to one archive or several ? One could be
mandated to do preservation whatever its quality as an archive is. There are
existing certifications (DSA, etc.) that can be used to help publishers make
informed choices about who to trust.
* Think about the level of access for the preserved copy (public, private,
* The type of data matters for preservation. Publishers need to be aware of
that. It is also important to think about preserving with context, and thus
to push not only a dataset alone but also to preserve the resources that are
needed to make sense of it (documentation, schemas, ...)

# Feedback
This section should also relate to preservation. One way to do it is to
list stakeholders around preservation (see RDA for an impression).
BP: there should be identifiers to give feedback on a specific part of the
BP: Use feedback as data enrichment, e.g. crowd annotation

# Metadata
Need to say where the taxonomy comes from. The document speaks about 3
types instead of the 5 commonly observed. The two missing ones are
preservation metadata (how, where, ...) and technical metadata (EXIF,...)
BP: Use standard terms but then make extensions public when they are needed

# Data quality
Does this apply to data or metadata ?
There are a lot of granularity aspects in data that need to be taken into
account.
How do you define quality ?
Completeness of the data is not related to quality. There should be an
element of comparison to check the completeness against something (e.g.
"data is complete according to EDM")
There should be something about quality vs. usability, partly because
fitting data into quality standards can lead to losing important data
(mainly everything that does not fit).


