Re: [dwbp] BP Document ready to be reviewed from Christophe Guéret on 2015-02-18 (public-dwbp-wg@w3.org from February 2015)

From: Christophe Guéret <christophe.gueret@dans.knaw.nl>
Date: Wed, 18 Feb 2015 12:56:58 +0100
To: Newton Calegari <newton@nic.br>
CC: DWBP Public List <public-dwbp-wg@w3.org>
Message-ID: <CABP9CAFtcRSQgBPdV-3L_hmnOzepmYgHtr5KeD61WVk4Gb_xtA@mail.gmail.com>
Thanks Newton & the team !

After reading through the whole document I think there are still two big
issues with it:

* We speak of data "on the Web" but do not give any concrete example of it
leaving some doubt about whether putting a CSV on a web site is "data on
the Web" or if the document is more about LD-like work, or both. In Section
8.3, saying "Source formats, such as database dumps or spreadsheets, used
to generate the final published format, are out of scope" suggests that we
care about the latter, whereas "However, when data is distributed across
multiple files" in Section 8.6 relates more to the publish-as-files
paradigm. There are also several points raised about publishing additional
documentation as HTML documents...
We could probably fix that by saying we aim at using the Web as a platform
first, and publishing dumps when necessary/applicable (e.g. DBpedia-like
with dumps on the project page and LD access). All that together with HTML
documentation. Adding some concrete (fictive) examples could also help make
it clear what the document is about.

* It is not said if the best practices can also apply to closed data, and
how. The part about privacy deals with exposing potentially sensitive data
as open data but does not provide guidance on how a company could implement
the DWBP internally to manage its core business data.
For this I'd suggest highlighting that using the Web as a data publishing
platform allows for the use of long-established strategies for having
closed Web systems.


Here are more specific points :

* "Data consumers (who may also be producers themselves) want to be able to
find and use data, especially if it is accurate, regularly updated and
guaranteed to be available at all times." : digital archives are typically
seen as an integral part of the "all time availability". Data publishers are
typically not in a position to do that, unless they have an archive of
their own of course, and may rely on external parties to help them. We
could have some BPs to foster such internal or external collaboration.

* BP5 : "Where an international format specification exists, e.g., ISO 8601
for dates and times, use it." also applies to the data itself. Shall we
extract that point and make it a more general BP on its own ?
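
As an illustration of applying this to the data itself, here is a minimal
Python sketch. The helper name, the input notation and the assumption that
the source value is in UTC are all made up for the example; they are not
taken from the BP document.

```python
from datetime import datetime, timezone


def to_iso8601(raw: str, fmt: str = "%d/%m/%Y %H:%M") -> str:
    """Parse a date written in a local notation and return it as ISO 8601.

    Assumption: the source value is expressed in UTC.
    """
    parsed = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
    return parsed.isoformat()


print(to_iso8601("18/02/2015 12:56"))
```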

* BP6 : "the presence of an RDF predicate" could be changed into "the
presence of license-related vocabulary" to avoid letting the reader think
this only applies to data published in RDF. For data published as Zip or
Tar files (as suggested later in the doc) it could also be suggested to
test for a license embedded as a separate document in the package.
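
Such a test could look like the minimal sketch below. The set of accepted
file names and the function names are assumptions for illustration; the BP
document does not prescribe any naming convention.

```python
import io
import tarfile
import zipfile

# Assumed naming convention for a license document inside a package.
LICENSE_NAMES = {"license", "license.txt", "license.md", "copying"}


def _has_license(member_names) -> bool:
    # Compare only the last path component, case-insensitively.
    return any(name.rsplit("/", 1)[-1].lower() in LICENSE_NAMES
               for name in member_names)


def package_has_license(fileobj) -> bool:
    """Return True if a Zip or Tar package contains a license document.

    `fileobj` must be a seekable binary file object. A file that is neither
    Zip nor Tar makes tarfile raise tarfile.ReadError.
    """
    if zipfile.is_zipfile(fileobj):
        fileobj.seek(0)
        with zipfile.ZipFile(fileobj) as zf:
            return _has_license(zf.namelist())
    fileobj.seek(0)
    with tarfile.open(fileobj=fileobj) as tf:
        return _has_license(tf.getnames())
```

The same check works for a package on disk by passing open(path, "rb").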

* BP7 : "Data provenance is metadata that corresponds to data. Data
provenance relies upon existing vocabularies that make provenance easily
identifiable such as the Provenance Ontology [PROV-O
<http://w3c.github.io/dwbp/bp.html#bib-PROV-O>]." is more an implementation
suggestion than a "Why"

* BP11 : "Avoid broken URIs. In the event that a resource has been modified
or deleted, those changes must be communicated using the appropriate
response code [RFC7231 <http://w3c.github.io/dwbp/bp.html#bib-RFC7231>]. If
the resource has changed location, HTTP 3XX codes should be used, whereas
if the resource has been deleted a HTTP 410 code should be used." => This
was tackled in one of the preservation BPs, together with other codes to
relate "live" descriptions to their preserved historical counterparts. If we
decide that keeping URIs in a good state and giving people access to
historical versions is in scope, we may want to look at the potential overlap
between this point and the preservation BPs.
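
For reference, the behaviour quoted above can be sketched as follows. The
paths and the lookup tables are hypothetical; a real server would take them
from its own catalogue, and 301 is only one of the possible 3XX codes.

```python
from http import HTTPStatus

# Hypothetical bookkeeping of what happened to each URI path.
MOVED = {"/dataset/old": "/dataset/new"}
DELETED = {"/dataset/withdrawn"}
LIVE = {"/dataset/new"}


def resolve(path: str):
    """Return (status, headers) for a request path."""
    if path in MOVED:
        # The resource changed location: answer with a 3XX and the new URI.
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED[path]}
    if path in DELETED:
        # The resource was deliberately removed: 410 rather than 404.
        return HTTPStatus.GONE, {}
    if path in LIVE:
        return HTTPStatus.OK, {}
    return HTTPStatus.NOT_FOUND, {}


for p in ("/dataset/old", "/dataset/withdrawn", "/dataset/new"):
    status, headers = resolve(p)
    print(p, int(status), headers)
```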

* Section 8.3 : "Source formats, such as database dumps or spreadsheets,
used to generate the final published format, are out of scope" => It is the
first, and only, time "source formats" are mentioned and it is not clearly
said what they are a source for. This can be confusing and could be removed.

* BP12 : "CSV, NetCDF, XML, JSON and RDF" => RDF does not belong to this
list as it is a data model. This part could be changed into "CSV, NetCDF,
XML, JSON and RDF/XML." or even "CSV, NetCDF, XML, JSON, SDF and RDF/XML."
to also have a link to http://en.wikipedia.org/wiki/Simple_Data_Format . We
could also decide to use the term "serialization" instead of "format" to
decouple the data model from the way it is persisted.
Besides, from what Wikipedia just taught me about NetCDF I would not put
too much trust in it as a data publisher. It is a binary serialisation of
data whose access depends mostly on the software developed by a single
entity. This sounds just as open and trustworthy as .xls files (not
".xlsx", that's a different discussion).

* BP14 : "Check that the complete dataset is available in more than one
data format." => What is a "complete" dataset ?

* BP15 : the implementation is about machine-readable metadata and thus
differs from the intended outcome of having human-readable metadata. The
test also indicates "understanding" whereas the intended outcome aims at
"reading"; understanding a document can be a whole lot more challenging
than reading it ;-)

* Section 8.6 : "data is distributed across multiple files" => As discussed
before, is it in our scope ?

* BP22 : "Hosting an API such as a REST or SOAP service" would better read
as "Hosting an API compliant with common design principles and protocols
(e.g. REST, SOAP, OAI-PMH, ...)". But I actually doubt SOAP is really
something we want to keep listed in this document. BP23 actually confirms
that REST is indeed what should be used. Maybe also a good idea to point to
LDP there.

* BP24 : "real-time means a range from milliseconds to a few seconds after
the data creation" => do we really want to give such a precise definition
of real-time in this document ? We could rather say that "real-time"
relates here to data which is not created on a specific schedule. I guess
that would be enough for the point being made in the BP.

* BP26 : Maybe move that BP together with BPs for data versioning ?
"example.org" is also a better dummy domain name than "myapi.org"


There are also a few typos and minor edits that I will suggest via a Git
pull request.

As I will not be able to join for the vote I would also like to indicate my
support for getting this first version out.
I don't know if that can be counted as a valid "+1" but I think the
document is ready for getting feedback.

Thanks everyone for the great work! :)

Cheers,
Christophe

On 16 February 2015 at 19:29, Newton Calegari <newton@nic.br> wrote:

> Hello all,
>
> As was agreed in our last meeting, we (editors) would finish some changes
> on the document [1] and “freeze” it to let the group review during this
> week.
>
> So, Bernadette, Carol and I have made some changes considering Phil and
> Annette’s suggestions [2], and for now, we’re done with the modifications.
>
> Cheers,
> Bernadette, Carol and Newton
>
> [1] http://w3c.github.io/dwbp/bp.html
> [2] https://lists.w3.org/Archives/Public/public-dwbp-wg/2015Feb/0084.html
>



-- 
Researcher
+31(0)6 14576494
christophe.gueret@dans.knaw.nl

*Data Archiving and Networked Services (DANS)*

DANS promotes sustainable access to digital research data. See
www.dans.knaw.nl for more information. DANS is an institute of KNAW and
NWO.


Please note, as of 1 January we have a new address:

DANS | Anna van Saksenlaan 51 | 2593 HW Den Haag | Postbus 93067 | 2509 AB
Den Haag | +31 70 349 44 50 | info@dans.knaw.nl |
www.dans.knaw.nl


*Let's build a World Wide Semantic Web!*
http://worldwidesemanticweb.org/

*e-Humanities Group (KNAW)*
http://www.ehumanities.nl/
Received on Wednesday, 18 February 2015 11:57:47 UTC