- From: Christophe Guéret <christophe.gueret@dans.knaw.nl>
- Date: Wed, 18 Feb 2015 12:56:58 +0100
- To: Newton Calegari <newton@nic.br>
- CC: DWBP Public List <public-dwbp-wg@w3.org>
- Message-ID: <CABP9CAFtcRSQgBPdV-3L_hmnOzepmYgHtr5KeD61WVk4Gb_xtA@mail.gmail.com>
Thanks Newton & the team!

After reading through all the documents, I think there are still two big issues:

* We speak of data "on the Web" but do not give any concrete example of it, leaving some doubt about whether putting a CSV on a web site is "data on the Web", whether the document is more about LD-like work, or both. In Section 8.3, saying "Source formats, such as database dumps or spreadsheets, used to generate the final published format, are out of scope" suggests that we care about the latter, whereas "However, when data is distributed across multiple files" in Section 8.6 relates more to the publish-as-files paradigm. There are also several points raised about publishing additional documentation as HTML documents... We could probably fix that by saying we aim at using the Web as a platform first, and publishing dumps when necessary/applicable (e.g. DBpedia-like, with dumps on the project page and LD access), all of that together with HTML documentation. Adding some concrete (fictitious) examples could also help make it clear what the document is about.

* It is not said whether the best practices can also apply to closed data, and how. The part about privacy deals with exposing potentially sensitive data as open data but does not provide guidance on how a company could implement the DWBP internally to manage its core business data. For this I'd suggest highlighting that using the Web as a data publishing platform allows for the use of long-established strategies for building closed Web systems.

Here are more specific points:

* "Data consumers (who may also be producers themselves) want to be able to find and use data, especially if it is accurate, regularly updated and guaranteed to be available at all times.": digital archives are typically seen as an integral part of the "all time availability". Data publishers are typically not in a position to provide that, unless they have an archive of their own of course, and may rely on external parties to help them.
  We could have some BPs to foster such internal or external collaboration.

* BP5: "Where an international format specification exists, e.g., ISO 8601 for dates and times, use it." also applies to the data itself. Shall we extract that point and make it a more general BP of its own?

* BP6: "the presence of an RDF predicate" could be changed into "the presence of license-related vocabulary" to avoid letting the reader think this only applies to data published in RDF. For data published as Zip or Tar files (as suggested later in the doc), the test could also check for a license embedded as a separate document in the package.

* BP7: "Data provenance is metadata that corresponds to data. Data provenance relies upon existing vocabularies that make provenance easily identifiable such as the Provenance Ontology [PROV-O <http://w3c.github.io/dwbp/bp.html#bib-PROV-O>]." is more an implementation suggestion than a "Why".

* BP11: "Avoid broken URIs. In the event that a resource has been modified or deleted, those changes must be communicated using the appropriate response code [RFC7231 <http://w3c.github.io/dwbp/bp.html#bib-RFC7231>]. If the resource has changed location, HTTP 3XX codes should be used, whereas if the resource has been deleted a HTTP 410 code should be used." => This was tackled in one of the preservation BPs, together with other codes to relate "live" descriptions to their preserved historical counterparts. If we decide that keeping URIs in a good state and giving people access to historical versions is in scope, we may want to look at the potential overlap between this point and the preservation BPs.

* Section 8.3: "Source formats, such as database dumps or spreadsheets, used to generate the final published format, are out of scope" => This is the first, and only, time "source formats" are mentioned, and it is not clearly said what they are a source for. This can be confusing and could be removed.
* BP12: "CSV, NetCDF, XML, JSON and RDF" => RDF does not belong in this list as it is a data model. This part could be changed into "CSV, NetCDF, XML, JSON and RDF/XML." or even "CSV, NetCDF, XML, JSON, SDF and RDF/XML." to also have a link to http://en.wikipedia.org/wiki/Simple_Data_Format . We could also decide to use the term "serialization" instead of "format" to decouple the data model from the way it is persisted. Besides, from what Wikipedia just taught me about NetCDF, I would not put too much trust in it as a data publisher. It is a binary serialisation of data whose access depends mostly on the software developed by a single entity. This sounds about as open and trustworthy as .xls files (not ".xlsx", that's a different discussion).

* BP14: "Check that the complete dataset is available in more than one data format." => What is a "complete" dataset?

* BP15: the implementation is about machine-readable metadata and thus differs from the intended outcome of having human-readable metadata. The test also refers to "understanding" whereas the intended outcome aims at "reading"; understanding a document can be a whole lot more challenging than reading it ;-)

* Section 8.6: "data is distributed across multiple files" => As discussed before, is it in our scope?

* BP22: "Hosting an API such as a REST or SOAP service" would better read as "Hosting an API compliant with common design principles and protocols (e.g. REST, SOAP, OAI-PMH, ...)". But I actually doubt SOAP is really something we want to keep listed in this document. BP23 actually confirms that REST is indeed what should be used. It may also be a good idea to point to LDP there.

* BP24: "real-time means a range from milliseconds to a few seconds after the data creation" => Do we really want to give such a precise definition of real-time in this document? We could rather say that "real-time" relates here to data which is not created on a specific schedule.
  I guess that would be enough for the point being made in the BP.

* BP26: Maybe move that BP together with the BPs for data versioning? "example.org" is also a better dummy domain name than "myapi.org".

There are also a few typos and minor edits that I will suggest via a Git pull request.

As I will not be able to join for the vote, I would also like to indicate my support for getting this first version out. I don't know if that can be counted as a valid "+1", but I think the document is ready for getting feedback.

Thanks everyone for the great work! :)

Cheers,
Christophe

On 16 February 2015 at 19:29, Newton Calegari <newton@nic.br> wrote:

> Hello all,
>
> As was agreed in our last meeting, we (editors) would finish some changes
> on the document [1] and “freeze” it to let the group review during this
> week.
>
> So, Bernadette, Carol and I have made some changes considering Phil and
> Annette’s suggestions [2], and for now, we’re done with the modifications.
>
> Cheers,
> Bernadette, Carol and Newton
>
> [1] http://w3c.github.io/dwbp/bp.html
> [2] https://lists.w3.org/Archives/Public/public-dwbp-wg/2015Feb/0084.html
>

--
Researcher
+31(0)6 14576494
christophe.gueret@dans.knaw.nl

*Data Archiving and Networked Services (DANS)*
DANS promotes sustainable access to digital research data. See www.dans.knaw.nl for more information. DANS is an institute of KNAW and NWO.

Please note: as of 1 January we have a new address:
DANS | Anna van Saksenlaan 51 | 2593 HW Den Haag | Postbus 93067 | 2509 AB Den Haag | +31 70 349 44 50 | info@dans.knaw.nl | www.dans.knaw.nl

*Let's build a World Wide Semantic Web!*
http://worldwidesemanticweb.org/

*e-Humanities Group (KNAW)*
<http://www.ehumanities.nl/>
Received on Wednesday, 18 February 2015 11:57:47 UTC