Re: [dwbp] BP Document ready to be reviewed from Annette Greiner on 2015-02-20 (public-dwbp-wg@w3.org from February 2015)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Thu, 19 Feb 2015 18:33:02 -0800
To: Newton Calegari <newton@nic.br>
Cc: DWBP Public List <public-dwbp-wg@w3.org>
Message-Id: <2E167E2E-4A55-47CD-AAAE-377B5985FCF1@lbl.gov>
Great work, editors, on getting this into better shape! I have some more input, but I’m only suggesting small changes, or changes that we can make in the next version.

For the BP about providing data in multiple formats, I’d still like to add the word “consumer” in the Why section, so that it reads "Providing data in more than one format reduces *consumer* costs incurred in data transformation." I don't think it's controversial, and it addresses an issue someone other than me brought up.

Christophe had a good point about the Data Formats intro, regarding the sentence that goes "Source formats, such as database dumps or spreadsheets, used to generate the final published format, are out of scope." I believe the idea was to say that data sources that aren't published online are out of scope. I suggest changing that sentence to say "Source formats, such as database dumps or spreadsheets, that are never published online but are used to generate the published data, are out of scope." I'm fine with removing it, too, though.

Laufer pointed out the oddity of the statement under "Provide versioning information" and elsewhere that "This best practice is a specialization of the higher level Provide metadata for machines." That best practice applies to humans as well as machines, so I don't think that text should be there. The same problem applies to licensing, provenance, and data quality.

In general, I think we should state in one BP that metadata needs to be provided in both human-readable and machine-readable form, and leave that out of the other metadata BPs. Repeating it for each one feels pedantic and makes the list less concise.
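
To illustrate what a single "provide metadata in both forms" BP would cover, here is a minimal Python sketch that renders one metadata record both ways. The field names, the sample values, and the two helper functions are all made up for illustration; nothing here comes from the BP document itself.

```python
import json

# Hypothetical metadata record; the keys and values are assumptions for illustration.
metadata = {
    "title": "Example dataset",
    "version": "1.2.0",
    "license": "Example license",
    "provenance": "Derived from an internal spreadsheet",
}

def to_machine_readable(meta):
    """Serialize the metadata as JSON so that software can parse it."""
    return json.dumps(meta, indent=2, sort_keys=True)

def to_human_readable(meta):
    """Render the same metadata as plain 'Label: value' lines for people."""
    return "\n".join(f"{key.capitalize()}: {value}" for key, value in meta.items())

machine_text = to_machine_readable(metadata)
human_text = to_human_readable(metadata)
print(machine_text)
print(human_text)
```

The point is only that versioning, licensing, provenance, and so on all come out of the same record, so the human/machine requirement needs to be stated once.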

Regarding Christophe's question of whether we are writing BPs for the publish-as-files paradigm or more for LD-like work, I think it is both. In fact, I think it also includes APIs that are not REST, though I do agree with having a BP that suggests people should use REST. I don't want our work to seem irrelevant for people who are not doing the LD thing.

Regarding open/closed data, I think we should write BPs that apply to both. I don't want our work to be irrelevant for those who must restrict access to their data. In the introduction, we should state that the BPs generally apply to both but mention that we favor open data whenever possible. Then, possibly in a future release, we should add a BP stating that publishers should make data open whenever possible.

Christophe suggested generalizing the bit in BP5 about using an international format specification if it exists. I support the idea of making a new BP that is about that, in a future version. I'm afraid that what I wrote conflates the idea of providing the locale parameters with that of selecting a format.

Now that I look at it, I disagree with the wording of BP3, "Metadata should be provided using standard vocabularies" and "Metadata is best provided using RDF". Much scientific data has no standard vocabulary and does not lend itself readily to RDF vocabularies. I would not currently recommend that a research team spend time turning all its metadata into RDF triples to satisfy such a requirement. I would be okay with saying "Metadata should be provided using standard vocabularies *where they are available*" and, in the possible approach to implementation, "Metadata can be provided using RDF vocabularies". 

Christophe also rightly points out that RDF is different from the other data formats in BP12, though it certainly is a format for data, albeit data about data. I don't see what changing it to RDF/XML improves, since we already list XML. I don't feel all that strongly about listing it here, though. As for SDF, IMHO it is not common enough to be a recommended format; it's really just a combination of CSV and JSON, and it apparently is now called Tabular Data Package. NetCDF is on the very short list of commonly used scientific computing formats. It's actually based on HDF5 now, which is open source, so maybe it would make more sense to cite HDF5. If we want to appear considerate to scientific computing users, we should include one or the other. My vote is for HDF5.

Christophe asked what is meant by "the complete dataset" in BP 14. The idea is to make sure that no data or metadata that is provided in one format is missing from the others.
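
As a rough illustration of that check, here is a minimal Python sketch that compares two distributions of the same dataset and reports fields that one carries and the other lacks. The sample CSV and JSON content, the "station" key, and the helper names are all hypothetical; a real check would also need to cover metadata and type differences.

```python
import csv
import io
import json

# Hypothetical sample data: the same dataset published as CSV and as JSON.
CSV_TEXT = "station,temp_c\nA,12.5\nB,9.0\n"
JSON_TEXT = '[{"station": "A", "temp_c": "12.5"}, {"station": "B"}]'

def fields_by_record(records, key):
    """Map each record's key value to the set of non-empty fields it carries."""
    return {
        rec[key]: {name for name, value in rec.items() if value not in (None, "")}
        for rec in records
    }

def missing_from(reference, other):
    """Return {record key: fields present in reference but absent in other}."""
    gaps = {}
    for rec_key, fields in reference.items():
        lost = fields - other.get(rec_key, set())
        if lost:
            gaps[rec_key] = lost
    return gaps

csv_fields = fields_by_record(csv.DictReader(io.StringIO(CSV_TEXT)), "station")
json_fields = fields_by_record(json.loads(JSON_TEXT), "station")

gaps_in_json = missing_from(csv_fields, json_fields)  # in the CSV, not in the JSON
gaps_in_csv = missing_from(json_fields, csv_fields)   # in the JSON, not in the CSV
print(gaps_in_json)
print(gaps_in_csv)
```

In this made-up sample the JSON version drops the temperature for station B, so the dataset would not count as "complete" in every format.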

-Annette

--
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
510-495-2935

On Feb 16, 2015, at 10:29 AM, Newton Calegari <newton@nic.br> wrote:

> Hello all,
> 
> As was agreed in our last meeting, we (editors) would finish some changes on the document [1] and “freeze” it to let the group review during this week.
> 
> So, Bernadette, Carol and I have made some changes considering Phil and Annette’s suggestions [2], and for now, we’re done with the modifications. 
> 
> Cheers,
> Bernadette, Carol and Newton
> 
> [1] http://w3c.github.io/dwbp/bp.html
> [2] https://lists.w3.org/Archives/Public/public-dwbp-wg/2015Feb/0084.html
Received on Friday, 20 February 2015 02:33:51 UTC