reviewing the BP doc

Hi folks,
I gave the BP doc a read-through today and noticed a few things that I think we should address. Sorry this got rather long; it was really short when I started it! The document keeps getting better, and many of the new changes improve it further. Much of what follows is stuff I think we can address in a later version. I’ll put an asterisk at the beginning of each note that I think is important to consider before publishing the next version.

Provide metadata
How to Test: we say “access the same URL” to test machine readability. I think it’s fine for the machine-readable metadata to be separate, under a different URL.

Provide descriptive metadata
Possible approach to implementation: The list of metadata to be included is not an implementation, so that should be moved up and listed under intended outcome. Spatial coverage and temporal period are irrelevant for lots of datasets, so they should be marked “if relevant”. Keywords and themes/categories are dependent on the context of a catalog, so I think we should leave them out of this list.
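To illustrate what I mean by “if relevant” (a sketch of my own, with loosely DCAT-like field names that are not the BP’s normative list):

```python
# Sketch: descriptive metadata where spatial/temporal coverage
# appear only when relevant. Field names are illustrative.

def build_metadata(title, description, publisher,
                   spatial_coverage=None, temporal_period=None):
    """Return a metadata record; coverage fields appear only if given."""
    record = {
        "title": title,
        "description": description,
        "publisher": publisher,
    }
    if spatial_coverage is not None:   # "if relevant"
        record["spatial"] = spatial_coverage
    if temporal_period is not None:    # "if relevant"
        record["temporal"] = temporal_period
    return record

# A dataset with no meaningful spatial extent simply omits the field.
meta = build_metadata("Bus arrivals", "Observed arrival times",
                      "City Transit", temporal_period="2015-01/2015-06")
```

A dataset for which spatial coverage is meaningless just never sets the field, rather than carrying an empty or bogus value.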

Data Quality
The introduction says that quality “can affect the potentiality of the application that use data”. I don’t understand that phrase. The text under Evidence should use the same format as in other BPs.

* Data Versioning
The chart describes time series data, not versions of data. I would say that, if released independently, the items in yellow each represent a different dataset (they report different data points), not a different version. If you revised any of them, then the original and the revision would be different versions. I think by definition, versions attempt to report the same data.

Data Identification
The introductory text about URIs and URLs and IRIs is potentially confusing and not necessary for our audience to understand the BPs about identifiers. Also, URLs are for the internet, not just the web. I also disagree with the representation of DOIs as something that cannot be looked up, though that question is not something I think we should make readers think about anyway.
* I would like this section to limit itself to information that applies to publishing *data*. The BP is about assigning persistent identifiers to datasets, but the possible approach to implementation is about much more than that. The list items are also not consistent (one shows use of extensions, another says not to do that). I worry that this will open up a holy war about how to implement a REST API.

Use machine-readable standardized data formats
Possible approach to implementation: the first sentence is about choosing a format that your users will be able to parse. The second sentence mentions being nonproprietary. Proprietariness is a different issue and separate BP. We should say “commonly used standard formats include, but are not limited to, CSV, …”

Use standardized terms
This should read “Standardized terms should be used to provide metadata whenever they are available.” In scientific domains, there are often no standard terms available yet.

Other vocabulary-related BPs: I don’t think they are in scope. Vocabularies are not data.

Preserve people’s right to privacy
I agree with the concept completely, but we need to be careful about how we state things. Can data about famous people be published without their consent? What about public acts? I think the current intended outcome is too restrictive. Maybe “Data that can identify a private individual must not be published without their consent.”

Security can involve more than protecting privacy. I think we should have a separate BP for security.

We have a BP saying to use REST for APIs, but we don’t have one saying to make data available with an API. That strikes me as odd.

Provide real-time access
We say “where data is produced in real time” where I think we mean “where data changes in real time.” It seems like real-time data is always published by the producer. I can’t think of an example where a publisher would be waiting for a producer to give them real-time data.

Provide data up to date
* I think this needs editing. It’s difficult to understand the actual requirement. At times it sounds like we are saying all data should be published immediately, which is impractical for many publishers. I think the goal should be to adhere to a published schedule for updates.
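To sketch what adhering to a published schedule could mean operationally (my own illustration, not text from the doc; the function name and interval are assumptions):

```python
from datetime import datetime, timedelta

# Sketch: "up to date" as adherence to a published update schedule,
# not immediate publication. Names and dates are illustrative.

def is_overdue(last_published: datetime, update_interval: timedelta,
               now: datetime) -> bool:
    """True if the dataset has missed its published update interval."""
    return now - last_published > update_interval

# A dataset on a published monthly schedule, last updated 40 days ago,
# fails the check; one updated 10 days ago passes.
overdue = is_overdue(datetime(2015, 5, 10), timedelta(days=31),
                     datetime(2015, 6, 19))
```

The point is that consumers can judge freshness against the publisher’s stated commitment rather than against an impractical “publish immediately” standard.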

Assess dataset coverage
I disagree that a chunk of web data is by definition dependent on the rest of the global graph. I think a good dataset is sufficiently self-explanatory that it can be used fully without requiring some other pieces of the web to be present. One should avoid dependencies on external resources that are not expected to persist.

Use a trusted serialization format for preserved data dumps
To the extent that this is in scope, it is covered under the BP about using standardized formats. We could add a note to that mentioning the value for preservation. I don’t think this needs to be a separate BP.

Update the status of identifiers
To the extent that this is in scope, it should be covered under versioning or unavailability. What are “preserved” datasets? Are they available on the web? If not, it is out of scope. If they are, then they are versions.

We say “blogs and other publicly available feedback should be displayed in a human-readable form through the user interface.” That suggests that publishers should re-publish blog content, which is probably not what we want (copyright issues, for one thing). Publishers of data can’t control the format of other people’s publications. 

Gather feedback from data consumers
Possible approach to implementation: registration is not feedback. I don’t think filling in a comment box is properly referred to as “blogging”.

Data enrichment
* Enrichment yields derived data, not just metadata. For example, you could take a dataset of scheduled and real bus arrival times and enrich it by adding on-time arrival percentages. The percentages are data, not metadata.
I don’t know the meaning of the word “topification”.
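The bus-arrival enrichment I mean can be sketched like this (the 5-minute on-time threshold and field layout are my own assumptions for illustration):

```python
# Sketch: enrichment producing derived *data*, not metadata.
# Compute an on-time arrival percentage from scheduled vs. actual
# arrival times (both in minutes). Threshold is an assumption.

def on_time_percentage(arrivals, threshold_minutes=5):
    """arrivals: list of (scheduled, actual) time pairs in minutes.
    Returns the share of arrivals within the threshold, as a percentage."""
    if not arrivals:
        return 0.0
    on_time = sum(1 for sched, actual in arrivals
                  if actual - sched <= threshold_minutes)
    return 100.0 * on_time / len(arrivals)

# Deltas here are 2, 8, 1, and 4 minutes: three of four are on time.
stats = on_time_percentage([(10, 12), (20, 28), (30, 31), (40, 44)])
```

The resulting percentage is a new data value derived from the dataset; calling it metadata would be a category error.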

I would still like to see mention of the value of enabling users to grab subsets of data.
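By grabbing subsets I mean something like the following (a sketch under my own assumptions; the field names and the idea of backing it with a query parameter such as ?route=12 are illustrative, not from the doc):

```python
# Sketch: serving a subset of a dataset, as might back a query
# parameter on a data API. Field names are illustrative.

def subset(rows, **filters):
    """Return only the rows whose fields match all given filter values."""
    return [row for row in rows
            if all(row.get(k) == v for k, v in filters.items())]

rows = [{"route": "12", "stop": "A"},
        {"route": "12", "stop": "B"},
        {"route": "7",  "stop": "A"}]

just_12 = subset(rows, route="12")  # two rows instead of the full dataset
```

Consumers who need only one route shouldn’t have to download and filter the whole dataset themselves.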

Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory

Received on Friday, 19 June 2015 02:04:09 UTC