rest of review of BP doc

Hi folks,
I've finished going through the entire BP document, and the following 
are my notes on the second half of the doc.
There is so much good stuff here, it is really getting to be a nice 
reference. I have a bunch of notes that I think we need to address for 
the next version, since doing so later will make life more difficult, 
especially for our hard-working editors. My apologies for not going 
through this at this level sooner.
Cheers,
-Annette


13. standardized formats
--

"machine-readable" is used differently here than in the metadata 
section. Technically, nothing on the web is not machine-readable. I 
think we could remove that phrase.

"adequate for its intended or potential use" doesn't really help in 
choosing. That's like saying "data on the web must be good."

The intended outcome should be more normative. "Data should be available 
in a standardized format that is easily parseable" belongs in the 
intended outcome. We could add that data should not be posted as an 
image unless the data itself encodes the image. (A jpeg file of a table 
is an image that encodes the data; RGB channel data from an imaging 
microscope is data that encodes an image.)

The example is metadata, not data. This BP is about formats for the data.


14. multiple formats
--
The example says John decided to use XML, but it shows ttl, and it shows 
metadata, not data.

The trend lately is toward doing a single format (JSON). Do we want to 
go against that trend? I note that the W3C's own API is JSON only.


Intro to Vocabularies
--

"possible subjects for books" is not a good example of a controlled 
vocabulary. There are two other good examples, so that could just be 
removed.

The first paragraph seems to be suggesting that controlled vocabularies 
enable easy translation, but it's confusingly phrased. The last three 
sentences could be changed to read "Standardized vocabularies can also 
serve to improve the usability of datasets. Say a dataset contains a 
reference to a concept described in a vocabulary that has been 
translated into several languages. Such a reference allows applications 
to localize their display of the data depending on the language of the 
user."

The last paragraph refers to "the former kind of vocabulary". It's not 
clear what kind that is. It's not clear what the point of that paragraph is.


15. standardized terms
--


16. Reuse vocabs
--

This still seems too similar to BP 15. See my separate email about this one.


17. right formalization
--

We say "a data publisher should opt for a level of formal semantics that 
fit data and applications." We don't want to tell people to use formal 
semantics, just to pick the right level of formalization. This should be 
changed to "a data publisher should opt for a level of semantic 
formalization that fits the data and its applications."

The intended outcome doesn't specifically address the issue of 
formalization.

The examples are about design of vocabularies, not about choosing them.

The test is hard to do. Finding and implementing an inference engine 
seems beyond the scope of publishing data.


Sensitive data intro
--

The discussion of sensitive data still needs a disclaimer, and the text 
should be more general rather than focused only on personal privacy.


18. data unavailability
--

"data unavailability reference" is awkard and unclear. Could we say 
"Provide an explanation for data that is not available."

We should address testing machine-readability by saying that a 
legitimate http response code in the 400 or 500 range should be returned.
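
Something along these lines is what I have in mind for that check (the 
URI is made up):

import urllib.error
import urllib.request

DATASET_URI = "https://data.example.org/dataset/withdrawn"  # hypothetical

try:
    status = urllib.request.urlopen(DATASET_URI).getcode()
except urllib.error.HTTPError as err:
    status = err.code  # urllib raises for 4xx/5xx, so catch and keep the code

# Pass if the server reports the unavailability with a proper status code.
assert 400 <= status < 600, "expected a 4xx or 5xx response for unavailable data"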


Data access intro
--

We say the web uses HTTP by default and then say that different 
approaches can be adopted, including bulk download and APIs. Bulk 
downloads and APIs both use HTTP, whether they serve tar files or 
anything else!

Next there is discussion of packaging in bulk using non-proprietary file 
formats (e.g., tar files). This has nothing to do with being 
nonproprietary. The point, I think, is archiving a directory structure 
into a single file.

Paragraph 3 is tautological. Data that is already streaming to the web 
is already published in a manner that allows immediate access. I think 
we mean to say "For data that is generated in real time or near real 
time, data publishers should use an automated system to enable immediate 
access to time-sensitive data, such as emergency information, weather 
forecasting data, or system monitoring metrics. In general, APIs should 
be available to allow third parties to automatically search and retrieve 
such data." If you want to then talk a little about APIs for other kinds 
of data, you could add a paragraph that goes like this:  "Aside from 
helping to automate real-time data pipelines, APIs are suitable for all 
kinds of data on the Web. Though they generally require more work than 
posting files for download, publishers are increasingly finding that 
delivering a well documented, standards-based, stable API is worth the 
effort."


19. bulk download
--

The intended outcome is focused on the wrong thing. It says "Bulk 
download will enable large file transfers (which would require more time 
than a typical user would consider reasonable) by dedicated 
file-transfer protocols." That's true, but it's not the point of the BP. 
The idea of allowing bulk download applies to datasets that are smaller 
as well as larger ones, and it need not involve alternative protocols. 
The outcome we are hoping for is that people will be able to easily 
download the data with a single request.

In the implementation section, the first bullet should clarify that it 
is about downloading. Making a request to one URI isn't unique to that 
bullet. (A bulk request to an API goes to one URI as well.) It should 
read "For datasets that exist initially as multiple files, preprocessing 
a copy of the data into a compressed archive format and making the data 
accessible for download from one URI."

The test should be about whether the full dataset can be retrieved with 
a single request, not whether the data is preprocessed. That test works 
for APIs as well as file downloads by humans.
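
To make the test concrete, it could be as simple as this sketch (URI 
made up):

import urllib.request

BULK_URI = "https://data.example.org/transit/passenger-counts.tar.gz"  # hypothetical

# One request should be enough to retrieve the complete dataset.
with urllib.request.urlopen(BULK_URI) as response:
    assert response.getcode() == 200
    with open("passenger-counts.tar.gz", "wb") as out:
        out.write(response.read())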


20. provide subsets
--

Change "Static datasets that users in the domain would consider to be 
large will be downloadable in smaller pieces" to "Static datasets that 
take some time to download will be downloadable in smaller pieces." It's 
true that being large is dependent on what users in the domain consider 
to be large, but the issue here is time, not largeness.

The second example needs to be updated to use the new text I provided, 
using CSV, as follows:
"The MyCity transit agency has been collecting detailed data about 
passenger usage for several years. This is a very large dataset, 
containing values for numbers of passengers by transit type, route, 
vehicle, driver, entry stop, exit stop, transit pass type, entry time, 
etc.  They have found that a wide variety of stakeholders are interested 
in downloading various subsets of the data. The folks who run each 
transit system want only the data for their transit mode, the city 
planners only want the numbers of entries and exits at each stop, the 
city budget office wants only the numbers for the various types of 
passes sold, and others want still different subsets. The agency created 
a web site where users can select which variables are of interest to 
them, set ranges on some variables, and download only the subset they need."

The How to Test section should say something about all the subsets adding up to the 
complete set. Didn't we have a test before that the entire dataset can 
be recovered by making a series of smaller requests? I think we had a 
note that coming up with use cases isn't deterministic enough.


21. Content negotiation
--

Content negotiation should be in the implementation section for BP 14, 
multiple formats, rather than its own BP. I think we've already agreed 
to change this, but I'll just reiterate that I'm not yet convinced that 
always using conneg is a best practice for serving multiple formats from 
an API. I like the use of file extensions, because they allow one to 
reference a resource as a URI instead of a URI plus required headers 
(plus a note explaining how to set headers). I also think it's good to 
allow tests of an API using a browser when possible. Since browsers 
don't let you set request headers, relying on conneg alone prevents 
that. Using both addresses most objections, but many people prefer 
conneg because it allows them to get file extensions out of URIs. 
Implementing both doesn't accomplish that. For file downloads, I think 
conneg is a worst practice, because browsers don't allow users to set 
headers. Anyway, we could argue a long time on this. There is still a 
lot of disagreement about this stuff.
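
To illustrate the difference I keep coming back to (URIs made up):

import urllib.request

# Content negotiation: the format lives in a request header, so the URI
# alone is not enough to reproduce the request (or to test it from a browser).
request = urllib.request.Request(
    "https://data.example.org/transit/stops",
    headers={"Accept": "text/csv"},
)
csv_via_conneg = urllib.request.urlopen(request).read()

# File extension: the format is part of the URI, so a plain link suffices.
csv_via_extension = urllib.request.urlopen(
    "https://data.example.org/transit/stops.csv"
).read()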


22. Real-time access
--

Subtitle: I still don't know what it means for data to be "produced in 
real time". The other day I posted some log data from a supercomputing 
system. That data is produced constantly, and it appears in the logs 
immediately when an event happens. That feels to me like real time, but 
I don't think it is appropriate to publish on the web in real time, 
because the purpose of posting is detailed analysis, not monitoring. On 
the other hand, preparing the log data for publishing is slow, so maybe 
that's the real measure. Maybe it should be "When data is released in 
real time . . ."

The intended outcome defines near real time as with a predetermined 
delay. The U.S. Census has a predetermined delay of 10 years, and that 
is not near real time. See 
https://en.wikipedia.org/wiki/Real-time_computing#Near_real-time for 
some help.

I don't understand the Push approach to implementation. I think the last 
word was intended to be "publisher." "Disseminating" is vague and not 
particularly push-y, and making storage available is certainly not 
push-y. The last sentence of the implementation section is garbled. I 
think real-time data implementation is better broken into streaming or 
not streaming. It would be helpful to give some info about those 
alternatives.

The example doesn't use the transport agency, and it doesn't show how to 
implement real-time data. It would be more appropriate as an example of 
an API.

Mention of PROV-O in the test is unnecessary and off point. A more 
appropriate test might be to measure the refresh frequency and see that 
it matches the update frequency of the source data, and to measure the 
latency and see if it is in the real-time or near-real-time range.
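
A rough sketch of the latency check, assuming a made-up JSON feed whose 
records carry offset-aware ISO 8601 timestamps:

import json
import urllib.request
from datetime import datetime, timezone

FEED_URI = "https://data.example.org/transit/vehicle-positions.json"  # hypothetical

with urllib.request.urlopen(FEED_URI) as response:
    records = json.load(response)

newest = max(datetime.fromisoformat(r["timestamp"]) for r in records)
latency = datetime.now(timezone.utc) - newest

# A few seconds suggests real time; minutes may still qualify as near real time.
print(f"Latency: {latency.total_seconds():.0f} seconds")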


23. up to date
--

The Why text is unclear as to what type of coincidence is desired and 
what should coincide with what. Similar to the real-time BP, I think the 
issue here is that the publication frequency should match the release 
frequency. Anyway, the first sentence is not about why. It belongs in 
the intended outcome.

I suggest the intended outcome says "Data on the web should be updated 
in a timely manner so that the most recent data available online 
reflects the most recent data released. When new data is released via 
any channel, it should be made available on the Web as soon as possible 
thereafter."

We could use a transit example about real-time bus arrival predictions.

The first sentence of the test reads like a note to ourselves to write a 
test. One step is "publish an updated version of data." That is not 
something one can do whenever a test is needed. More importantly, that 
test only determines whether there is a difference between two versions 
of the data. What it should be testing is the timeliness of the most 
recent data.
The test could be to check that the update frequency is stated and that 
the most recently published copy on the Web is no older than the stated 
update frequency.
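
Roughly, the check I have in mind looks like this (URI and frequency are 
made up; the frequency would come from the dataset's metadata):

import urllib.request
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DATASET_URI = "https://data.example.org/transit/bus-arrivals.csv"  # hypothetical
STATED_FREQUENCY = timedelta(days=1)  # hypothetical; read from the metadata

head = urllib.request.Request(DATASET_URI, method="HEAD")
with urllib.request.urlopen(head) as response:
    last_modified = parsedate_to_datetime(response.headers["Last-Modified"])

age = datetime.now(timezone.utc) - last_modified
assert age <= STATED_FREQUENCY, "published copy is older than the stated update frequency"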


24. use APIs
--

The test should say that a test client can simulate calls and the API 
returns the expected responses. (The test client doesn't simulate the 
responses.)
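
Something like this little sketch is what I'd call a test client (URI 
and fields made up); the expected responses come from the API, not from 
the client:

import json
import urllib.request

API_BASE = "https://data.example.org/api/v1"  # hypothetical

def test_stop_lookup():
    with urllib.request.urlopen(API_BASE + "/stops/42") as response:
        assert response.getcode() == 200
        stop = json.load(response)
    # The API, not the test client, supplies these values.
    assert "name" in stop and "latitude" in stop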


25. web standard APIs
--

We should have some references for REST.
Richardson, Leonard, and Sam Ruby, RESTful Web Services, O'Reilly, 2007, 
http://restfulwebapis.org/rws.html.

Fielding, Roy T., "Representational State Transfer (REST)", Chapter 5 of 
Architectural Styles and the Design of Network-based Software 
Architectures, Ph.D. Dissertation, University of California, Irvine, 
2000, 
https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm.


26. document APIs
--

The intended outcome has a run-on sentence. The last sentence of the 
outcome starts with "It", which lacks an antecedent. (What is the "it"?)

The examples are all spatial data examples. None of them really makes 
sense in this context. We should probably offer examples for the 
transport agency.

Can we use a test like "time to first successful call"? That would 
require having volunteers to learn to use the API and timing them.


27. avoid breaking APIs
--

In the implementation section, "home resource URI" gets used as a 
plural, but an API should have only one home resource URI. Removing 
"home" from the first occurrence makes more sense: "...by keeping 
resource URIs constant..."

The bit about announcing changes should go in the outcome section.


28. assess coverage
--

This is about archiving, not publishing, and is out of scope.

I disagree that the meaning of a resource is dependent on the rest of 
the global graph. The web was invented to enable links to be added 
without changing the thing linked to. What the resource itself links to 
affects its meaning, however.

The test is all about developing a score, which I guess is for 
determining what to archive? That doesn't seem to be about publishing.


29. trusted serialization
--

If we keep this, it should at least offer JSON as an acceptable example. 
JSON is the current overwhelming standard for APIs.

This talks about "sending data dumps for long-term preservation" and 
"data depositors". Where are the data being sent? Is it on the Web?

The bad example would pass the How to Test.


30. update identifiers
--

It's not quite clear what we are suggesting get linked to what. The Why 
talks about linking preserved datasets with the original URI. Are we 
saying the original URI should continue to point to the preserved 
dataset? If that's the case, then what does preservation mean? There is 
also discussion of saving snapshots as versions, which seems to me to be 
better covered under versioning.

We say "A link is maintained between the URI of a resource, the most 
up-to-date description available for it, and preserved descriptions." 
One link can only join two resources. Should people preserve old 
descriptions? Maybe descriptions of older versions are what was meant?

A 410 status only makes sense if there's nothing served at the URI, 
which isn't the case if the advice here is followed. 303 seems like a 
good option.
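
For instance, I'd expect the original URI to behave roughly like this 
(URIs made up):

import urllib.error
import urllib.request

ORIGINAL_URI = "https://data.example.org/dataset/2010-survey"      # hypothetical
PRESERVED_URI = "https://archive.example.org/dataset/2010-survey"  # hypothetical

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # don't follow the redirect; we want to see the 303 itself

opener = urllib.request.build_opener(NoRedirect)
try:
    opener.open(ORIGINAL_URI)
except urllib.error.HTTPError as err:  # an unfollowed 3xx surfaces as an HTTPError
    assert err.code == 303
    assert err.headers["Location"] == PRESERVED_URI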


Feedback intro
--

Second paragraph: The word blog is misused. A blog is a web site for one 
person to serially publish comments, not for many people to enter single 
comments.
Also, I disagree with this sentence: "In order to quantify and analyze 
usage feedback, it should be recorded in a machine-readable format." I 
think using automated tools to gather feedback and store it in a 
searchable way is a good idea, but saying the feedback should be machine 
readable is misleading and insufficiently specific. If you have 
succeeded in posting feedback on the web, it is machine readable by 
definition. It sounds like we are telling people to publish their 
feedback as another dataset. You may want to store it in a 
machine-readable way for the purpose of displaying it to other humans, 
but there's no reason to *publish* the feedback with machines in mind.


31. Gather feedback
--
This BP includes recommendations about making feedback public, but 
that's handled in the next BP. We should keep this BP focused on 
enabling feedback.

The first sentence of the Why needs rewriting. We should remove the word 
"providing" at the beginning. The BP is about collecting feedback, not 
providing it. It should address the value of setting up a specific way 
of collecting feedback (makes it easier for consumers to contribute).

Remove the mention of machine-readable formats and using a vocabulary 
for capturing the semantics of the feedback information. Instead, 
suggest using an automated feedback system, such as a bug tracker.

How to test, the first bullet is a note to us, I guess. The second is 
partially about the next BP. The third is again treating the feedback 
data as another published dataset. There's nothing wrong with publishing 
such a dataset, but that's not the idea here. A real test would be 
whether a consumer is able to find a way to provide feedback.


32. Publish feedback
--

The subtitle should say "Feedback should be available for human users." 
There is no expectation that feedback be provided as a dataset for 
consumption by systems outside the publisher. (It takes a bit of trust 
for a publisher to put the info out for human consumption, and making it 
processable by external systems can be seen as breaching that trust.)

The Why should mention avoiding duplication and being transparent about 
the quality of the data.

The intended outcome is tautological. It should include the idea that 
consumers should be able to review issues already raised by others, 
saving them the trouble of filing duplicate bug reports. Publishing 
feedback also helps consumers understand any issues that may affect 
their ability to use the data.

The implementation section needs to be changed. We should not be telling 
people that they need to present their feedback in machine-readable form.

The test is again about metadata for the feedback as a dataset. 
Publishing your feedback as a dataset is not a best practice.


33. enrich data
--

I'm concerned that we need to be more explicit about the pitfalls of 
generating new values. For scientific data, this is a VERY touchy 
subject. People have lost jobs and scientific careers by doing that.

The Why needs a few caveats. "Under some circumstances, missing values 
can be filled in, and ..." "Publishing more complete datasets can 
enhance trust, if done properly and ethically."

In the intended outcome, "should be enhanced if possible" is too strong. 
The first paragraph could be "Data that is unstructured should be given 
structure if possible. Additional derived measures or attributes should 
be added if they enhance utility. A dataset that has missing values can 
be enhanced to fill in those values if the addition does not distort 
analytical results, significance, or statistical power."


37. cite source
--

The first line of the example ("You can cite the original...") should 
replace the text above it ("You can use the Dataset Usage...").
The example citation should list the transit agency as the author.
'Data source: MyCity Transport Agency, "Bus Timetable of MyCity...'



glossary
--

The link to CITO under "citation" doesn't go directly to cito. It should 
probably go to http://www.sparontologies.net/ontologies/cito/source.html

The definition of locale needs to mention geographic location.

The definition of machine readable data surprises me. I think 
proprietary formats are machine readable, too. If we want to steer 
people away from proprietary formats, we should do that explicitly.


Challenges
--

In the diagram, the challenge texts should be similar, either all 
statements or all questions. Suggestion for the reuse one: "How can I 
reuse responsibly?" (The current question sounds a little too self-serving.)



-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory

Received on Thursday, 21 April 2016 01:20:16 UTC