- From: Annette Greiner <amgreiner@lbl.gov>
- Date: Wed, 20 Apr 2016 18:19:44 -0700
- To: DWBP Public List <public-dwbp-wg@w3.org>
Hi folks,
I've finished going through the entire BP document, and the following
are my notes on the second half of the doc.
There is so much good stuff here, it is really getting to be a nice
reference. I have a bunch of notes that I think we need to address for
the next version, since doing so later will make life more difficult,
especially for our hard-working editors. My apologies for not going
through this at this level sooner.
Cheers,
-Annette
13. standardized formats
--
"machine-readable" is used differently here than in the metadata
section. Technically, nothing on the web is not machine-readable. I
think we could remove that phrase.
"adequate for its intended or potential use" doesn't really help in
choosing. That's like saying "data on the web must be good."
The intended outcome should be more normative. "Data should be available
in a standardized format that is easily parseable" belongs in the
intended outcome. We could add that data should not be posted as an
image unless the data itself encodes the image. (A jpeg file of a table
is an image that encodes the data; RGB channel data from an imaging
microscope is data that encodes an image.)
The example is metadata, not data. This BP is about formats for the data.
14. multiple formats
--
The example says John decided to use XML, but it shows ttl, and it shows
metadata, not data.
The trend lately is toward doing a single format (json). Do we want to
go against that trend? I note that the W3C's own API is json only.
Intro to Vocabularies
--
"possible subjects for books" is not a good example of a controlled
vocabulary. There are two other good examples, so that could just be
removed.
The first paragraph seems to be suggesting that controlled vocabularies
enable easy translation, but it's confusingly phrased. The last three
sentences could be changed to read "Standardized vocabularies can also
serve to improve the usability of datasets. Say a dataset contains a
reference to a concept described in a vocabulary that has been
translated into several languages. Such a reference allows applications
to localize their display of the data depending on the language of the
user."
The last paragraph refers to "the former kind of vocabulary". It's not
clear what kind that is. It's not clear what the point of that paragraph is.
15. standardized terms
--
16. Reuse vocabs
--
This still seems too similar to BP 15. See my separate email about this one.
17. right formalization
--
We say "a data publisher should opt for a level of formal semantics that
fit data and applications." We don't want to tell people to use formal
semantics, just to pick the right level of formalization. This should be
changed to "a data publisher should opt for a level of semantic
formalization that fits the data and its applications."
The intended outcome doesn't specifically address the issue of
formalization.
The examples are about design of vocabularies, not about choosing them.
The test is hard to do. Finding and implementing an inference engine
seems beyond the scope of publishing data.
Sensitive data intro
--
the discussion of sensitive data still needs a disclaimer, and the text
should be more general rather than focused only on personal privacy.
18. data unavailability
--
"data unavailability reference" is awkard and unclear. Could we say
"Provide an explanation for data that is not available."
We should address testing machine-readability by saying that a
legitimate http response code in the 400 or 500 range should be returned.
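Something like this could serve as the machine-readability test (just a
sketch; the URL is made up):

    # Minimal sketch: check that a request for unavailable data returns a
    # legitimate 4xx/5xx status code rather than a 200 with an apology page.
    # The URL below is hypothetical.
    import urllib.request
    import urllib.error

    def unavailable_data_returns_error(url):
        try:
            urllib.request.urlopen(url)
        except urllib.error.HTTPError as e:
            return 400 <= e.code < 600
        return False  # a 2xx/3xx response means the test fails

    print(unavailable_data_returns_error("http://data.mycity.example/withdrawn-dataset"))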
Data access intro
--
We say the web uses http by default and then say that different
approaches can be adopted, including bulk download and APIs. Bulk
download and APIs, of tar files or anything else, both use HTTP!
Next there is discussion of packaging in bulk using non-proprietary file
formats (e.g., tar files). This has nothing to do with being
nonproprietary. The point, I think, is archiving a directory structure
into a single file.
Paragraph 3 is tautological. Data that is already streaming to the web
is already published in a manner that allows immediate access. I think
we mean to say "For data that is generated in real time or near real
time, data publishers should use an automated system to enable immediate
access to time-sensitive data, such as emergency information, weather
forecasting data, or system monitoring metrics. In general, APIs should
be available to allow third parties to automatically search and retrieve
such data." If you want to then talk a little about APIs for other kinds
of data, you could add a paragraph that goes like this: "Aside from
helping to automate real-time data pipelines, APIs are suitable for all
kinds of data on the Web. Though they generally require more work than
posting files for download, publishers are increasingly finding that
delivering a well documented, standards-based, stable API is worth the
effort."
19. bulk download
--
The intended outcome is focused on the wrong thing. It says "Bulk
download will enable large file transfers (which would require more time
than a typical user would consider reasonable) by dedicated
file-transfer protocols." That's true, but it's not the point of the BP.
The idea of allowing bulk download applies to datasets that are smaller
as well as larger ones, and it need not involve alternative protocols.
The outcome we are hoping for is that people will be able to easily
download the data with a single request.
In the implementation section, the first bullet should clarify that it
is about downloading. Making a request to one URI isn't unique to that
bullet. (A bulk request to an API goes to one URI as well.) It should
read "For datasets that exist initially as multiple files, preprocessing
a copy of the data into a compressed archive format and making the data
accessible for download from one URI."
The test should be about whether the full dataset can be retrieved with
a single request, not whether the data is preprocessed. That test works
for APIs as well as file downloads by humans.
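To make the preprocessing idea and the single-request test concrete, here
is a sketch (the directory name and URL are made up):

    # Minimal sketch: package multiple source files into one compressed
    # archive so the whole dataset can be fetched with a single request.
    # "transit-data/" and the download URL are hypothetical.
    import tarfile
    import urllib.request

    with tarfile.open("mycity-transit-full.tar.gz", "w:gz") as archive:
        archive.add("transit-data/", arcname="mycity-transit")

    # The test is then simply: can the complete dataset be retrieved
    # with a single request?
    with urllib.request.urlopen("http://data.mycity.example/mycity-transit-full.tar.gz") as r:
        assert r.getcode() == 200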
20. provide subsets
--
Change "Static datasets that users in the domain would consider to be
large will be downloadable in smaller pieces" to "Static datasets that
take some time to download will be downloadable in smaller pieces." It's
true that being large is dependent on what users in the domain consider
to be large, but the issue here is time, not largeness.
The second example needs to be updated to use the new text I provided,
using CSV, as follows:
"The MyCity transit agency has been collecting detailed data about
passenger usage for several years. This is a very large dataset,
containing values for numbers of passengers by transit type, route,
vehicle, driver, entry stop, exit stop, transit pass type, entry time,
etc. They have found that a wide variety of stakeholders are interested
in downloading various subsets of the data. The folks who run each
transit system want only the data for their transit mode, the city
planners only want the numbers of entries and exits at each stop, the
city budget office wants only the numbers for the various types of
passes sold, and others want still different subsets. The agency created
a web site where users can select which variables are of interest to
them, set ranges on some variables, and download only the subset they need."
How to test should say something about all the subsets adding up to the
complete set. Didn't we have a test before that the entire dataset can
be recovered by making a series of smaller requests? I think we had a
note that coming up with use cases isn't deterministic enough.
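A deterministic version of that test could look roughly like this (the
URLs and the bus/tram/subway split are made up):

    # Minimal sketch: fetch each published subset and check that together
    # they add up to the complete dataset (here by comparing CSV records,
    # skipping the header row). All URLs are hypothetical.
    import csv
    import io
    import urllib.request

    def fetch_rows(url):
        with urllib.request.urlopen(url) as r:
            text = r.read().decode("utf-8")
        return list(csv.reader(io.StringIO(text)))[1:]  # skip header

    subset_urls = [
        "http://data.mycity.example/transit/by-mode/bus.csv",
        "http://data.mycity.example/transit/by-mode/tram.csv",
        "http://data.mycity.example/transit/by-mode/subway.csv",
    ]
    full_rows = fetch_rows("http://data.mycity.example/transit/full.csv")
    subset_rows = [row for url in subset_urls for row in fetch_rows(url)]
    assert sorted(subset_rows) == sorted(full_rows)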
21. Content negotiation
--
Content negotiation should be in the implementation section for BP 14,
multiple formats, rather than its own BP. I think we've already agreed
to change this, but I'll just reiterate that I'm not yet convinced that
always using conneg is a best practice for serving multiple formats from
an API. I like the use of file extensions, because they allow one to
reference a resource as a URI instead of a URI plus required headers
(plus a note explaining how to set headers). I also think it's good to
allow tests of an API using a browser when possible. Since browsers
don't let you set request headers, relying on conneg alone prevents
that. Using both addresses most objections, but many people prefer
conneg because it allows them to get file extensions out of URIs.
Implementing both doesn't accomplish that. For file downloads, I think
conneg is a worst practice, because browsers don't allow users to set
headers. Anyway, we could argue a long time on this. There is still a
lot of disagreement about this stuff.
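For what it's worth, supporting both is not much work. A sketch of the
"extension wins, then conneg, then a default" resolution I have in mind
(all names here are made up):

    # Minimal sketch: resolve the response format from either a file
    # extension or an Accept header, with the extension taking precedence
    # so that a bare URI typed into a browser still works.
    EXTENSIONS = {".json": "application/json", ".csv": "text/csv", ".xml": "application/xml"}

    def resolve_format(path, accept_header):
        for ext, media_type in EXTENSIONS.items():
            if path.endswith(ext):
                return media_type
        for media_type in EXTENSIONS.values():
            if media_type in accept_header:
                return media_type
        return "application/json"  # default when neither is given

    print(resolve_format("/stops/42.csv", "*/*"))              # text/csv
    print(resolve_format("/stops/42", "application/xml, */*")) # application/xml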
22. Real-time access
--
Subtitle: I still don't know what it means for data to be "produced in
real time". The other day I posted some log data from a supercomputing
system. That data is produced constantly, and it appears in the logs
immediately when an event happens. That feels to me like real time, but
I don't think it is appropriate to publish on the web in real time,
because the purpose of posting is detailed analysis, not monitoring. On
the other hand, preparing the log data for publishing is slow, so maybe
that's the real measure. Maybe it should be "When data is released in
real time . . ."
The intended outcome defines near real time as with a predetermined
delay. The U.S. Census has a predetermined delay of 10 years, and that
is not near real time. See
https://en.wikipedia.org/wiki/Real-time_computing#Near_real-time for
some help.
I don't understand the Push approach to implementation. I think the last
word was intended to be publisher. "Disseminating" is vague and not
particularly push-y, and making storage available is certainly not
push-y. The last sentence of the implementation section is garbled. I
think real-time data implementation is better broken into streaming or
not streaming. It would be helpful to give some info about those
alternatives.
The example doesn't use the transport agency, and it doesn't show how to
implement real-time data. It would be more appropriate as an example of
an API.
Mention of PROV-O in the test is unnecessary and off point. A more
appropriate test might be to measure the refresh frequency and see that
it matches the update frequency of the source data, and to measure the
latency and see if it is in the real-time or near-real-time range.
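The latency half of that test could be as simple as this (the URL, the
field name, and the 10-second threshold are all assumptions):

    # Minimal sketch: compare the timestamp of the most recent record
    # published on the Web with the time of the request, and check that
    # the latency falls within an agreed real-time window.
    import json
    import urllib.request
    from datetime import datetime, timezone

    with urllib.request.urlopen("http://data.mycity.example/bus-positions/latest") as r:
        latest = json.load(r)

    # assumes an ISO 8601 timestamp with a UTC offset
    observed = datetime.fromisoformat(latest["observation_time"])
    latency = datetime.now(timezone.utc) - observed
    assert latency.total_seconds() < 10  # adjust to match the source's update frequency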
23. up to date
--
The Why text is unclear as to what type of coincidence is desired and
what should coincide with what. Similar to the real-time BP, I think the
issue here is that the publication frequency should match the release
frequency. Anyway, the first sentence is not about why. It belongs in
the intended outcome.
I suggest the intended outcome says "Data on the web should be updated
in a timely manner so that the most recent data available online
reflects the most recent data released. When new data is released via
any channel, it should be made available on the Web as soon as possible
thereafter."
We could use a transit example about real-time bus arrival predictions.
The first sentence of the test reads like a note to ourselves to write a
test. One step is "publish an updated version of data." That is not
something one can do whenever a test is needed. More importantly, that
test only determines whether there is a difference between two versions
of the data. What it should be testing is the timeliness of the most
recent data.
The test could be to check that the update frequency is stated and that
the most recently published copy on the Web is no older than the stated
update frequency.
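That version of the test is easy to automate. A sketch, assuming the
dataset's metadata exposes a stated update frequency in days and the
modification date of the most recent copy (field names and URL are made up):

    # Minimal sketch: check that the most recently published copy on the
    # Web is no older than the stated update frequency.
    import json
    import urllib.request
    from datetime import datetime, timezone, timedelta

    with urllib.request.urlopen("http://data.mycity.example/transit/metadata.json") as r:
        meta = json.load(r)

    stated_frequency = timedelta(days=meta["update_frequency_days"])
    last_published = datetime.fromisoformat(meta["modified"])  # ISO 8601 with UTC offset
    assert datetime.now(timezone.utc) - last_published <= stated_frequency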
24. use APIs
--
The test should say that a test client can simulate calls and the API
returns the expected responses. (The test client doesn't simulate the
responses.)
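In other words, the test client makes real calls and checks what comes
back, something like this (the endpoint and fields are hypothetical):

    # Minimal sketch: a test client calls the API and checks that the
    # responses match what the documentation promises.
    import json
    import urllib.request

    def get_json(url):
        with urllib.request.urlopen(url) as r:
            assert r.getcode() == 200
            return json.load(r)

    stop = get_json("http://api.mycity.example/v1/stops/42")
    assert stop["stop_id"] == 42
    assert "arrivals" in stop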
25. web standard APIs
--
We should have some references for REST.
Richardson, L. and Ruby, S., RESTful Web Services, O'Reilly, 2007,
http://restfulwebapis.org/rws.html.
Fielding, Roy T., "Representational State Transfer (REST)", Chapter 5 of
Architectural Styles and the Design of Network-based Software
Architectures, Ph.D. Dissertation, University of California, Irvine, 2000,
https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm.
26. document APIs
--
The intended outcome has a run-on sentence. The last sentence of the
outcome starts with "It", which lacks an antecedent. (What is the "it"?)
The examples are all spatial data examples. None of them really makes
sense in this context. We should probably offer examples for the
transport agency.
Can we use a test like "time to first successful call"? That would
require having volunteers to learn to use the API and timing them.
27. avoid breaking APIs
--
In the implementation section, "home resource URI" gets used as a
plural, but an API should only have one. Remove "home" in the first one
and it makes more sense. "...by keeping resource URIs constant..."
The bit about announcing changes should go in the outcome section.
28. assess coverage
--
This is about archiving, not publishing, and is out of scope.
I disagree that the meaning of a resource is dependent on the rest of
the global graph. The web was invented to enable links to be added
without changing the thing linked to. What the resource itself links to
affects its meaning, however.
The test is all about developing a score, which I guess is for
determining what to archive? That doesn't seem to be about publishing.
29. trusted serialization
--
If we keep this, it should at least offer JSON as an acceptable example.
JSON is the current overwhelming standard for APIs.
This talks about "sending data dumps for long-term preservation" and
"data depositors". Where are the data being sent? Is it on the Web?
The bad example would pass the How to Test.
30. update identifiers
--
It's not quite clear what we are suggesting get linked to what. The Why
talks about linking preserved datasets with the original URI. Are we
saying the original URI should continue to point to the preserved
dataset? If that's the case, then what does preservation mean? There is
also discussion of saving snapshots as versions, which seems to me is
covered better under versioning.
We say "A link is maintained between the URI of a resource, the most
up-to-date description available for it, and preserved descriptions."
One link can only join two resources. Should people preserve old
descriptions? Maybe descriptions of older versions are what was meant?
A 410 status only makes sense if there's nothing served at the URI,
which isn't the case if the advice here is followed. 303 seems like a
good option.
Feedback intro
--
Second paragraph: The word blog is misused. A blog is a web site for one
person to serially publish comments, not for many people to enter single
comments.
Also, I disagree with this sentence: "In order to quantify and analyze
usage feedback, it should be recorded in a machine-readable format." I
think using automated tools to gather feedback and store it in a
searchable way is a good idea, but saying the feedback should be machine
readable is misleading and insufficiently specific. If you have
succeeded in posting feedback on the web, it is machine readable by
definition. It sounds like we are telling people to publish their
feedback as another dataset. You may want to store it in a
machine-readable way for the purpose of displaying it to other humans,
but there's no reason to *publish* the feedback with machines in mind.
31. Gather feedback
--
This BP includes recommendations about making feedback public, but
that's handled in the next BP. We should keep this BP focused on
enabling feedback.
The first sentence of the Why needs rewriting. We should remove the word
"providing" at the beginning. The BP is about collecting feedback, not
providing it. It should address the value of setting up a specific way
of collecting feedback (makes it easier for consumers to contribute).
Remove the mention of machine-readable formats and using a vocabulary
for capturing the semantics of the feedback information. Instead,
suggest using an automated feedback system, such as a bug tracker.
How to test, the first bullet is a note to us, I guess. The second is
partially about the next BP. The third is again treating the feedback
data as another published dataset. There's nothing wrong with publishing
such a dataset, but that's not the idea here. A real test would be
whether a consumer is able to find a way to provide feedback.
32. Publish feedback
--
The subtitle should say "Feedback should be available for human users."
There is no expectation that feedback be provided as a dataset for
consumption by systems outside the publisher. (It takes a bit of trust
for a publisher to put the info out for human consumption, and making it
processable by external systems can be seen as breaching that trust.)
The Why should mention avoiding duplication and being transparent about
the quality of the data.
The intended outcome is tautological. It should include the idea that
consumers should be able to review issues already raised by others,
saving them the trouble of filing duplicate bug reports. Publishing
feedback also helps consumers understand any issues that may affect
their ability to use the data.
The implementation section needs to be changed. We should not be telling
people that they need to present their feedback in machine-readable form.
The test is again about metadata for the feedback as a dataset.
Publishing your feedback as a dataset is not a best practice.
33. enrich data
--
I'm concerned that we need to be more explicit about the pitfalls of
generating new values. For scientific data, this is a VERY touchy
subject. People have lost jobs and scientific careers by doing that.
The Why needs a few caveats. "Under some circumstances, missing values
can be filled in, and ..." "Publishing more complete datasets can
enhance trust, if done properly and ethically."
In the intended outcome, "should be enhanced if possible" is too strong.
The first paragraph could be "Data that is unstructured should be given
structure if possible. Additional derived measures or attributes should
be added if they enhance utility. A dataset that has missing values can
be enhanced to fill in those values if the addition does not distort
analytical results, significance, or statistical power."
37. cite source
--
The first line of the example ("You can cite the original...") should
replace the text above it ("You can use the Dataset Usage...")
The example citation should list the transit agency as the author.
'Data source: MyCity Transport Agency, "Bus Timetable of MyCity...'
glossary
--
The link to CITO under "citation" doesn't go directly to cito. It should
probably go to http://www.sparontologies.net/ontologies/cito/source.html
The definition of locale needs to mention geographic location.
The definition of machine readable data surprises me. I think
proprietary formats are machine readable, too. If we want to steer
people away from proprietary formats, we should do that explicitly.
Challenges
--
In the diagram, the challenge texts should be similar, either all
statements or all questions. Suggestion for the reuse one: "How can I
reuse responsibly?" (The current question sounds a little too self-serving.)
--
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory