- From: Annette Greiner <amgreiner@lbl.gov>
- Date: Wed, 20 Apr 2016 18:19:44 -0700
- To: DWBP Public List <public-dwbp-wg@w3.org>
Hi folks,

I've finished going through the entire BP document, and the following are my notes on the second half of the doc. There is so much good stuff here, it is really getting to be a nice reference. I have a bunch of notes that I think we need to address for the next version, since doing so later will make life more difficult, especially for our hard-working editors. My apologies for not going through this at this level sooner.

Cheers,
-Annette

13. standardized formats -- "machine-readable" is used differently here than in the metadata section. Technically, nothing on the web is not machine-readable. I think we could remove that phrase. "adequate for its intended or potential use" doesn't really help in choosing. That's like saying "data on the web must be good." The intended outcome should be more normative. "Data should be available in a standardized format that is easily parseable" belongs in the intended outcome. We could add that data should not be posted as an image unless the data itself encodes the image. (A jpeg file of a table is an image that encodes the data; RGB channel data from an imaging microscope is data that encodes an image.) The example is metadata, not data. This BP is about formats for the data.

14. multiple formats -- The example says John decided to use XML, but it shows ttl, and it shows metadata, not data. The trend lately is toward doing a single format (JSON). Do we want to go against that trend? I note that the W3C's own API is JSON only.

Intro to Vocabularies -- "possible subjects for books" is not a good example of a controlled vocabulary. There are two other good examples, so that could just be removed. The first paragraph seems to be suggesting that controlled vocabularies enable easy translation, but it's confusingly phrased. The last three sentences could be changed to read "Standardized vocabularies can also serve to improve the usability of datasets. Say a dataset contains a reference to a concept described in a vocabulary that has been translated into several languages. Such a reference allows applications to localize their display of the data depending on the language of the user." The last paragraph refers to "the former kind of vocabulary". It's not clear what kind that is. It's not clear what the point of that paragraph is.

15. standardized terms --

16. Reuse vocabs -- This still seems too similar to BP 15. See my separate email about this one.

17. right formalization -- We say "a data publisher should opt for a level of formal semantics that fit data and applications." We don't want to tell people to use formal semantics, just to pick the right level of formalization. This should be changed to "a data publisher should opt for a level of semantic formalization that fits the data and its applications." The intended outcome doesn't specifically address the issue of formalization. The examples are about design of vocabularies, not about choosing them. The test is hard to do. Finding and implementing an inference engine seems beyond the scope of publishing data.

Sensitive data intro -- The discussion of sensitive data still needs a disclaimer, and the text should be more general rather than focused only on personal privacy.

18. data unavailability -- "data unavailability reference" is awkward and unclear. Could we say "Provide an explanation for data that is not available"? We should address testing machine-readability by saying that a legitimate HTTP response code in the 400 or 500 range should be returned.
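To be concrete about that test, here's a minimal sketch of how it could be checked programmatically (the URI is a placeholder I made up, and I'm assuming the Python requests library purely for illustration):

```python
# Sketch of the suggested test: the URI of an unavailable dataset should answer
# with a legitimate 400- or 500-range status code, not a 200 carrying an HTML
# apology page.
import requests

def explains_unavailability(uri):
    response = requests.get(uri, timeout=10)
    # Any 4xx or 5xx code is a machine-readable signal that the data is unavailable.
    return 400 <= response.status_code < 600

# Example (hypothetical URI):
# explains_unavailability("https://data.mycity.example/datasets/withdrawn-survey")
```

Something along those lines would make the test mechanical rather than a judgment call.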
Data access intro -- We say the web uses HTTP by default and then say that different approaches can be adopted, including bulk download and APIs. Bulk download and APIs, of tar files or anything else, both use HTTP! Next there is discussion of packaging in bulk using non-proprietary file formats (e.g., tar files). This has nothing to do with being non-proprietary. The point, I think, is archiving a directory structure into a single file. Paragraph 3 is tautological. Data that is already streaming to the web is already published in a manner that allows immediate access. I think we mean to say "For data that is generated in real time or near real time, data publishers should use an automated system to enable immediate access to time-sensitive data, such as emergency information, weather forecasting data, or system monitoring metrics. In general, APIs should be available to allow third parties to automatically search and retrieve such data." If you want to then talk a little about APIs for other kinds of data, you could add a paragraph that goes like this: "Aside from helping to automate real-time data pipelines, APIs are suitable for all kinds of data on the Web. Though they generally require more work than posting files for download, publishers are increasingly finding that delivering a well-documented, standards-based, stable API is worth the effort."

19. bulk download -- The intended outcome is focused on the wrong thing. It says "Bulk download will enable large file transfers (which would require more time than a typical user would consider reasonable) by dedicated file-transfer protocols." That's true, but it's not the point of the BP. The idea of allowing bulk download applies to datasets that are smaller as well as larger ones, and it need not involve alternative protocols. The outcome we are hoping for is that people will be able to easily download the data with a single request. In the implementation section, the first bullet should clarify that it is about downloading. Making a request to one URI isn't unique to that bullet. (A bulk request to an API goes to one URI as well.) It should read "For datasets that exist initially as multiple files, preprocessing a copy of the data into a compressed archive format and making the data accessible for download from one URI." The test should be about whether the full dataset can be retrieved with a single request, not whether the data is preprocessed. That test works for APIs as well as file downloads by humans.

20. provide subsets -- Change "Static datasets that users in the domain would consider to be large will be downloadable in smaller pieces" to "Static datasets that take some time to download will be downloadable in smaller pieces". It's true that being large is dependent on what users in the domain consider to be large, but the issue here is time, not largeness. The second example needs to be updated to use the new text I provided, using CSV, as follows: "The MyCity transit agency has been collecting detailed data about passenger usage for several years. This is a very large dataset, containing values for numbers of passengers by transit type, route, vehicle, driver, entry stop, exit stop, transit pass type, entry time, etc. They have found that a wide variety of stakeholders are interested in downloading various subsets of the data.
The folks who run each transit system want only the data for their transit mode, the city planners only want the numbers of entries and exits at each stop, the city budget office wants only the numbers for the various types of passes sold, and others want still different subsets. The agency created a web site where users can select which variables are of interest to them, set ranges on some variables, and download only the subset they need." How to test should say something about all the subsets adding up to the complete set. Didn't we have a test before that the entire dataset can be recovered by making a series of smaller requests? I think we had a note that coming up with use cases isn't deterministic enough.

21. Content negotiation -- Content negotiation should be in the implementation section for BP 14, multiple formats, rather than its own BP. I think we've already agreed to change this, but I'll just reiterate that I'm not yet convinced that always using conneg is a best practice for serving multiple formats from an API. I like the use of file extensions, because they allow one to reference a resource as a URI instead of a URI plus required headers (plus a note explaining how to set headers). I also think it's good to allow tests of an API using a browser when possible. Since browsers don't let you set request headers, relying on conneg alone prevents that. Using both addresses most objections, but many people prefer conneg because it allows them to get file extensions out of URIs. Implementing both doesn't accomplish that. For file downloads, I think conneg is a worst practice, because browsers don't allow users to set headers. Anyway, we could argue a long time on this. There is still a lot of disagreement about this stuff.

22. Real-time access -- Subtitle: I still don't know what it means for data to be "produced in real time". The other day I posted some log data from a supercomputing system. That data is produced constantly, and it appears in the logs immediately when an event happens. That feels to me like real time, but I don't think it is appropriate to publish on the web in real time, because the purpose of posting is detailed analysis, not monitoring. On the other hand, preparing the log data for publishing is slow, so maybe that's the real measure. Maybe it should be "When data is released in real time . . ." The intended outcome defines near real time as with a predetermined delay. The U.S. Census has a predetermined delay of 10 years, and that is not near real time. See https://en.wikipedia.org/wiki/Real-time_computing#Near_real-time for some help. I don't understand the Push approach to implementation. I think the last word was intended to be "publisher". "Disseminating" is vague and not particularly push-y, and making storage available is certainly not push-y. The last sentence of the implementation section is garbled. I think real-time data implementation is better broken into streaming or not streaming. It would be helpful to give some info about those alternatives. The example doesn't use the transport agency, and it doesn't show how to implement real-time data. It would be more appropriate as an example of an API. Mention of PROV-O in the test is unnecessary and off point. A more appropriate test might be to measure the refresh frequency and see that it matches the update frequency of the source data, and to measure the latency and see if it is in the real-time or near-real-time range.
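For what it's worth, here's a rough sketch of how such a test could be run (the URI and the "generated_at" field are made up, and I'm assuming the Python requests library just for illustration):

```python
# Rough sketch: sample a published real-time resource a few times and estimate
# its latency (time since the data was generated), to compare against the
# source's stated update frequency and the real-time/near-real-time threshold.
import time
from datetime import datetime, timezone

import requests

def sample_latencies(uri, samples=3, interval_seconds=60):
    latencies = []
    for _ in range(samples):
        record = requests.get(uri, timeout=10).json()
        # Assumes the payload carries an ISO 8601 timestamp with a UTC offset.
        generated = datetime.fromisoformat(record["generated_at"])
        latencies.append((datetime.now(timezone.utc) - generated).total_seconds())
        time.sleep(interval_seconds)
    return latencies

# Example (hypothetical URI):
# sample_latencies("https://data.mycity.example/transit/vehicle-positions/latest")
```

That keeps the test focused on observed timeliness rather than on provenance metadata.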
23. up to date -- The Why text is unclear as to what type of coincidence is desired and what should coincide with what. Similar to the real-time BP, I think the issue here is that the publication frequency should match the release frequency. Anyway, the first sentence is not about why. It belongs in the intended outcome. I suggest the intended outcome say "Data on the web should be updated in a timely manner so that the most recent data available online reflects the most recent data released. When new data is released via any channel, it should be made available on the Web as soon as possible thereafter." We could use a transit example about real-time bus arrival predictions. The first sentence of the test reads like a note to ourselves to write a test. One step is "publish an updated version of data." That is not something one can do whenever a test is needed. More importantly, that test only determines whether there is a difference between two versions of the data. What it should be testing is the timeliness of the most recent data. The test could be to check that the update frequency is stated and that the most recently published copy on the Web is no older than the stated update frequency.

24. use APIs -- The test should say that a test client can simulate calls and the API returns the expected responses. (The test client doesn't simulate the responses.)

25. web standard APIs -- We should have some references for REST. Richardson, L. and Ruby, S., RESTful Web Services, O'Reilly, 2007, http://restfulwebapis.org/rws.html. Fielding, Roy T., "Representational State Transfer (REST)", Chapter 5 of Architectural Styles and the Design of Network-based Software Architectures, Ph.D. Dissertation, University of California, Irvine, 2000, https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm.

26. document APIs -- The intended outcome has a run-on sentence. The last sentence of the outcome starts with "It", which lacks an antecedent. (What is the "it"?) The examples are all spatial data examples. None of them really makes sense in this context. We should probably offer examples for the transport agency. Can we use a test like "time to first successful call"? That would require having volunteers learn to use the API and timing them.

27. avoid breaking APIs -- In the implementation section, "home resource URI" gets used as a plural, but an API should only have one. Remove "home" in the first one and it makes more sense: "...by keeping resource URIs constant...". The bit about announcing changes should go in the outcome section.

28. assess coverage -- This is about archiving, not publishing, and is out of scope. I disagree that the meaning of a resource is dependent on the rest of the global graph. The web was invented to enable links to be added without changing the thing linked to. What the resource itself links to affects its meaning, however. The test is all about developing a score, which I guess is for determining what to archive? That doesn't seem to be about publishing.

29. trusted serialization -- If we keep this, it should at least offer JSON as an acceptable example. JSON is the current overwhelming standard for APIs. This talks about "sending data dumps for long-term preservation" and "data depositors". Where are the data being sent? Is it on the Web? The bad example would pass the How to Test.

30. update identifiers -- It's not quite clear what we are suggesting be linked to what. The Why talks about linking preserved datasets with the original URI.
Are we saying the original URI should continue to point to the preserved dataset? If that's the case, then what does preservation mean? There is also discussion of saving snapshots as versions, which seems to me is covered better under versioning. We say "A link is maintained between the URI of a resource, the most up-to-date description available for it, and preserved descriptions." One link can only join two resources. Should people preserve old descriptions? Maybe descriptions of older versions are what was meant? A 410 status only makes sense if there's nothing served at the URI, which isn't the case if the advice here is followed. 303 seems like a good option.

Feedback intro -- Second paragraph: The word blog is misused. A blog is a web site for one person to serially publish comments, not for many people to enter single comments. Also, I disagree with this sentence: "In order to quantify and analyze usage feedback, it should be recorded in a machine-readable format." I think using automated tools to gather feedback and store it in a searchable way is a good idea, but saying the feedback should be machine readable is misleading and insufficiently specific. If you have succeeded in posting feedback on the web, it is machine readable by definition. It sounds like we are telling people to publish their feedback as another dataset. You may want to store it in a machine-readable way for the purpose of displaying it to other humans, but there's no reason to *publish* the feedback with machines in mind.

31. Gather feedback -- This BP includes recommendations about making feedback public, but that's handled in the next BP. We should keep this BP focused on enabling feedback. The first sentence of the Why needs rewriting. We should remove the word "providing" at the beginning. The BP is about collecting feedback, not providing it. It should address the value of setting up a specific way of collecting feedback (makes it easier for consumers to contribute). Remove the mention of machine-readable formats and using a vocabulary for capturing the semantics of the feedback information. Instead, suggest using an automated feedback system, such as a bug tracker. How to test: the first bullet is a note to us, I guess. The second is partially about the next BP. The third is again treating the feedback data as another published dataset. There's nothing wrong with publishing such a dataset, but that's not the idea here. A real test would be whether a consumer is able to find a way to provide feedback.

32. Publish feedback -- The subtitle should say "Feedback should be available for human users." There is no expectation that feedback be provided as a dataset for consumption by systems outside the publisher. (It takes a bit of trust for a publisher to put the info out for human consumption, and making it processable by external systems can be seen as breaching that trust.) The Why should mention avoiding duplication and being transparent about the quality of the data. The intended outcome is tautological. It should include the idea that consumers should be able to review issues already raised by others, saving them the trouble of filing duplicate bug reports. Publishing feedback also helps consumers understand any issues that may affect their ability to use the data. The implementation section needs to be changed. We should not be telling people that they need to present their feedback in machine-readable form. The test is again about metadata for the feedback as a dataset.
Publishing your feedback as a dataset is not a best practice.

33. enrich data -- I'm concerned that we need to be more explicit about the pitfalls of generating new values. For scientific data, this is a VERY touchy subject. People have lost jobs and scientific careers by doing that. The Why needs a few caveats. "Under some circumstances, missing values can be filled in, and ..." "Publishing more complete datasets can enhance trust, if done properly and ethically." In the intended outcome, "should be enhanced if possible" is too strong. The first paragraph could be "Data that is unstructured should be given structure if possible. Additional derived measures or attributes should be added if they enhance utility. A dataset that has missing values can be enhanced to fill in those values if the addition does not distort analytical results, significance, or statistical power."

37. cite source -- The first line of the example ("You can cite the original...") should replace the text above it ("You can use the Dataset Usage..."). The example citation should list the transit agency as the author: 'Data source: MyCity Transport Agency, "Bus Timetable of MyCity...'

glossary -- The link to CITO under "citation" doesn't go directly to CiTO. It should probably go to http://www.sparontologies.net/ontologies/cito/source.html. The definition of locale needs to mention geographic location. The definition of machine-readable data surprises me. I think proprietary formats are machine readable, too. If we want to steer people away from proprietary formats, we should do that explicitly.

Challenges -- In the diagram, the challenge texts should be similar, either all statements or all questions. Suggestion for the reuse one: "How can I reuse responsibly?" (The current question sounds a little too self-serving.)

--
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
Received on Thursday, 21 April 2016 01:20:16 UTC