Re: Comments and questions about Data Access BP from Annette Greiner on 2016-04-08 (public-dwbp-wg@w3.org from April 2016)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Fri, 8 Apr 2016 14:20:36 -0700
To: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
Cc: "public-dwbp-wg@w3.org" <public-dwbp-wg@w3.org>
Message-ID: <570820A4.3050902@lbl.gov>
Hi Bernadette,
Here is a different example for the subsetting BP.

The MyCity transit agency has been collecting detailed data about 
passenger usage for several years. This is a very large dataset, 
containing values for numbers of passengers by transit type, route, 
vehicle, driver, entry stop, exit stop, transit pass type, entry time, 
etc.  They have found that a wide variety of stakeholders are interested 
in downloading various subsets of the data. The folks who run each 
transit system want only the data for their transit mode, the city 
planners only want the numbers of entries and exits at each stop, the 
city budget office wants only the numbers for the various types of 
passes sold, and others want still different subsets. The agency created 
a web site where users can select which variables are of interest to 
them, set ranges on some variables, and download only the subset they need.
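
To make the selection step concrete, here is a minimal sketch of what the
site might do behind the scenes. Everything in it is an assumption for
illustration: the field names, the sample records, and the subset function
are hypothetical and do not describe any real agency's schema.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical sample records; field names are illustrative only.
RIDES = [
    {"transit_type": "bus", "route": "53", "entry_stop": 12,
     "exit_stop": 30, "pass_type": "monthly", "entry_hour": 8},
    {"transit_type": "tram", "route": "T1", "entry_stop": 4,
     "exit_stop": 9, "pass_type": "single", "entry_hour": 17},
    {"transit_type": "bus", "route": "53", "entry_stop": 12,
     "exit_stop": 18, "pass_type": "single", "entry_hour": 9},
]


def subset(
    rows: Iterable[Dict[str, object]],
    columns: List[str],
    ranges: Optional[Dict[str, Tuple[int, int]]] = None,
) -> Iterator[Dict[str, object]]:
    """Yield only the requested columns, for rows whose values fall
    inside every inclusive (low, high) range given."""
    ranges = ranges or {}
    for row in rows:
        if all(low <= row[name] <= high
               for name, (low, high) in ranges.items()):
            yield {name: row[name] for name in columns}


# A city planner asks for entries and exits during the morning peak.
morning = list(subset(RIDES, ["entry_stop", "exit_stop"],
                      {"entry_hour": (7, 9)}))
```

The budget office could call the same function with ["pass_type"] and no
ranges, so one mechanism serves all of the stakeholders in the example.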

What do you think?

On 4/8/16 5:33 AM, Bernadette Farias Lóscio wrote:
> Hi Annette,
>
> Thank you for your helpful review and comments! I already made some 
> updates, but I still have some comments.
>
>>     1. Introduction
>>
>>     I’m not sure if the following paragraph fits in this section:
>>
>>     On a further note, it can be observed that data on the Web is
>>     essentially about the description of entities identified by a
>>     unique, Web-based, identifier (an URI). Once the data is dumped
>>     and sent to an institute specialised in digital preservation the
>>     link with the Web is broken (dereferencing) but the role of the
>>     URI as a unique identifier still remains. In order to increase
>>     the usability of preserved dataset dumps it is relevant to
>>     maintain a list of these identifiers.
>
>     I agree. I don't think that fits.
>
>
>
> ---> Removed!
>
>>
>>     2. BP 19 Provide bulk download
>>
>>     Should data or datasets be available for bulk download? I think
>>     the BP should refer to datasets instead of data. I think the
>>     meaning of bulk download should be clearer.
>     I think "datasets" is fine, as you suggest.
>
>
> ---> done!
>
>
>>
>>     I don’t understand this phrase: “When Web data is distributed
>>     across many URIs but might logically be organized as one
>>     container, accessing the data in bulk can be useful." Again, I
>>     think the BP should consider datasets instead of data.
>     As I understand it, the idea is that, if you have data that would
>     logically be organized as a dataset but it is spread over multiple
>     endpoints (for example, it's available piecewise through an API or
>     through subsets for download), so that getting a copy of the
>     entire dataset would require multiple requests, that would be a
>     pain in the neck to reassemble as the complete dataset. Since it's
>     referring to the dataset being broken up, "data" makes more sense.
>     Does it help to s/container/dataset/?
>
>
> ---> I see Annette! I think dataset is better than container.
>
>
>>
>>     I’m not sure if I understood the example. Is it one dataset with
>>     multiple CSV files, or multiple datasets, each one with a CSV
>>     distribution? Does the bulk download contain one dataset or
>>     multiple datasets?
>     It's probably best to think of it as one dataset with multiple CSV
>     files. The bulk download contains one dataset. But the definition
>     of a dataset is pretty flexible, and one person's dataset is
>     another person's collection or subset, so the term "dataset" can
>     be confusing in this context.
>
>
> ---> I understand, but I'm not sure if this is clear to the public. 
> Let's keep it like this and try to get some feedback from the 
> community.
>
>>
>>     3. Best Practice 20: Provide Subsets for Large Datasets
>>
>>     In the example, can we use CSV format instead of PDF format?
>     I was trying to keep it realistic, thinking of what transit
>     agencies really do. I suppose we could use CSV, but it would be
>     less realistic. I think PDF is fine in addition to having an API.
>
>
> ---> I think it would be better to use CSV than PDF because we are 
> always talking about machines being able to process the data. In this 
> case, CSV is better, no?
>
>>
>>     Is R-Citable evidence for this BP?
>     Having a separate URI for the subset makes the subset citable.
>
>
> ---> Ok! I agree!
>
>
>>
>>     4. BP 23 Provide data up to date
>>
>>     The description of BP 23 says: “Data must be available in an
>>     up-to-date manner and the update frequency made explicit.” But
>>     the BP doesn’t mention how to make the update frequency
>>     available. I suggest removing “and the update frequency made
>>     explicit” from the description.
>     Yeah, the update frequency often is not predictable. I do like the
>     idea of reporting the frequency when it is known. If we don't have
>     a recommendation about how to do that, I think we can still
>     suggest that people do it.
>     It looks like DCAT found a way of doing that in machine-readable
>     form [1], though the link resolves to a page that doesn't look
>     very official. If nothing else, one can include a textual
>     statement in the documentation.
>
>
> ---> I think we should rewrite this BP to make this more explicit.
>
>>
>>     5. BP 25: Use Web Standards as the foundation of your API
>>     Is it possible to rewrite the description of the BP to make the
>>     text shorter? In general, BP descriptions are one or two lines.
>>
>     I agree it's awfully long. I'd suggest
>     "When designing APIs, use an architectural style that is founded
>     on the technologies of the Web itself."
>
>     If some people insist that we need to list the technologies, we
>     could say
>     "When designing APIs, use an architectural style that is founded
>     on the technologies of the Web itself, such as URIs, HTTP verbs,
>     HTTP response codes, MIME types, typed HTTP Links, and content
>     negotiation."
>
> ---> I used the first one!
>
>>     I’m not sure if the example is suitable for this BP. Maybe the
>>     example needs a better explanation or the BP needs a better
>>     example :)
>     That example shows what makes a hypermedia API a hypermedia API. I
>     would want to keep that but maybe add an example for REST more
>     generally. It's difficult for me to think of a way to show an
>     example of a REST API, though, other than linking to one (possibly
>     https://w3c.github.io/w3c-api/). Or do we want to build and host a
>     little example REST API for the transit agency?
>
>
> ---> It would be great if we could build a little example REST API for 
> the transit agency. Is it possible?
>
>>
>>     The same goes for the How to test section: “Check that the service
>>     avoids using http as a tunnel for calls to custom methods, and
>>     check that URIs do not contain method names”. I don’t see how
>>     this is a test about using Web standards.
>     The way to implement a nonstandard architecture on the web is to
>     hide it within standard calls. Using http as a tunnel for custom
>     methods rather than using http itself is symptomatic of not using
>     http for anything other than a transport mechanism. URIs that
>     contain method names are a dead giveaway that one is inventing new
>     methods rather than using http verbs and URIs.
>
>
> ---> Ok Annette! Thanks a lot for the explanation. I'm learning a lot 
> about APIs :)
>
>>
>>     6. BP 26: Provide complete documentation for your API
>>
>>     It would be better if the example for this BP were related
>>     to the bus stops example.
>
>     I agree. Maybe we need to implement an example transit API doc
>     site in Swagger or something. If we want an equally nice example
>     as the pet store one, that's not trivial.
>
>
> ---> Again, it would be great to have an example for the Transit 
> Agency. Let's see if we can work on that.
>
>>
>>     I think the following phrases should be on the approach to
>>     implementation and not on the how to test section: “The quality
>>     of documentation is also related to usage and feedback from
>>     developers. Try to get constant feedback from your users about
>>     the documentation."
>     I agree.
>
>
> ---> ok! moved!
>
>>
>>     7. BP 27 Avoid Breaking Changes to Your API
>>
>>     The how to test section seems more like an approach to
>>     implementation than a test. Is it possible to rewrite it?
>     I disagree. The bit about testing shows how to test that changes
>     to the API do not break it, which is not the same as showing how
>     to implement changes to the API. It is literally how to test it.
>
>
> Ok! Now I understand and I agree with you! Let's keep the original test ;)
>
>>
>>     It would be great to have an example that also uses the bus stop
>>     dataset. Maybe the example of BP 27 can be related with the
>>     example of BP 26.
>     Maybe we could add something like this:
>
>     Suppose the MyCity transit agency's API serves requests for
>     a certain bus's arrival time at a single stop at a URI like
>     http://api.mycitytransit.example.org/arrivals/buses/53/stop/12,
>     but the agency decides it wants to make it possible to query for a
>     range of stops at once. Rather than change the form of the request
>     to require a range, like
>     http://api.mycitytransit.example.org/arrivals/buses/53/stop/12-12,
>     the agency can keep the old API call and add a new one for
>     multiple arrivals, like
>     http://api.mycitytransit.example.org/arrivals/buses/53/stops/1-12.
>
>
> Nice! Example added!
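
If it helps readers, the non-breaking change in the arrivals example quoted
above could be sketched roughly as follows. This is only an illustration
under assumptions: the handle function, the in-memory ARRIVALS table, and
the response shapes are all hypothetical; only the two URI paths come from
the example itself.

```python
import re

# Hypothetical in-memory data: (bus, stop number) -> arrival time.
ARRIVALS = {("53", 1): "08:02", ("53", 12): "08:31"}

# Original call, kept unchanged so existing clients keep working.
SINGLE_STOP = re.compile(r"^/arrivals/buses/(\w+)/stop/(\d+)$")
# New call added alongside it for a range of stops.
STOP_RANGE = re.compile(r"^/arrivals/buses/(\w+)/stops/(\d+)-(\d+)$")


def handle(path):
    """Return a response dict for a known path, or None otherwise."""
    match = SINGLE_STOP.match(path)
    if match:
        bus, stop = match.group(1), int(match.group(2))
        return {"bus": bus, "stop": stop,
                "arrival": ARRIVALS.get((bus, stop))}
    match = STOP_RANGE.match(path)
    if match:
        bus = match.group(1)
        first, last = int(match.group(2)), int(match.group(3))
        return {"bus": bus,
                "arrivals": {stop: ARRIVALS[(bus, stop)]
                             for stop in range(first, last + 1)
                             if (bus, stop) in ARRIVALS}}
    return None


old_style = handle("/arrivals/buses/53/stop/12")
new_style = handle("/arrivals/buses/53/stops/1-12")
```

The point is that the old path is matched exactly as before, and the range
query lives under a separate path rather than changing what the old one
accepts.
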
>
> Just summarizing, let's see if:
> - we can improve the BP Provide data up to date
> - we can add examples for BP 25 and BP 26 using the transit agency example
>
> Thanks a lot!
> Berna
>
>>
>>     Thanks a lot!
>>     Bernadette
>>
>>
>>     -- 
>>     Bernadette Farias Lóscio
>>     Centro de Informática
>>     Universidade Federal de Pernambuco - UFPE, Brazil
>>     ----------------------------------------------------------------------------
>     [1] https://www.w3.org/TR/vocab-dcat/ " In order to express
>     frequency of update in the example above, we chose to use an
>     instance from the Content-Oriented Guidelines
>     <http://www.w3.org/TR/vocab-data-cube/#dsd-cog> developed as part
>     of the W3C Data Cube Vocabulary efforts."
>
>     -- 
>     Annette Greiner
>     NERSC Data and Analytics Services
>     Lawrence Berkeley National Laboratory
>
>
>
>
> -- 
> Bernadette Farias Lóscio
> Centro de Informática
> Universidade Federal de Pernambuco - UFPE, Brazil
> ----------------------------------------------------------------------------

-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
Received on Friday, 8 April 2016 21:21:11 UTC