- From: Annette Greiner <amgreiner@lbl.gov>
- Date: Tue, 26 Apr 2016 13:32:32 -0700
- To: Bernadette Farias Lóscio <bfl@cin.ufpe.br>, "public-dwbp-wg@w3.org" <public-dwbp-wg@w3.org>
- Message-ID: <571FD060.3050708@lbl.gov>
Hi Berna,

Here are my notes below. Thanks for your efforts!
-Annette

On 4/26/16 8:10 AM, Bernadette Farias Lóscio wrote:
> Hi Annette,
>
> Thanks a lot for your feedback and the great discussions about the
> DWBP document!
>
> We already resolved a lot of your comments [3], but we still have some
> to discuss. We'd like to ask you to take a look in the following
> comments [3] and tell us if you agree with our proposals described below:
>
> 23 (Introduction):
>
> Phil made the native-speaker review. Phenomenon was removed. We
> propose to keep the examples [1].

We need to use examples that are examples of the thing we are talking about, which is the expansion of the Web as a medium for the exchange of data. These examples don't represent use of the web per se, though they are things that could drive more usage of the web, if people decided to do that. The worst offender in this regard is "the provision of important cultural heritage collections". Important cultural heritage collections have been around for millennia. That only works as an example if it refers to putting those collections on the web.

A few grammatical edits:

In paragraph 3, " how to represent, describe and make data available" - parallelism is off
We need a comma after " For more details about the challenges"
In paragraph 4, change " &applications" to "and applications"
change "among the users of these communities" to "among the users in these communities"
change " domain & application independent" to " domain and application independent"
change " Whilst DWBP recommends the use of Linked Data," to " While DWBP recommends the use of Linked Data,"
Last paragraph, " benefits were set:" is awkward. How about "we delineated a list of benefits, including comprehension, ..." (no semicolons).

>
> 27 (Context): Eric helped us to rewrite the diagram description:
>
> The following is a composite diagram illustrating the anatomy of a
> published and acessible Web dataset.
> Data values correspond to the
> data itself and may be available in one or more distributions, which
> should be defined by the publisher considering data consumer's
> expectations. The Metadata component corresponds to the additional
> information that describes the dataset and dataset distributions,
> helping consumers manipulate and reuse the data. In order to allow
> easy access to the dataset and its corresponding distributions,
> multiple dataset access mechanisms can be available. Finally, to
> promote the interoperability among datasets it is important to adopt
> data vocabularies and standards.
>

Eric's description is very helpful in understanding the right side of the figure, and I think the right-hand side is helpful, but the left-hand side is still not working for me. The colored rectangles are very abstract concepts, and representing them in this way doesn't make them less abstract. Also, if you inserted the details of the distributions into the dataset, you would have metadata represented at two different levels. It's not clear to me why that choice was made, but it seems to suggest that there is metadata for the dataset that isn't to be included in the distributions. It also appears that the concept of a dataset only exists before it is distributed. Is the left side about storage of the data? If so, then the colored rectangles make little sense being there. I think the goal of the diagram was to explain the relationship between datasets, distributions, data, and metadata. If it concentrated on those elements, it would be more useful.

> 39 (Sensitive data):
>
> How to test section:
>
> Check if the dataset includes references to other data that is
> unavailable in a human-readable way.
>

It doesn't strictly make sense to say that something is "unavailable in a human-readable way". Also, what is a pass for this test?
Maybe this is the idea: "Where the dataset includes references to data that is no longer available or is not available to all users, check that an explanation of what is missing and instructions for obtaining access (if possible) are given."

> Check if a legitimate http response code in the 400 or 500 range is
> returned when trying to get unavailable data.
>

good

> 56 (Subsets for Large Datasets)
>
> We'd like to ask you to rewrite the intended outcome because it should
> be about "What it should be possible to do when a data publisher
> follows the Best Practice". It would be better to not have very long
> intended outcomes.

How about "Both human users and applications should be able to access subsets of a dataset, rather than the entire thing, as needed. Available subsets should maximize the ratio of needed data to unneeded data in responses to consumer requests. Static file downloads should be kept to reasonable download times, and APIs should return results of appropriate granularity to suit the domain and Web application performance."

>
> 64 (feedback):
>
> Remove the sentence from the introduction: "In order to quantify and
> analyze usage feedback, it should be recorded in a machine-readable
> format. "
>

good

> 65 (feedback):
>
> Why:
>
> "Giving feedback to data publishers contributes to improving the
> quality of published data, may encourage publication of new data, ..."

No, "giving" has the same issue as "providing". We should be explaining why the publisher wants to get feedback, not why the user wants to supply it. How about this: "Obtaining feedback helps publishers understand the needs of their data consumers and can help them improve the quality of their published data. It also enhances trust by showing consumers that the publisher cares about addressing their needs. Specifying a clear feedback mechanism removes the barrier of having to search for a way to provide feedback."
>
> Approach to implementation:
>
> "Provide data consumers with one or more feedback mechanisms
> including, but not limited to: a registration form, contact form,
> point and click data quality rating buttons, or a comment box for
> blogging.

No colon here, that should be a comma. A registration form is not a good feedback means, because that would require one to register or even re-register to provide feedback. The word "blogging" is still misused here. Filling in a comment is not blogging.

>
> In order to quantify and analyze feedback received from consumers,
> store feedback in machine-readable. The Dataset Usage Vocabulary
> [VOCAB-DUV <http://w3c.github.io/dwbp/bp.html#bib-VOCAB-DUV>] is
> desigend specifically for this purpose.
>

I don't think that one would use DUV for the storage. Wouldn't one put it in a database and then, when there is some need to express it with the DUV, add the DUV vocabulary to how it is presented? It wouldn't be efficient to store all the DUV terms in the database. If I understand it correctly, DUV doesn't provide semantic markup for the feedback itself, just expressing the motivation and that it is feedback on the particular dataset. To be DUV friendly, then, really you just need to remember to also store the motivation (either editing, classifying [rating] commenting or questioning).

How about "In order to make the most of feedback received from consumers, it's a good idea to collect the feedback with a tracking system that captures each item in a database, enabling quantification and analysis. It is also a good idea to capture the type of each item of feedback, i.e., its motivation (editing, classifying [rating], commenting or questioning), so that each item can be expressed using the Dataset Usage Vocabulary."

>
> How to test:
>
> Check if there is at least one feedback mechanism available for data
> consumers.
>

Well, it has to be discoverable.
How about "Check that at least one feedback mechanism is provided and readily discoverable by data consumers."

>
> 67 (data enrichment):
>
> Why:
>
> Enrichment can greatly enhance processability, particularly for
> unstructured data. Under some circumstances, missing values can be
> filled in, and new attributes and measures can be added. Publishing
> more complete datasets can enhance trust, if done properly and
> ethically. Deriving additional values that are of general utility
> saves users time and encourages more kinds of reuse. There are many
> intelligent techniques that can be used to enrich data, making the
> dataset an even more valuable asset.
>
> Intended Outcome:
>
> We'd like to ask you to rewrite the intended outcome because it should
> be about "What it should be possible to do when a data publisher
> follows the Best Practice". It would be better to not have very long
> intended outcomes.
>

How about "Data that is unstructured should be given structure if possible. In structured data, missing values should be added if they enhance utility, but only if the addition does not distort analytical results, significance, or statistical power. Values generated by inference-based techniques should be labeled as such, and it should be possible to retrieve any original values replaced by enrichment. Whenever licensing permits, the code used to enrich the data should be made available along with the dataset."

>
> 68 (glossary)
>
> Locale parameters: A locale is a set of parameters that defines
> specific data aspects, such as language and formatting used for
> numeric values, dates and geographic locations.
>

No, a locale is not a set of parameters. How about "Locale parameters: A set of parameters that clarifies aspects of the data that may be interpreted differently in different geographic locations, such as language and formatting used for numeric values or dates."
>
> Machine-readable: A format in a standard computer language (not
> natural language text) that can be read automatically by a computer
> system. Traditional word processing documents and portable document
> format (PDF) files are easily read by humans but typically are
> difficult for machines to interpret. Formats such as XML, JSON,
> NetCDF, RDF or spreadsheets with header columns that can be exported
> as CSV are machine readable formats.
>
> This definition of machine-readable was proposed by Phil and it is
> from [2].
>

I disagree with the word "language" here, as a computer language usually refers to a programming language, like C++ or Java. How about "Machine-readable data: Data in a standard format that can be read and processed automatically by a computing system. Traditional word processing documents and portable document format (PDF) files are easily read by humans but typically are difficult for machines to interpret and manipulate. Formats such as XML, JSON, HDF5, RDF and CSV are machine-readable data formats."

>
> 69 (license):
>
> Could you contact Renato Ianella? Do you have any updates about this
> comment?
>

I think I understand what Renato is after. He is pointing out that for ODRL, they pretty much avoided using the word "license" altogether. For the verb, they use "grantUse" (though I don't think we have the option of using that term in our text, since it's not standard English on either side of the Atlantic), and for the noun they use "agreement". I'm sure there are many (of the 66) places in our text where "agreement" would work. We could read through and look for opportunities to substitute "agreement" for the noun "license". We would still have to use "license" for the verb and for the noun in places where "agreement" didn't provide enough context.

>
> ------------------------
>
> Comments 61, 62 and 63 were not addressed yet :( We need help from the
> group to resolve them.
>
>
> kind regards,
> BP Editors
>
> [1] http://w3c.github.io/dwbp/bp.html#intro
> [2] https://en.wikipedia.org/wiki/Machine-readable_data
> [3]
> https://www.w3.org/2013/dwbp/wiki/Comments_to_be_considered_before_publishing_the_last_working_draft
> --
> Bernadette Farias Lóscio
> Centro de Informática
> Universidade Federal de Pernambuco - UFPE, Brazil
> ----------------------------------------------------------------------------

--
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
Received on Tuesday, 26 April 2016 20:33:06 UTC