Re: Proposals for Annette's comments to be considered before publishing the last working draft from Annette Greiner on 2016-04-26 (public-dwbp-wg@w3.org from April 2016)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Tue, 26 Apr 2016 13:32:32 -0700
To: Bernadette Farias Lóscio <bfl@cin.ufpe.br>, "public-dwbp-wg@w3.org" <public-dwbp-wg@w3.org>
Message-ID: <571FD060.3050708@lbl.gov>

Hi Berna,
Here are my notes below. Thanks for your efforts!
-Annette

On 4/26/16 8:10 AM, Bernadette Farias Lóscio wrote:
> Hi Annette,
>
> Thanks a lot for your feedback and the great discussions about the 
> DWBP document!
>
> We already resolved a lot of your comments [3], but we still have some 
> to discuss. We'd like to ask you to take a look at the following 
> comments [3] and tell us if you agree with our proposals described below:
>
> 23 (Introduction):
>
> Phil made the native-speaker review. Phenomenon was removed. We 
> propose to keep the examples [1].
We need to use examples that are examples of the thing we are talking 
about, which is the expansion of the Web as a medium for the exchange of 
data. These examples don't represent use of the web per se, though they 
are things that could drive more usage of the web, if people decided to 
do that. The worst offender in this regard is "the provision of 
important cultural heritage collections". Important cultural heritage 
collections have been around for millennia. That only works as an 
example if it refers to putting those collections on the web.

A few grammatical edits:
In paragraph 3, " how to represent, describe and make data available" - 
parallelism is off
We need a comma after " For more details about the challenges"
In paragraph 4, change " &applications" to "and applications"
change "among the users of these communities" to "among the users in 
these communities"
change " domain & application independent" to " domain and application 
independent"
change " Whilst DWBP recommends the use of Linked Data," to " While DWBP 
recommends the use of Linked Data,"
Last paragraph, " benefits were set:" is awkward. How about "we 
delineated a list of benefits, including comprehension, ..." (no 
semicolons).
>
> 27 (Context): Eric helped us to rewrite the diagram description:
>
> The following is a composite diagram illustrating the anatomy of a 
> published and accessible Web dataset. Data values correspond to the 
> data itself and may be available in one or more distributions, which 
> should be defined by the publisher considering data consumer's 
> expectations. The Metadata component corresponds to the additional 
> information that describes the dataset and dataset distributions, 
> helping consumers manipulate and reuse the data. In order to allow 
> easy access to the dataset and its corresponding distributions, 
> multiple dataset access mechanisms can be available. Finally, to 
> promote the interoperability among datasets it is important to adopt 
> data vocabularies and standards.
>
Eric's description is very helpful in understanding the right side of 
the figure, and I think the right-hand side is helpful, but the 
left-hand side is still not working for me.  The colored rectangles are 
very abstract concepts, and representing them in this way doesn't make 
them less abstract. Also, if you inserted the details of the 
distributions into the dataset, you would have metadata represented at 
two different levels. It's not clear to me why that choice was made, but 
it seems to suggest that there is metadata for the dataset that isn't to 
be included in the distributions. It also appears that the concept of a 
dataset only exists before it is distributed. Is the left side about 
storage of the data? If so, then the colored rectangles make little 
sense being there. I think the goal of the diagram was to explain the 
relationship between datasets, distributions, data, and metadata. If it 
concentrated on those elements, it would be more useful.
> 39 (Sensitive data):
>
> How to test section:
>
> Check if the dataset includes references to other data that is 
> unavailable in a human-readable way.
>
It doesn't strictly make sense to say that something is "unavailable in 
a human-readable way". Also, what is a pass for this test? Maybe this is 
the idea: "Where the dataset includes references to data that is no 
longer available or is not available to all users, check that an 
explanation of what is missing and instructions for obtaining access (if 
possible) are given."
> Check if a legitimate http response code in the 400 or 500 range is 
> returned when trying to get unavailable data.
>
good
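
As a rough illustration of this second test, here is a minimal Python sketch. The helper names `is_error_status` and `status_for` are hypothetical and are not part of the DWBP document; the sketch only assumes that "legitimate" means any status code from 400 through 599.

```python
import urllib.error
import urllib.request


def is_error_status(status: int) -> bool:
    """Return True when the status code is in the 4xx or 5xx range."""
    return 400 <= status <= 599


def status_for(url: str) -> int:
    """Return the HTTP status code of a GET request to `url`.

    urllib raises HTTPError for 4xx/5xx responses, so the code is
    read from the exception in that case.
    """
    try:
        with urllib.request.urlopen(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

A tester could then call `is_error_status(status_for(url))` on the URL of a resource known to be unavailable and treat a True result as a pass.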
> 56 (Subsets for Large Datasets)
>
> We'd like to ask you to rewrite the intended outcome because it should 
> be about "What it should be possible to do when a data publisher 
> follows the Best Practice". It would be better to not have very long 
> intended outcomes.
How about "Both human users and applications should be able to access 
subsets of a dataset, rather than the entire thing, as needed. Available 
subsets should maximize the ratio of needed data to unneeded data in 
responses to consumer requests. Static file downloads should be kept to 
reasonable download times, and APIs should return results of appropriate 
granularity to suit the domain and Web application performance."
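
To make the idea of a subset concrete, here is a minimal Python sketch of how an API might return only the matching rows, one page at a time. The function name `subset`, its parameters, and the default page size of 100 are all assumptions made for illustration, not anything the best practice prescribes.

```python
from typing import Callable, Dict, List


def subset(
    rows: List[Dict],
    predicate: Callable[[Dict], bool],
    offset: int = 0,
    limit: int = 100,
) -> List[Dict]:
    """Return at most `limit` rows that satisfy `predicate`, starting at `offset`.

    Filtering first keeps unneeded data out of the response; paging
    keeps each response small enough to return quickly.
    """
    matching = [row for row in rows if predicate(row)]
    return matching[offset:offset + limit]
```

For example, `subset(rows, lambda r: r["year"] >= 2005, limit=50)` would return the first 50 rows from 2005 onward instead of the whole dataset.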
>
> 64 (feedback):
>
> Remove the sentence from the introduction: "In order to quantify and 
> analyze usage feedback, it should be recorded in a machine-readable 
> format. "
>
good

> 65 (feedback):
>
> Why:
>
> "Giving feedback to data publishers contributes to improving the 
> quality of published data, may encourage publication of new data, ..."
No, "giving" has the same issue as "providing". We should be explaining 
why the publisher wants to get feedback, not why the user wants to 
supply it. How about this:
"Obtaining feedback helps publishers understand the needs of their data 
consumers and can help them improve the quality of their published data. 
It also enhances trust by showing consumers that the publisher cares 
about addressing their needs. Specifying a clear feedback mechanism 
removes the barrier of having to search for a way to provide feedback."
>
> Approach to implementation:
>
> "Provide data consumers with one or more feedback mechanisms 
> including, but not limited to: a registration form, contact form, 
> point and click data quality rating buttons, or a comment box for 
> blogging.
No colon here; that should be a comma. A registration form is not a good 
means of feedback, because that would require one to register or even 
re-register to provide feedback. The word "blogging" is still misused 
here. Filling in a comment is not blogging.
>
> In order to quantify and analyze feedback received from consumers, 
> store feedback in a machine-readable format. The Dataset Usage Vocabulary 
> [VOCAB-DUV <http://w3c.github.io/dwbp/bp.html#bib-VOCAB-DUV>] is 
> designed specifically for this purpose.
>
I don't think that one would use DUV for the storage. Wouldn't one put 
it in a database and then, when there is some need to express it with 
the DUV, add the DUV vocabulary to how it is presented? It wouldn't be 
efficient to store all the DUV terms in the database. If I understand it 
correctly, DUV doesn't provide semantic markup for the feedback itself, 
just expressing the motivation and that it is feedback on the particular 
dataset. To be DUV friendly, then, really you just need to remember to 
also store the motivation (either editing, classifying [rating], 
commenting or questioning).
How about
"In order to make the most of feedback received from consumers, it's a 
good idea to collect the feedback with a tracking system that captures 
each item in a database, enabling quantification and analysis. It is 
also a good idea to capture the type of each item of feedback, i.e., its 
motivation (editing, classifying [rating], commenting or questioning), 
so that each item can be expressed using the Dataset Usage Vocabulary."
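
A minimal Python sketch of such a tracking store follows, assuming an in-memory SQLite database. The table layout and the function names (`open_store`, `record_feedback`, `counts_by_motivation`) are hypothetical; only the four motivation values come from the text above.

```python
import sqlite3

# The four motivations named above; stored so items can later be
# expressed with the Dataset Usage Vocabulary.
MOTIVATIONS = {"editing", "classifying", "commenting", "questioning"}


def open_store() -> sqlite3.Connection:
    """Create an in-memory feedback store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feedback ("
        "id INTEGER PRIMARY KEY, "
        "dataset TEXT NOT NULL, "
        "motivation TEXT NOT NULL, "
        "body TEXT NOT NULL)"
    )
    return conn


def record_feedback(conn: sqlite3.Connection, dataset: str,
                    motivation: str, body: str) -> None:
    """Store one feedback item, rejecting unknown motivations."""
    if motivation not in MOTIVATIONS:
        raise ValueError(f"unknown motivation: {motivation!r}")
    conn.execute(
        "INSERT INTO feedback (dataset, motivation, body) VALUES (?, ?, ?)",
        (dataset, motivation, body),
    )


def counts_by_motivation(conn: sqlite3.Connection, dataset: str) -> dict:
    """Count the feedback items for one dataset, grouped by motivation."""
    rows = conn.execute(
        "SELECT motivation, COUNT(*) FROM feedback "
        "WHERE dataset = ? GROUP BY motivation",
        (dataset,),
    )
    return dict(rows.fetchall())
```

Keeping the motivation as a plain column is enough for quantification and analysis; the DUV terms would only be added when the feedback is published.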
>
> How to test:
>
> Check if there is at least one feedback mechanism available for data 
> consumers.
>
Well, it has to be discoverable. How about
"Check that at least one feedback mechanism is provided and readily 
discoverable by data consumers."
>
> 67 (data enrichment):
>
> Why:
>
> Enrichment can greatly enhance processability, particularly for 
> unstructured data. Under some circumstances, missing values can be 
> filled in, and new attributes and measures can be added. Publishing 
> more complete datasets can enhance trust, if done properly and 
> ethically. Deriving additional values that are of general utility 
> saves users time and encourages more kinds of reuse. There are many 
> intelligent techniques that can be used to enrich data, making the 
> dataset an even more valuable asset.
>
> Intended Outcome:
>
> We'd like to ask you to rewrite the intended outcome because it should 
> be about "What it should be possible to do when a data publisher 
> follows the Best Practice". It would be better to not have very long 
> intended outcomes.
>
How about
"Data that is unstructured should be given structure if possible. In 
structured data, missing values should be added if they enhance utility, 
but only if the addition does not distort analytical results, 
significance, or statistical power. Values generated by inference-based 
techniques should be labeled as such, and it should be possible to 
retrieve any original values replaced by enrichment. Whenever licensing 
permits, the code used to enrich the data should be made available along 
with the dataset."
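
The labeling and recoverability requirements can be sketched in Python as follows. Mean imputation is used here purely as an assumed stand-in for whatever inference technique a publisher chooses, and it is exactly the kind of addition that may distort analytical results. The function name `enrich` and the `_original` and `_inferred` column suffixes are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def enrich(rows: List[Dict], field: str) -> List[Dict]:
    """Fill missing values of `field` with the mean of the known values.

    Each returned row keeps the original value under `<field>_original`
    and carries a `<field>_inferred` flag, so inferred values are
    labeled and the originals stay retrievable.
    Raises statistics.StatisticsError if no value of `field` is known.
    """
    known = [row[field] for row in rows if row[field] is not None]
    fill_value = mean(known)
    enriched = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        new_row[f"{field}_original"] = row[field]
        new_row[f"{field}_inferred"] = row[field] is None
        if row[field] is None:
            new_row[field] = fill_value
        enriched.append(new_row)
    return enriched
```

Publishing a function like this alongside the dataset, where licensing permits, lets consumers see how the added values were produced.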
>
> 68 (glossary)
>
> Locale parameters: A locale is a set of parameters that defines 
> specific data aspects, such as language and formatting used for 
> numeric values, dates and geographic locations.
>

No, a locale is not a set of parameters. How about
"Locale parameters: A set of parameters that clarifies aspects of the 
data that may be interpreted differently in different geographic 
locations, such as language and formatting used for numeric values or 
dates."
>
> Machine-readable: A format in a standard computer language (not 
> natural language text) that can be read automatically by a computer 
> system. Traditional word processing documents and portable document 
> format (PDF) files are easily read by humans but typically are 
> difficult for machines to interpret. Formats such as XML, JSON, 
> NetCDF, RDF or spreadsheets with header columns that can be exported 
> as CSV are machine readable formats.
>
> This definition of machine-readable was proposed by Phil and it is 
> from [2].
>
I disagree with the word "language" here, as a computer language usually 
refers to a programming language, like C++ or Java.

How about
"Machine-readable data: Data in a standard format that can be read and 
processed automatically by a computing system. Traditional word 
processing documents and portable document format (PDF) files are easily 
read by humans but typically are difficult for machines to interpret and 
manipulate. Formats such as XML, JSON, HDF5, RDF and CSV are 
machine-readable data formats."
>
> 69 (license):
>
> Could you contact Renato Iannella? Do you have any updates about this 
> comment?
>
I think I understand what Renato is after. He is pointing out that for 
ODRL, they pretty much avoided using the word "license" altogether. For 
the verb, they use "grantUse" (though I don't think we have the option 
of using that term in our text, since it's not standard English on 
either side of the Atlantic), and for the noun they use "agreement". I'm 
sure there are many (of the 66) places in our text where "agreement" 
would work. We could read through and look for opportunities to 
substitute "agreement" for the noun "license". We would still have to 
use "license" for the verb and for the noun in places where "agreement" 
didn't provide enough context.
>
> ------------------------
>
> Comments 61, 62 and 63 were not addressed yet :( We need help from the 
> group to resolve them.
>
>
> kind regards,
> BP Editors
>
> [1] http://w3c.github.io/dwbp/bp.html#intro
> [2] https://en.wikipedia.org/wiki/Machine-readable_data
> [3] 
> https://www.w3.org/2013/dwbp/wiki/Comments_to_be_considered_before_publishing_the_last_working_draft
> -- 
> Bernadette Farias Lóscio
> Centro de Informática
> Universidade Federal de Pernambuco - UFPE, Brazil
> ----------------------------------------------------------------------------

-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
Received on Tuesday, 26 April 2016 20:33:06 UTC