More Comments on the 9 April version of the BP doc from Phil Archer on 2016-04-19 (public-dwbp-wg@w3.org from April 2016)

From: Phil Archer <phila@w3.org>
Date: Tue, 19 Apr 2016 14:29:50 +0100
To: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
Cc: Public DWBP WG <public-dwbp-wg@w3.org>
Message-ID: <571632CE.2050505@w3.org>
And I realised I've not sent these yet...


#GatherFeedback
===============

I suggest the intended outcome can be:

Publishers will receive feedback about their data.

In a couple of places, this BP talks about registration. That may be OK 
in some circumstances but if you have to register then the data isn't 
open. That doesn't make it out of scope for us, but registration is an 
anti-pattern for shared data.

This sentence:
<p>Collect feedback in machine-readable formats to represent the 
feedback and use a vocabulary to capture the semantics of the feedback 
information.</p>

Looks like a good place to cite the DUV, as in:

<p>Collect feedback in machine-readable formats to represent the 
feedback and use a vocabulary to capture the semantics of the feedback 
information. The Dataset Usage Vocabulary [[VOCAB-DUV]] provides a 
mechanism to expose feedback in an ordered way.</p>

I know this is mentioned in the next BP but it won't hurt to mention it 
again here.
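
To make that concrete, here is a minimal sketch of a machine-readable feedback record. It assumes JSON-LD as the serialization and the draft DUV class duv:UserFeedback combined with the Web Annotation terms oa:hasTarget, oa:hasBody and oa:TextualBody. The dataset URI, the comment text and the helper name make_feedback are made up for illustration, and the term names should be checked against the current DUV draft before anyone relies on them.

```python
import json

# Prefixes for the vocabularies used below (DUV, Web Annotation, RDF).
CONTEXT = {
    "duv": "http://www.w3.org/ns/duv#",
    "oa": "http://www.w3.org/ns/oa#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def make_feedback(dataset_uri, comment):
    """Build one feedback record as a JSON-LD dict (hypothetical helper)."""
    return {
        "@context": CONTEXT,
        "@type": "duv:UserFeedback",
        # The dataset the feedback is about.
        "oa:hasTarget": {"@id": dataset_uri},
        # The free-text comment left by the data consumer.
        "oa:hasBody": {"@type": "oa:TextualBody", "rdf:value": comment},
    }


# Example values are invented; example.org is a placeholder domain.
feedback = make_feedback(
    "http://example.org/dataset/bus-stops",
    "Stop 42 has the wrong coordinates.",
)
print(json.dumps(feedback, indent=2))
```

Because the record is plain JSON, a publisher could store it, aggregate it or republish it without re-keying anything by hand.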

#FeedbackInformation
====================

In sentences like "Making feedback about datasets and distributions 
publicly available..." I don't think we need to mention distributions. 
It's enough just to say "Making feedback about datasets publicly 
available..."

A lot of that paragraph repeats what was in the intro material for this 
section. Therefore I suggest it can be removed from the intro.

Intended outcome could be:

Enable the sharing of feedback given by different data consumers.


#EnrichData
===========
Again, there's a lot of info in the intended outcome that could go in 
the main text. I'd do it this way:

   <p class="subhead">Why</p>
   <p>Enrichment can greatly enhance processability, particularly for 
unstructured data. Missing values can be filled in, and new attributes 
and measures can be added. Publishing more complete datasets enhances 
trust. Deriving additional values that are of general utility saves 
users time and encourages more kinds of reuse. There are many 
intelligent techniques that can be used to enrich data, making the 
dataset an even more valuable asset.</p>
<p>A dataset that has missing values is enhanced if it is possible to 
fill in those values. Additional relevant measures or attributes should 
be added if they enhance utility. Unstructured data can be given 
structure in this way as well.</p>
<p>Because inference-based enrichment may introduce errors into the 
data, values generated by such techniques should be labeled as such, and 
it should be possible to retrieve any original values replaced by 
enrichment.</p>
<p>Whenever licensing permits, the code used to enrich the data should 
be made available along with the dataset. Sharing such code is 
particularly important for scientific data. </p>
</section>
<section class="outcome">
   <p class="subhead">Intended Outcome</p>
   <p>Increased value, utility and reuse.</p>
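
As a rough illustration of the labeling point in the text above, the following sketch fills in missing values and keeps both a flag and the original value next to each result. Mean imputation is only a stand-in for whatever enrichment technique is actually used, and the function name, the field names and the "_original" / "_inferred" suffixes are assumptions, not anything the BP doc prescribes.

```python
from statistics import mean


def enrich_missing(records, field):
    """Fill missing values of `field` with the mean of the known values.

    Returns new records. Each one keeps the original value and a flag
    saying whether the published value was inferred.
    Raises statistics.StatisticsError if no value is known at all.
    """
    known = [r[field] for r in records if r[field] is not None]
    fill = mean(known)
    enriched = []
    for r in records:
        inferred = r[field] is None
        enriched.append({
            **r,
            field: fill if inferred else r[field],
            field + "_original": r[field],   # lets consumers undo the enrichment
            field + "_inferred": inferred,   # labels generated values as such
        })
    return enriched


# Invented sample data: station B has no temperature reading.
rows = [
    {"station": "A", "temp": 10.0},
    {"station": "B", "temp": None},
    {"station": "C", "temp": 14.0},
]
result = enrich_missing(rows, "temp")
```

In this layout a consumer who distrusts the inference can simply filter on the flag, or fall back to the "_original" column.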



#ProvideComplementaryPresentations
==================================

Intended outcome:

To provide humans with immediate insight into the data.

#ProvideFeedbackToPublisher
===========================

My version of the intended outcome:


Help original publishers to determine how the data they post is being 
used, which in turn helps them justify publishing the data. Make 
publishers aware of the steps they can take to improve their data, thus 
leading to more and better data for everyone.

And this BP definitely needs a reference to the DUV!!

I would replace this:
<p>When you begin using a dataset in a new product, make a note of the 
publisher’s contact information, the URI of the dataset you used, and 
the date on which you contacted them. This can be done in comments 
within your code where the dataset is used. Follow the publisher’s 
preferred route to provide feedback. If they do not provide a route, 
look for contact information for the Web site hosting the data.</p>

With

<p>Follow the publisher’s preferred route to provide feedback. If they 
do not provide a route, publish information about your data usage using 
the Dataset Usage Vocabulary [[VOCAB-DUV]]. As a minimum, look for 
contact information for the Web site hosting the data.</p>


#FollowLicensingTerms
=====================

I would add to the implementation section:

<p>At the time of writing, work is under way at W3C to develop a <a 
href="/2016/poe/">Permissions and Obligations Expression</a> language 
based on the output of the <a href="/community/odrl/">ODRL Community 
Group</a>.</p>

Glossary
========

Consider merging Data Format and Representation since they both refer to 
the same definition.

Phil

On 16/04/2016 17:03, Bernadette Farias Lóscio wrote:
> Hi Phil,
>
> Thanks a lot for your detailed review! Your comments and suggestions are
> really important to improve the document. Please, find some comments below.
>
> 2016-04-15 4:32 GMT-03:00 Phil Archer <phila@w3.org>:
>
>> As flagged, I've been working on my native speaker review of the doc and,
>> in doing so, have been paying very close attention to the text. This leads
>> me to make a number of comments that go beyond simple native speaker edits
>> and are therefore ones that should be assessed like any other comment.
>>
>> My review begins at the Data Formats section.
>>
>> #MachineReadableStandardizedFormat
>> ===================================
>>
>> There is no definition of 'machine readable', or of proprietary software.
>> "computational tools typically available in the relevant domain" will
>> surely include .docx and .xlsx, for example.
>>
>> I looked at the Wikipedia page, which links to a doc from the US government
>> (https://en.wikipedia.org/wiki/Machine-readable_data). From that I suggest
>> the following:
>>
>>
>> <p>There is an important distinction between formats that can be read and
>> edited by humans using a computer and formats that are <em>machine
>> readable</em>. The latter term implies that the data is readily extracted,
>> transformed and processed by a computer. The following definition of
>> machine readable is based on the one given by the US Office of Management
>> and Budget in its Preparation and Submission of Strategic Plans, Annual
>> Performance Plans, and Annual Program Performance Reports
>> [[OMB-A11]].</p>
>> <p><strong>Machine readable</strong>: A format in a standard computer
>> language (not natural language text) that can be read automatically by a
>> computer system. Traditional word processing documents and portable
>> document format (PDF) files are easily read by humans but typically are
>> difficult for machines to interpret. Formats such as XML, JSON, NetCDF, RDF
>> or spreadsheets with header columns that can be exported as CSV are machine
>> readable formats.</p>
>>
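
A small sketch of what "read automatically by a computer system" means in practice for the CSV case. The column names and values are invented for illustration; the only assumption is that the first line of the file is a header row.

```python
import csv
import io

# Invented sample standing in for a spreadsheet exported as CSV.
CSV_TEXT = "station,temp_c\nA,10.5\nB,12.0\n"


def read_rows(text):
    """Parse CSV text whose first line is a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(CSV_TEXT)
# Values arrive as strings, so convert before doing arithmetic.
average = sum(float(r["temp_c"]) for r in rows) / len(rows)
```

Getting the same numbers out of a word processing document or a PDF would need layout-specific scraping, which is the distinction the definition is drawing.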
>>
>> Biblio entry
>>
>>         "OMB-A11": {
>>           "title": "Preparation and Submission of Strategic Plans, Annual Performance Plans, and Annual Program Performance Reports",
>>           "href": "https://www.whitehouse.gov/sites/default/files/omb/assets/a11_current_year/s200.pdf",
>>           "date": "2015",
>>           "publisher": "Office of Management and Budget (OMB)",
>>           "id": "OMB Circular A-11"
>>         }
>>
>
> I suggest including the first paragraph in the Why section of the BP and
> the second one in the glossary.
>
>
>
>> #MultipleFormats
>> ================
>>
>> Suggest that the intended outcome could be worded along the lines of:
>>
>> "As many users as possible will be able to use the data without first
>> having to transform it into their preferred format."
>>
>> I have many similar comments on intended outcomes. I think they should be
>> statements of the specific benefit that is gained, so "to enable X" rather
>> than "Doing X will enable Y."
>>
>
> I agree! We were not sure about the best way to present the intended
> outcomes. We're gonna review the other BP considering your proposal.
>
>
>>
>> I very much dislike the word 'intended' in the sentence: "Consider the
>> data formats most likely to be needed by intended users, and consider
>> alternatives that are likely to be useful in the future." The idea of
>> making data on the Web is that it's up to the user to decide what he/she
>> intends to do with it, not the publisher.
>>
>> Suggest simply making it "Consider the data formats most likely to be
>> needed and consider alternatives that are likely to be useful in the future."
>>
>
> yes, when publishing data on the Web it can be difficult to know the
> "intended users".
>
>
>>
>> #MetadataStandardized
>> =====================
>>
>> Suggest rewording the intended outcome
>>
>> Currently:
>> Standardized code lists and other commonly used terms will enhance
>> interoperability and consensus among data publishers and consumers.
>>
>> Could be:
>> Enhanced interoperability and consensus among data publishers and
>> consumers.
>>
>
> I agree, but before making the change I think we should discuss this
> proposal with Antoine.
>
>
>> #ReuseVocabularies
>> ==================
>> Again, the intended outcome could be worded more succinctly I think.
>>
>> "Using the same vocabulary to describe metadata will make datasets and
>> metadata sets easier to be compared by humans or machines. When two
>> datasets or metadata sets use the same vocabulary, (automatic) processing
>> tools designed for one can be more easily applied to the other. This
>> greatly facilitates re-use of datasets"
>>
>> could be simply
>>
>> To make datasets and metadata easier to compare and integrate by humans or
>> machines.
>>
>> (I added 'and integrate', which I personally think is important but this
>> is more than an editorial change).
>>
>
> I agree, but before making the change I think we should discuss this
> proposal with Antoine.
>
>
>>
>> #ChooseRightFormalizationLevel
>> ==============================
>>
>> I would word the intended outcome as:
>>
>> The data supports a wide range of application cases but is not more
>> complex to produce and reuse than necessary, or, to paraphrase Albert
>> Einstein, "Everything should be made as simple as possible, but no simpler."
>>
>> The Einstein line is often quoted but, like so many quotations, is
>> probably a misquote.
>>
>> And I'd say that the how to test line would be improved by using the word
>> 'typical' rather than target:
>>
>> For formal knowledge representation languages, applying an inference
>> engine on top of the data that uses a given vocabulary does not produce too
>> many statements that are unnecessary for typical applications.
>>
>
> I'm gonna send a specific message to Antoine asking for feedback about your
> proposed changes.
>
>
>> #Sensitive
>> ==========
>>
>> I'd word the intended outcome as:
>>
>> "To enable data consumers to know that data that is referred to from the
>> current dataset is unavailable or only available under different
>> conditions."
>>
>> I changed the reference to HTTP status code 404 to 303 (see other) when
>> doing the native speaker review. I *really* don't want us to include
>> deliberate 404s as a Best Practice :-(
>>
>
> ok!
>
>
>> #BulkAccess
>> ===========
>>
>> I don't think this should only refer to cases where data is spread across
>> multiple locations. I think it should also cover the simple case of making
>> a file available, as opposed to only providing an API. This is in addition
>> to, not instead of what is written about multiple locations - which I think
>> is very good.
>>
>> I'd phrase the intended outcome as:
>>
>> "Bulk download enables developers to access the complete dataset for local
>> processing without the need for further calls to the Web."
>>
>
> I propose to complement the Why section to include "the simple case of
> making a file available". For the intended outcome I propose:
>
> "To enable developers to access the complete dataset for local processing
> without the need for further calls to the Web."
>
>
>> #ProvideSubsets
>> ===============
>>
>> The intended outcome section is too long IMO. All the content is valid, I
>> just think some of it could be moved to the Why section.
>>
>> Really not sure about including an example of making a set of PDFs available.
>>
>
> I agree with you! I already discussed the PDF point with Annette. Let's
> discuss this issue with Annette.
>
>
>>
>> #Conneg
>> =======
>>
>> In tidying up the language of this BP I pretty much rewrote it. I hope
>> without changing your meaning significantly.
>>
>> I suggest the intended outcome could be phrased as: "To enable different
>> representations of the same resource to be served from the same URI
>> according to the request made by the client."
>>
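
For readers who want to see the mechanism, here is a simplified sketch of server-side selection based on the Accept header. It is deliberately incomplete: it handles q-values and the "*/*" wildcard but ignores partial wildcards such as "text/*" and other media type parameters. The function names and the list of available formats are hypothetical.

```python
def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        prefs.append((fields[0], q))
    # sorted() is stable, so ties keep the order the client listed them in.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_representation(accept_header, available):
    """Pick the first available media type the client accepts, or None."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]  # client takes anything: serve the default
        if media_type in available:
            return media_type
    return None


# Hypothetical set of representations offered at one URI.
AVAILABLE = ["text/csv", "application/json", "text/turtle"]
chosen = choose_representation("text/turtle;q=0.9, application/json", AVAILABLE)
```

Here the client gets application/json because it carries the higher implicit q of 1.0. When the function returns None, a server would typically answer with 406 Not Acceptable or fall back to a default representation.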
>
> I'm gonna check with Newton if he is ok with your proposal.
>
>
>> #AccessRealTime
>> ===============
>>
>> I would word the intended outcome as:
>>
>> "To enable applications to access time-critical data in real time or near
>> real time, where real-time means a range from milliseconds to a few seconds
>> after the data creation, and near real time is a predetermined delay for
>> expected data delivery."
>>
>
> I agree!
>
>
>> #AccessUptoDate
>> ===============
>>
>> I think this sentence: "The international date format is recommended to
>> avoid any ambiguity <a href="
>> https://www.w3.org/International/questions/qa-date-format">
>> https://www.w3.org/International/questions/qa-date-format</a>."
>>
>> Would be better as:
>>
>> "Datestamps should be formatted using the XML Schema <a
>> href="/TR/xmlschema11-2/#dateTimeStamp">dateTimeStamp</a> datatype
>> [[xmlschema11-2]]."
>>
>> Although I note that the NOAA example uses the horrible "Mar, 3rd 2016 at
>> 9:03:07 pm PST" format which breaks this advice :-(
>>
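
To show what the suggested format looks like, this sketch renders the NOAA example time as a dateTimeStamp value. It assumes PST means a fixed offset of UTC-08:00, and the helper name to_datetimestamp is made up. The dateTimeStamp datatype differs from dateTime in that the time zone is mandatory, hence the check.

```python
from datetime import datetime, timedelta, timezone


def to_datetimestamp(moment):
    """Format a time-zone-aware datetime in the xsd:dateTimeStamp lexical form."""
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("dateTimeStamp requires an explicit time zone")
    return moment.isoformat()


# Assumption: PST is treated as a fixed UTC-08:00 offset.
PST = timezone(timedelta(hours=-8))
stamp = to_datetimestamp(datetime(2016, 3, 3, 21, 3, 7, tzinfo=PST))
print(stamp)  # 2016-03-03T21:03:07-08:00
```

Unlike the prose form, this value sorts correctly as a string within one time zone and can be parsed without knowing the publisher's locale.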
>
> I was discussing this BP with Annette and I think we should make more
> updates. I'm gonna try to rewrite a proposal.
>
>
>> #documentYourAPI
>> ================
>>
>> I'd write the intended outcome as:
>>
>> "Developers can obtain detailed information about each call to the API,
>> including the parameters it takes and what it is expected to return."
>>
>
> I like it! I think the current version is too long.
>
>
>>
>> #documentYourAPI
>> ================
>>
>> This is very spatial, ideally we should have some non-spatial examples as
>> well. I can tell this came from Linda and Jeremy et al :-)
>>
>
> I think the example section is a mixture of approach to implementation and
> examples. We're gonna review this and make a proposal.
>
>
>>
>> #EvaluateCoverage
>> =================
>>
>> I'd phrase the intended outcome as
>>
>> "To enable data consumers to appreciate the coverage and external
>> dependencies of a given dataset."
>>
>
> I agree!
>
>
>>
>> #Serialisation
>> ==============
>>
>> Intended outcome suggestion:
>>
>> To enable machines to process a dataset even if the original software that
>> was used to create it is no longer available or supported.
>>
>
> I agree!
>
>
>
>> More later
>>
>
> Looking forward to your comments!
>
> We're gonna wait until we have more feedback from the group to see if we
> have contradictory comments or proposals. Then we're gonna present to the
> group the proposed updates based on the members' feedback.
>
> Thanks a lot!
>
> Berna
>
>>
>> Phil.
>>
>>
>>
>>
>> --
>>
>>
>> Phil Archer
>> W3C Data Activity Lead
>> http://www.w3.org/2013/data/
>>
>> http://philarcher.org
>> +44 (0)7887 767755
>> @philarcher1
>>
>>
>
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
Received on Tuesday, 19 April 2016 13:29:59 UTC