Comments on the 9 April version of the BP doc

As flagged, I've been working on my native speaker review of the doc 
and, in doing so, have been paying very close attention to the text. 
This leads me to make a number of comments that go beyond simple native 
speaker edits and are therefore ones that should be assessed like any 
other comment.

My review begins at the Data Formats section.

#MachineReadableStandardizedFormat
===================================

There is no definition of 'machine readable', or of proprietary 
software. The phrase "computational tools typically available in the 
relevant domain" will surely include .docx and .xlsx, for example.

I looked at the Wikipedia page, which links to a doc from the US 
government: https://en.wikipedia.org/wiki/Machine-readable_data. From 
that I suggest the following:


<p>There is an important distinction between formats that can be read 
and edited by humans using a computer and formats that are <em>machine 
readable</em>. The latter term implies that the data is readily 
extracted, transformed and processed by a computer. The following 
definition of machine readable is based on that provided by the US 
Office of Management and Budget in its Preparation and Submission of 
Strategic Plans, Annual Performance Plans, and Annual Program 
Performance Reports [[OMB-A11]].</p>
<p><strong>Machine readable</strong>: A format in a standard computer 
language (not natural language text) that can be read automatically by a 
computer system. Traditional word processing documents and portable 
document format (PDF) files are easily read by humans but typically are 
difficult for machines to interpret. Formats such as XML, JSON, NetCDF, 
RDF or spreadsheets with header columns that can be exported as CSV are 
machine readable formats.</p>
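
To illustrate the distinction (this is only a sketch, not part of the 
proposed text; the sample data and the read_rows name are made up): a 
spreadsheet exported as CSV with a header row can be read straight into 
named fields and processed, with no human interpretation of layout.

```python
import csv
import io

# Made-up sample data: a spreadsheet exported as CSV with a header row.
SAMPLE_CSV = (
    "station,date,temperature_c\n"
    "A1,2016-04-09,11.5\n"
    "B2,2016-04-09,9.0\n"
)


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE_CSV)
# Each value is addressable by column name, so further processing is trivial.
average = sum(float(row["temperature_c"]) for row in rows) / len(rows)
print(average)
```

The same table in a word processing document or PDF would first need 
its structure to be guessed before any such calculation could be run.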


Biblio entry

        "OMB-A11": {
            "title": "Preparation and Submission of Strategic Plans, Annual Performance Plans, and Annual Program Performance Reports",
            "href": "https://www.whitehouse.gov/sites/default/files/omb/assets/a11_current_year/s200.pdf",
            "date": "2015",
            "publisher": "Office of Management and Budget (OMB)",
            "id": "OMB Circular A-11"
        }

#MultipleFormats
================

Suggest that the intended outcome could be worded along the lines of:

"As many users as possible will be able to use the data without first 
having to transform it into their preferred format."

I have many similar comments on intended outcomes. I think they should 
be statements of the specific benefit that is gained, so "to enable X" 
rather than "Doing X will enable Y."


I very much dislike the word 'intended' in the sentence: "Consider the 
data formats most likely to be needed by intended users, and consider 
alternatives that are likely to be useful in the future." The idea of 
making data available on the Web is that it's up to the user to decide 
what he/she intends to do with it, not the publisher.

Suggest simply making it "Consider the data formats most likely to be 
needed, and consider alternatives that are likely to be useful in the 
future."


#MetadataStandardized
=====================

Suggest rewording the intended outcome

Currently:
Standardized code lists and other commonly used terms will enhance 
interoperability and consensus among data publishers and consumers.

Could be:
Enhanced interoperability and consensus among data publishers and consumers.

#ReuseVocabularies
==================
Again, the intended outcome could be worded more succinctly I think.

"Using the same vocabulary to describe metadata will make datasets and 
metadata sets easier to be compared by humans or machines. When two 
datasets or metadata sets use the same vocabulary, (automatic) 
processing tools designed for one can be more easily applied to the 
other. This greatly facilitates re-use of datasets"

could be simply

To make datasets and metadata easier to compare and integrate by humans 
or machines.

(I added 'and integrate', which I personally think is important but this 
is more than an editorial change).


#ChooseRightFormalizationLevel
==============================

I would word the intended outcome as:

The data supports a wide range of application cases but is not more 
complex to produce and reuse than necessary, or, to paraphrase Albert 
Einstein, "Everything should be made as simple as possible, but no simpler."

The Einstein line is often quoted but, like so many quotations, is 
probably a misquote.

And I'd say that the 'how to test' line would be improved by using the 
word 'typical' rather than 'target':

For formal knowledge representation languages, applying an inference 
engine on top of the data that uses a given vocabulary does not produce 
too many statements that are unnecessary for typical applications.

#Sensitive
==========

I'd word the intended outcome as:

"To enable data consumers to know that data that is referred to from the 
current dataset is unavailable or only available under different 
conditions."

I changed the reference to HTTP status code 404 to 303 (see other) when 
doing the native speaker review. I *really* don't want us to include 
deliberate 404s as a Best Practice :-(
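
As a rough sketch of the 303 approach (the paths, the lookup table and 
the respond function are all hypothetical, not taken from the BP doc): a 
request for a withheld resource is redirected to a page explaining why 
it is unavailable, rather than being answered with a 404.

```python
from http import HTTPStatus

# Hypothetical mapping from withheld resources to pages explaining why
# they are unavailable or available only under different conditions.
EXPLANATIONS = {
    "/data/salaries.csv": "/about/salaries-unavailable",
}


def respond(path):
    """Return (status, headers) for a request path."""
    if path in EXPLANATIONS:
        # 303 See Other: point the client at the explanation instead.
        return HTTPStatus.SEE_OTHER, {"Location": EXPLANATIONS[path]}
    return HTTPStatus.OK, {}


status, headers = respond("/data/salaries.csv")
print(int(status), headers["Location"])
```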

#BulkAccess
===========

I don't think this should only refer to cases where data is spread 
across multiple locations. I think it should also cover the simple case 
of making a file available, as opposed to only providing an API. This is 
in addition to, not instead of what is written about multiple locations 
- which I think is very good.

I'd phrase the intended outcome as:

"Bulk download enables developers to access the complete dataset for 
local processing without the need for further calls to the Web."

#ProvideSubsets
===============

The intended outcome section is too long IMO. All the content is valid, 
I just think some of it could be moved to the Why section.

Really not sure about including an example of making a set of PDFs available.


#Conneg
=======

In tidying up the language of this BP I pretty much rewrote it. I hope 
I did so without changing your meaning significantly.

I suggest the intended outcome could be phrased as: "To enable different 
representations of the same resource to be served from the same URI 
according to the request made by the client."
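
A minimal sketch of that behaviour, assuming a server that holds three 
representations of one resource (the AVAILABLE list and both function 
names are made up for illustration). It is deliberately simplified: it 
honours q-values and "*/*" but ignores partial wildcards such as 
"text/*" and the specificity rules of full HTTP content negotiation.

```python
# Hypothetical representations available at a single URI, in the
# server's own order of preference.
AVAILABLE = ["text/csv", "application/json", "text/turtle"]


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(header, available=AVAILABLE):
    """Pick the representation the client prefers, or None if none fits."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None


print(negotiate("text/turtle;q=0.8, application/json"))
```

A None result corresponds to the case where the server cannot satisfy 
the request and would have to fall back to a default or refuse.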

#AccessRealTime
===============

I would word the intended outcome as:

"To enable applications to access time-critical data in real time or 
near real time, where real time means a range from milliseconds to a few 
seconds after the data creation, and near real time is a predetermined 
delay for expected data delivery."

#AccessUptoDate
===============

I think this sentence: "The international date format is recommended to 
avoid any ambiguity <a 
href="https://www.w3.org/International/questions/qa-date-format">https://www.w3.org/International/questions/qa-date-format</a>."

Would be better as:

"Datestamps should be formatted using the XML Schema <a 
href="/TR/xmlschema11-2/#dateTimeStamp">dateTimeStamp</a> datatype 
[[xmlschema11-2]]."

Although I note that the NOAA example uses the horrible "Mar, 3rd 2016 
at 9:03:07 pm PST" format which breaks this advice :-(
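
For comparison, a small sketch of the same instant written as a 
dateTimeStamp (the to_datetimestamp helper is made up for this email, 
and PST is modelled as a fixed UTC-8 offset rather than looked up in a 
time zone database). The key point is that dateTimeStamp requires an 
explicit time zone, so a value without one is rejected.

```python
from datetime import datetime, timedelta, timezone

# PST is UTC-8; a fixed offset is assumed here for simplicity.
PST = timezone(timedelta(hours=-8), "PST")


def to_datetimestamp(dt):
    """Format a time-zone-aware datetime as an xsd:dateTimeStamp string."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        raise ValueError("dateTimeStamp requires an explicit time zone")
    return dt.isoformat()


# "Mar, 3rd 2016 at 9:03:07 pm PST" expressed unambiguously.
stamp = to_datetimestamp(datetime(2016, 3, 3, 21, 3, 7, tzinfo=PST))
print(stamp)  # 2016-03-03T21:03:07-08:00
```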

#documentYourAPI
================

I'd write the intended outcome as:

"Developers can obtain detailed information about each call to the API, 
including the parameters it takes and what it is expected to return."

#documentYourAPI
================

This is very spatial, ideally we should have some non-spatial examples 
as well. I can tell this came from Linda and Jeremy et al :-)

#EvaluateCoverage
=================

I'd phrase the intended outcome as

"To enable data consumers to appreciate the coverage and external 
dependencies of a given dataset."

#Serialisation
==============

Intended outcome suggestion:

To enable machines to process a dataset even if the original software 
that was used to create it is no longer available or supported.

More later

Phil.




-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1

Received on Friday, 15 April 2016 07:32:40 UTC