partial review

Whew, I've gotten through section 8.7. This is taking way too long, so 
I'm going to stop at this point and put this out. Following are things 
that I noticed in a partial read-through of the BP document.
-Annette


General issues
--

Possible approaches to implementation should not include the word 
"should". That implies normativeness. This is a general issue with 
implementation sections. We say in the Audience section that "The 
normative element of each best practice is the intended outcome."

Subtitles should all be written in the same mode. (Mine were written in 
imperative -- "do this, don't do that", but most are declarative -- 
"this should be done".) I think imperative is better, because it gets 
away from RFC2119 keywords, which we voted not to use. It becomes a call 
to action, which is our goal, right?


1. provide metadata
--

The intended outcome is "Human-readable metadata will enable humans to 
understand the metadata and machine-readable metadata will enable 
computer applications, notably user agents, to process the metadata."
This is tautological. Metadata is necessary because, without it, the 
data will have no context or meaning.

Possible approach to implementation should not include the word 
"should". Also, I disagree that "If multiple formats are published 
separately, they should be served from the same URL using content 
negotiation." Publishing multiple files is also reasonable, and it's 
even what we used in all our examples about metadata. (In BP2, the 
machine readable example gives the name of the distribution as 
bus-stops-2015-05-05.csv; in BP4, the entire URI is given, ending in 
.csv, etc.)
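For the record, the separate-files approach our examples already take is perfectly expressible in DCAT. A sketch along the lines of the BP2 example (URIs hypothetical, file name taken from BP2):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix :     <http://data.mycity.example.com/dataset/> .

:bus-stops a dcat:Dataset ;
    dct:title "Bus stops of MyCity" ;
    # one distribution per published format, each at its own URL
    dcat:distribution :bus-stops-csv , :bus-stops-ttl .

:bus-stops-csv a dcat:Distribution ;
    dcat:downloadURL <http://data.mycity.example.com/bus-stops-2015-05-05.csv> ;
    dcat:mediaType "text/csv" .

:bus-stops-ttl a dcat:Distribution ;
    dcat:downloadURL <http://data.mycity.example.com/bus-stops-2015-05-05.ttl> ;
    dcat:mediaType "text/turtle" .
```

No content negotiation needed; each distribution is directly addressable.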


2. descriptive metadata
--

There is an inconsistency between the suggestion that one should use 
content negotiation for different formats (csv vs. rdf) and the 
examples, which serve each format from its own URL.
:mobility and :themes are referred to as URIs, but they are not URIs. (I 
know DCAT did this, but I think it's a mistake; colons are not legal in 
the first segment of a relative URI.)
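If we keep the short names, we should at least show the prefix declaration that makes them valid prefixed names in Turtle, e.g. (namespace URI made up):

```turtle
@prefix :    <http://data.mycity.example.com/def/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# With the declaration above, :mobility is a prefixed name abbreviating
# <http://data.mycity.example.com/def/mobility> -- not a relative URI.
:mobility a skos:Concept .
:themes   a skos:ConceptScheme .
```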


3. locale parameters
--

The human-readable example for the first three BPs is exactly the same. 
Can we make the examples more specific (maybe include them in the doc 
rather than link to one big external example)? The ttl in the 
machine-readable example could be trimmed to just the bold parts.


5. Licenses
--

We say "the license of a dataset can be specified within the data". I 
think we mean within the *metadata*.
The "Why" misuses the phrase "for example." User agent actions are not 
an example of data consumer actions.
We say "Data license information can be provided as a link to a 
human-readable license or as a link/embedded machine-readable license." 
Since licensing info is part of metadata, and we tell people to provide 
metadata for both humans and machines, we should also require licensing 
info for both humans and machines.
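A sketch of what requiring both might look like in the metadata (URIs hypothetical; CC BY chosen only for illustration):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix :     <http://data.mycity.example.com/dataset/> .

:bus-stops a dcat:Dataset ;
    # machine-readable: a dereferenceable license URI in the metadata
    dct:license <http://creativecommons.org/licenses/by/4.0/> ;
    # human-readable: a prose statement of the terms
    dct:rights "Available under the Creative Commons Attribution 4.0 license (CC BY 4.0)." .
```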


6. Provenance
--

The "Why" is pretty sparse and essentially says the same thing as the 
intended outcome. I think we could make it stronger. "Provenance is one 
means by which consumers of a dataset judge its quality. Understanding 
its origin and history helps one determine whether to trust the data and 
provides important interpretive context."

The example links to the metadata example page. It would be more helpful 
to put the provenance-specific info into the BP doc itself.
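Even a few triples of PROV-O inline would do the job, e.g. (all names hypothetical):

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix :     <http://data.mycity.example.com/dataset/> .

:bus-stops a prov:Entity ;
    # who produced the data, and by what activity
    prov:wasAttributedTo :mycity-transit-agency ;
    prov:wasGeneratedBy  :stop-survey-2015 ;
    dct:issued "2015-05-05"^^xsd:date .
```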


7. Quality
--

We say "Data quality information will enable humans to know the quality 
of the dataset and its distributions, and software agents to 
automatically process quality information about the dataset and its 
distributions." That's rather tautological. We could say something about 
enabling humans to determine whether the dataset is suitable for their 
purposes.

We probably should refer to DQV as a finished thing, as it will be soon.

The human-readable example links to the metadata one.


8. Versioning
--

Of the four implementation bullets, only the last is really a possible 
approach. The first three belong in the intended outcome.

The human-readable example links to the metadata one. The version 
history there lists only 1.1, which is illogical. (At least a version 
1.0 must exist.)


9. Version history
--

The human-readable example again links to the metadata one, whose 
version history lists only 1.1, which is illogical. (At least a version 
1.0 must exist.) This example doesn't meet the requirements of the BP.

Neither the ttl version nor the Memento example provides a full version 
history, only a list of versions released. This BP is intended to be 
about providing the details of what changed.
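What I'd expect is something that records what changed in each version, not just that versions exist. A sketch of the kind of thing I mean (URIs, dates, and change notes are all made up):

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix :     <http://data.mycity.example.com/dataset/> .

# each versioned dataset carries a note describing its changes
:bus-stops-1.1 dct:isVersionOf :bus-stops ;
    owl:versionInfo "1.1" ;
    dct:issued "2015-06-01"^^xsd:date ;
    rdfs:comment "Corrected the coordinates of two stops; added an accessibility column." .

:bus-stops-1.0 dct:isVersionOf :bus-stops ;
    owl:versionInfo "1.0" ;
    dct:issued "2015-05-05"^^xsd:date ;
    rdfs:comment "Initial release." .
```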


Intro to Identifiers
--

Intro item 5 refers to an API which could be confusing, since we talk 
about APIs as web APIs elsewhere.


10. Persistent URIs as identifiers
--

We say "This requires a different mindset to that used when creating a 
Web site designed for humans to navigate their way through." When 
creating a web site for humans to navigate, one should also consider 
persistence, so that sentence is not strictly accurate.

The example uses the city domain instead of the transport agency's 
domain, which is not realistic for a large city. The agency domain is 
likely to persist as long as the information it makes available is 
relevant. (Try Googling "transit agency" and see what comes up for 
domain names.) The issue depends on how stable the transit service is: 
for a small town, where the transit function might not be given over to 
a separate agency, the guidance would be right, but for a big city, 
where the transit function is run by an independent agency, it's not 
realistic.

The example is rather redundant. It is data.mycity..., and yet /dataset 
also appears in the path. The path also contains /bus as well as 
/bus-stops. It's unlikely that the agency has so many transit modes that 
they need to be split between road and rail and water. The same info is 
conveyed as well by the much shorter
http://data.mycitytransit.example.org/bus/stops

We say "Ideally, the relevant Web site includes a description of the 
process..." I think we mean a controlled scheme.


11. Persistent URIs within datasets
--

The word "affordances" is misused. Affordances are how we know what 
something is intended to do, not what the thing does. Affordances do not 
act on things; they inform.

The intended outcome should be a free-standing piece of text. Starting 
with "that one item" is confusing.

Much of the implementation section is about minting new URIs, which is 
the subject of the previous BP. It is off topic here. Everything from 
"If you can't find an existing set of identifiers that meet your needs, 
you'll need to create your own" down to the end of the example doesn't 
belong in a BP that is about using other people's identifiers.

The last paragraph of the example is almost exactly the same as the last 
paragraph before the example.


12. URIs for versions and series
--

This BP is confusing two issues. One is the use of a shorter URI for the 
latest version of a dataset while also assigning a version-specific URI 
for it. The other issue is making a landing page for a collection of 
datasets. The initial intent was the former.

Except for the first item, the examples in the Why aren't series or 
groups, yet they are introduced as such.

How to Test says to check "that logical groups of datasets are also 
identifiable." That is vague. It should say "that a URI is also provided 
for the latest version or most recent real-time value."

I don't think this applies to time series. What we're talking about here 
is use of dates for version identifiers.

The example is incomplete; it doesn't say what the latest version URI 
would be.
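For instance, the example could state the latest-version URI explicitly. A sketch using the PAV vocabulary (hypothetical URIs, vocabulary chosen only for illustration):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix pav:  <http://purl.org/pav/> .

<http://data.mycitytransit.example.org/bus/stops> a dcat:Dataset ;
    # the short URI always denotes the latest version...
    pav:hasCurrentVersion
        <http://data.mycitytransit.example.org/bus/stops/2015-05-05> ;
    # ...while each dated URI is version-specific
    pav:hasVersion
        <http://data.mycitytransit.example.org/bus/stops/2015-04-01> .
```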

-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory

Received on Friday, 15 April 2016 02:12:39 UTC