Re: partial review

Some comments below on Annette's comments before I do the native speaker 
review of the top half of the doc; I don't want to replicate Annette's work.

On 15/04/2016 03:12, Annette Greiner wrote:
[..]

>
>
> General issues
> --
>
> Possible approaches to implementation should not include the word
> "should". That implies normativeness. This is a general issue with
> implementation sections. We say in the Audience section that "The
> normative element of each best practice is the intended outcome."
>
> Subtitles should all be written in the same mode. (Mine were written in
> imperative -- "do this, don't do that", but most are declarative --
> "this should be done".) I think imperative is better, because it gets
> away from RFC2119 keywords, which we voted not to use. It becomes a call
> to action, which is our goal, right?

+1

>
>
> 1. provide metadata
> --
>
> The intended outcome is "Human-readable metadata will enable humans to
> understand the metadata and machine-readable metadata will enable
> computer applications, notably user agents, to process the metadata."
> This is tautological. Metadata is necessary because, without it, the
> data will have no context or meaning.
>

+1

I'd write the intended outcome as simply:

Humans and machines are able to understand the data.

> Possible approach to implementation should not include the word
> "should". Also, I disagree that "If multiple formats are published
> separately, they should be served from the same URL using content
> negotiation." publishing multiple files is also reasonable, and it's
> even what we used in all our examples about metadata. (in BP2, the
> machine readable example gives the name of the distribution as
> bus-stops-2015-05-05.csv; in BP4, the entire URI is given, ending in
> .csv, etc.)

I think BP21 (#conneg) gets it right. You assign a URI to the dataset 
and use conneg to return whatever is the most appropriate version. 
However, you *also* provide direct URIs for each version, which bypass 
the conneg. So you'd have

http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops 
that returned either CSV or JSON, but if you want one of those 
specifically, then you go to 
http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops.csv 
etc.
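To make that mechanism concrete, here's a minimal, hypothetical sketch of the server-side logic (the media types and file names are my own illustration, not anything in the BP doc):

```python
# Hypothetical sketch of conneg for a dataset URI: the generic URI
# returns whichever distribution best matches the Accept header, while
# the format-specific URIs (…/bus-stops.csv) bypass negotiation entirely.

DISTRIBUTIONS = {
    "text/csv": "bus-stops.csv",           # direct URI: …/bus-stops.csv
    "application/json": "bus-stops.json",  # direct URI: …/bus-stops.json
}

def negotiate(accept_header, default="text/csv"):
    """Return the file for the first acceptable media type, else the default."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()  # drop q-values etc.
        if media_type in DISTRIBUTIONS:
            return DISTRIBUTIONS[media_type]
    return DISTRIBUTIONS[default]
```

So a request for the generic dataset URI with `Accept: application/json` gets the JSON distribution, and anything unrecognised falls back to the CSV.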


>
>
> 2. descriptive metadata
> --

I would word the intended outcome as:

Humans and machines can discover the dataset; humans can understand the 
nature of the data.


>
> There is an inconsistency between the suggestion that one should use
> content negotiation for different formats (csv vs. rdf) and the .
> :mobility and :themes are referred to as URIs, but they are not URIs. (I
> know DCAT did this, but I think it's a mistake; colons are not legal in
> the first segment of a relative URI.)

I see the point. I'm so used to seeing that notation from Turtle. I'd 
slightly reword that section and take out the colons as they refer 
specifically to Turtle representation. I end up with:

<p>The example below shows how to use [[VOCAB-DCAT]] to provide the 
machine readable <strong> discovery </strong> metadata for the bus stops 
dataset (<code>bus-stops-2015-05-05</code>). The dataset has one CSV 
distribution (<code>bus-stops-2015-05-05.csv</code>) that is also 
described using DCAT. The dataset is classified under the SKOS concept 
represented by <code>mobility</code>. This concept may be defined as 
part of a concept scheme (<code>themes</code>).</p>

<p>To express the update frequency, John used the SDMX Content Oriented 
Guidelines as described in the RDF Data Cube vocabulary 
[[VOCAB-DATA-CUBE]]; and for the spatial and temporal coverage, URIs 
from <a href="http://www.geonames.org/">Geonames</a> and the UK 
government's <a href="http://reference.data.gov.uk/id/interval">Interval 
dataset</a> from data.gov.uk, respectively.</p>
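For reference, a minimal Turtle sketch of what that prose describes might look like the following (the prefix URI, title and media type are my illustrative assumptions, not the actual example file):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix :     <http://data.mycity.example.com/transport/metadata/> .

:bus-stops-2015-05-05 a dcat:Dataset ;
    dct:title "Bus stops of MyCity" ;
    dcat:theme :mobility ;
    dcat:distribution :bus-stops-2015-05-05.csv .

:bus-stops-2015-05-05.csv a dcat:Distribution ;
    dcat:mediaType "text/csv" .

:mobility a skos:Concept ;
    skos:inScheme :themes .
```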


And it's not inconsistent with the advice on conneg IMO for 2 reasons:

1. Setting up conneg on the dataset URI directs you to the specific 
distribution, which is fine.

2. Distributions and datasets are not disjoint.

>
>
> 3. locale parameters
> --

I think the Why section is unnecessarily repetitive. A textual example 
might also clarify things a little. I suggest:

<p>Providing <a href="#locale_parameter">locale</a> parameters helps 
humans and computer applications to work accurately with things like 
dates, currencies and numbers that may look similar but have different 
meanings in different locales. For example, the 'date' 4/7 can be read 
as the 7th of April or the 4th of July depending on where the data was 
created. Similarly, €2,000 is either two thousand Euros or an 
over-precise representation of two Euros. Making the locale and language 
explicit allows users to determine how readily they can work with the 
data and may enable automated translation services.</p>
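The date ambiguity in that paragraph is easy to demonstrate; a quick illustration, using Python's strptime purely as an example:

```python
from datetime import datetime

# The same string "4/7/2016" parsed under two different locale conventions:
as_day_first = datetime.strptime("4/7/2016", "%d/%m/%Y")    # e.g. en-GB
as_month_first = datetime.strptime("4/7/2016", "%m/%d/%Y")  # e.g. en-US

print(as_day_first.date())    # 2016-07-04 (the 4th of July)
print(as_month_first.date())  # 2016-04-07 (the 7th of April)
```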



My wording for the intended outcome:

To enable humans and software agents to accurately interpret the meaning 
of strings representing dates, times, currencies, numbers etc.

>
> The human-readable example for the first three BPs is exactly the same.
> Can we make the examples more specific (maybe include them in the doc
> rather than link to one big external example)? The ttl in the
> machine-readable example could be trimmed to just the bold parts.

+1. All the data is in the HTML and TTL files; just highlight the 
relevant bits by including those, and those only, in the main doc.


Incidentally, I expect to set up conneg between those two files, yes?


4A. Provide structural metadata

I think the why section could be stronger:

<p>Providing information about the internal structure of a distribution 
is essential for others wishing to explore or query the dataset. It also 
helps people to understand the meaning of the data.</p>

My intended outcome wording:

To enable humans to interpret the schema of a dataset and software 
agents to automatically process distributions.

NB: I removed the second instance of the word 'schema' in that sentence, 
which I think was a mistake.


>
>
> 5. Licenses
> --
>
> We say "the license of a dataset can be specified within the data". I
> think we mean within the *metadata*.

+1 Suggested rewording:

<p>The presence of license information is essential for data consumers 
to assess the usability of data. User agents may use the 
presence/absence of license information as a trigger for inclusion or 
exclusion of data presented to a potential consumer.</p>


> The "Why" misuses the phrase "for example." User agent actions are not
> an example of data consumer actions.
> We say "Data license information can be provided as a link to a
> human-readable license or as a link/embedded machine-readable license."
> Since licensing info is part of metadata, and we tell people to provide
> metadata for both humans and machines, we should also require licensing
> info for both humans and machines.
>
>
> 6. Provenance
> --

I think the first paragraph of the intro section can be removed and the 
glossary link added to the 2nd, like:

<p>The Web brings together business, engineering, and scientific 
communities, creating collaborative opportunities that were previously 
unimaginable. The challenge in publishing data on the Web is providing 
an appropriate level of detail about its origin. The <a 
href="#data_producer">data producer</a> may not necessarily be the data 
provider and so collecting and conveying this corresponding metadata is 
particularly important. Without <a 
href="#data_provenance">provenance</a>, consumers have no inherent way 
to trust the integrity and credibility of the data being shared. Data 
publishers in turn need to be aware of the needs of prospective consumer 
communities to know how much provenance detail is appropriate. </p>



>
> The "Why" is pretty sparse and essentially says the same thing as the
> intended outcome. I think we could make it stronger. "Provenance is one
> means by which consumers of a dataset judge its quality. Understanding
> its origin and history helps one determine whether to trust the data and
> provides important interpretive context."
>

+1

My suggested wording for the intended outcome is:

To enable humans to know the origin or history of the dataset and to 
enable software agents to automatically process provenance information.


> The example links to the metadata example page. It would be more helpful
> to put the provenance-specific info into the BP doc itself.

+1, as noted above.


>
>
> 7. Quality
> --

Slight rewording of the intro paragraph:

<p>The quality of a dataset can have a big impact on the quality of 
applications that use it. As a consequence, the inclusion of <a 
href="#data_quality">data quality</a> information in data publishing and 
consumption pipelines is of primary importance. Usually, the assessment 
of quality involves different kinds of quality dimensions, each 
representing groups of characteristics that are relevant to publishers 
and consumers. The Data Quality Vocabulary defines concepts such as 
measures and metrics to assess the quality for each quality dimension 
[[VOCAB-DQV]]. There are heuristics designed to fit specific assessment 
situations that rely on quality indicators, namely, pieces of data 
content, pieces of data meta-information, and human ratings that give 
indications about the suitability of data for some intended use.</p>



>
> We say "Data quality information will enable humans to know the quality
> of the dataset and its distributions, and software agents to
> automatically process quality information about the dataset and its
> distributions." That's rather tautological. We could say something about
> enabling humans to determine whether the dataset is suitable for their
> purposes.

Annette and I are in agreement here. I'd phrase the intended outcome as:

To enable people and software to assess the quality, and therefore the 
suitability, of a dataset for their application.


>
> We probably should refer to DQV as a finished thing, as it will be soon.

+1

I suggest:

<p>The machine-readable version of the dataset quality metadata may
be provided using the Data Quality Vocabulary developed by the <abbr 
title="Data on the Web Best Practices">DWBP</abbr> working group 
[[VOCAB-DQV]].</p>


>
> The human-readable example links to the metadata one.

(which I think is deliberate?)

>
>
> 8. Versioning
> --

Looking at the intro material I think I could probably find people to 
argue that all three of those scenarios are simply corrections rather 
than new versions. But then, as you say, there is no consensus :-)

I would phrase the intended outcome as:

To enable humans and software agents to easily determine which version 
of a dataset they are working with.

>
> Of the four implementation bullets, only the last is really a possible
> approach. The first three belong in the intended outcome.

Unusually, I disagree with Annette here. For me, intended outcomes are 
short "this is what will be possible." The implementation steps are how 
you make it so, which I think you have in this case.


>
> The human-readable example links to the metadata one. The version
> history there lists only 1.1, which is illogical. (1.0 must exist at
> least.)
>
>
> 9. Version history
> --
>
> The human-readable example links to the metadata one. The version
> history there lists only 1.1, which is illogical. (1.0 must exist at
> least.) This example doesn't meet the requirements of the BP.
>
> Neither the ttl version nor the Memento example provides a full version
> history, only a list of versions released. This BP is intended to be
> about providing the details of what changed.

Hmm... Not sure how we'd offer advice on providing machine readable 
deltas. It's possible of course, but it's a bit of a stretch for our 
current work and will be very context dependent. And there is more info 
in Memento than I think is warranted.

Sorry, but I'm inclined to suggest that the second versioning BP be 
dropped, and that we add to the first one something about at least 
including a free-text description of what has changed between the 
various versions.

>
>
> Intro to Identifiers
> --
>
> Intro item 5 refers to an API which could be confusing, since we talk
> about APIs as web APIs elsewhere.

OK, so how about this (shorter) alternative:

De-referencing a URI triggers a computer program to run on a server that 
may do something as simple as return a single, static file, or it may 
carry out complex processing. Precisely what processing is carried out, 
i.e. the software on the server, is completely independent of the URI 
itself.


>
>
> 10. Persistent URIs as identifiers
> --
>
> We say "This requires a different mindset to that used when creating a
> Web site designed for humans to navigate their way through." When
> creating a web site for humans to navigate, one should also consider
> persistence, so that sentence is not strictly accurate.

I agree. OK, delete that sentence so it's just:

To be persistent, URIs must be designed as such. A lot has been written 
on this topic, see, for example, the European Commission's Study on 
Persistent URIs [PURI] which in turn links to many other resources.

I'll come back to the remainder of Annette's comments tomorrow as I can 
see they need more energy than I can muster this evening.

Phil.




>
> The example uses the city domain instead of the transport agency's
> domain, which is not realistic for a large city. The agency domain is
> likely to persist as long as the information it makes available is
> relevant. Try Googling "transit agency" and see what comes up for domain
> names. The issue depends on how stable the transit service is. For a
> small town, the transit function might not be given over to a separate
> agency, and the guidance would be right, but for a big city, where the
> transit function is run by an independent agency, it's not realistic.
>
> The example is rather redundant. It is data.mycity..., and yet /dataset
> also appears in the path. The path also contains /bus as well as
> /bus-stops. It's unlikely that the agency has so many transit modes that
> they need to be split between road and rail and water. The same info is
> conveyed as well by the much shorter
> http://data.mycitytransit.example.org/bus/stops
>
> We say "Ideally, the relevant Web site includes a description of the
> process..." I think we mean a controlled scheme.
>
>
> 11. Persistent URIs within datasets
> --
>
> The word "affordances" is misused. Affordances are how we know what
> something is intended to do, not what the thing does. Affordances do not
> act on things, they inform.
>
> The intended outcome should be a free-standing piece of text. Starting
> with "that one item" is confusing.
>
> Much of the implementation section is about minting new URIs, which is
> the subject of the previous BP. It is off topic here. Everything from
> "If you can't find an existing set of identifiers that meet your needs,
> you'll need to create your own" down to the end of the example doesn't
> belong in a BP that is about using other people's identifiers.
>
> The last paragraph of the example is almost exactly the same as the last
> paragraph before the example.
>
>
> 12.  URIs for versions and series
> --
>
> This BP is confusing two issues. One is the use of a shorter URI for the
> latest version of a dataset while also assigning a version-specific URI
> for it. The other issue is making a landing page for a collection of
> datasets. The initial intent was the former.
>
> The examples in the Why aren't series or groups except for the first
> item, yet they are introduced as examples of series or groups.
>
> How to Test says to check "that logical groups of datasets are also
> identifiable." That is vague. It should say "that a URI is also provided
> for the latest version or most recent real-time value."
>
> I don't think this applies to time series. What we're talking about here
> is use of dates for version identifiers.
>
> The example is incomplete; it doesn't say what the latest version URI
> would be.
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1

Received on Monday, 18 April 2016 21:28:19 UTC