Re: partial review

Good stuff, Phil.
I think we still need to hash out our differences over conneg. I 
addressed that in another note, though. See a few small notes inline below.
-Annette


On 4/18/16 2:28 PM, Phil Archer wrote:
> Some comments below on Annette's comments before I do the native 
> speaker review on the top half of the doc, I don't want to replicate 
> Annette's work.
>
> On 15/04/2016 03:12, Annette Greiner wrote:
> [..]
>
>>
>>
>> General issues
>> -- 
>>
>> Possible approaches to implementation should not include the word
>> "should". That implies normativeness. This is a general issue with
>> implementation sections. We say in the Audience section that "The
>> normative element of each best practice is the intended outcome."
>>
>> Subtitles should all be written in the same mode. (Mine were written in
>> imperative -- "do this, don't do that", but most are declarative --
>> "this should be done".) I think imperative is better, because it gets
>> away from RFC2119 keywords, which we voted not to use. It becomes a call
>> to action, which is our goal, right?
>
> +1
>
>>
>>
>> 1. provide metadata
>> -- 
>>
>> The intended outcome is "Human-readable metadata will enable humans to
>> understand the metadata and machine-readable metadata will enable
>> computer applications, notably user agents, to process the metadata."
>> This is tautological. Metadata is necessary because, without it, the
>> data will have no context or meaning.
>>
>
> +1
>
> I'd write the intended outcome as simply:
>
> Humans and machines are able to understand the data.
>
>> Possible approach to implementation should not include the word
>> "should". Also, I disagree that "If multiple formats are published
>> separately, they should be served from the same URL using content
>> negotiation." Publishing multiple files is also reasonable, and it's
>> even what we used in all our examples about metadata. (in BP2, the
>> machine readable example gives the name of the distribution as
>> bus-stops-2015-05-05.csv; in BP4, the entire URI is given, ending in
>> .csv, etc.)
>
> I think BP21 (#conneg) gets it right. You assign a URI to the dataset 
> and use conneg to return whatever is the most appropriate version. 
> However, you *also* provide direct URIs for each version that bypass
> the conneg. So you'd have
>
> http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops 
> that returned either CSV or JSON, but if you want one of those 
> specifically, then you go to 
> http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops.csv 
> etc.
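For concreteness, here's roughly what that setup does (a purely illustrative Python sketch; the paths and helper name are made up, and it ignores Accept q-values):

```python
# Sketch of conneg plus direct per-format URIs (illustrative, not from the BP doc).
# The generic dataset URI honours the Accept header; the .csv/.json URIs bypass it.

FORMATS = {".csv": "text/csv", ".json": "application/json"}
DEFAULT = "text/csv"

def choose_representation(path, accept=""):
    """Return the media type to serve for a request path and Accept header."""
    for ext, media_type in FORMATS.items():
        if path.endswith(ext):           # format-specific URI: no negotiation
            return media_type
    for media_type in FORMATS.values():  # generic URI: honour Accept
        if media_type in accept:
            return media_type
    return DEFAULT                       # fall back when Accept is absent/unmatched

base = "/public-transport/road/bus/dataset/bus-stops"
print(choose_representation(base + ".csv"))             # text/csv
print(choose_representation(base, "application/json"))  # application/json
print(choose_representation(base))                      # text/csv
```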
>
>
>>
>>
>> 2. descriptive metadata
>> -- 
>
> I would word the intended outcome as:
>
> Humans and machines can discover the dataset; humans can understand 
> the nature of the data.
>
>
>>
>> There is an inconsistency between the suggestion that one should use
>> content negotiation for different formats (csv vs. rdf) and the use
>> of distinct file URIs in the examples.
>> :mobility and :themes are referred to as URIs, but they are not URIs. (I
>> know DCAT did this, but I think it's a mistake; colons are not legal in
>> the first segment of a relative URI.)
>
> I see the point. I'm so used to seeing that notation from Turtle. I'd 
> slightly reword that section and take out the colons as they refer 
> specifically to Turtle representation. I end up with:
>
> <p>The example below shows how to use [[VOCAB-DCAT]] to provide the 
> machine-readable <strong>discovery</strong> metadata for the bus 
> stops dataset (<code>bus-stops-2015-05-05</code>). The dataset has one 
> CSV distribution (<code>bus-stops-2015-05-05.csv</code>) that is also 
> described using DCAT. The dataset is classified under the SKOS concept 
> represented by <code>mobility</code>. This concept may be defined as 
> part of a concept scheme (<code>themes</code>).</p>
>
> <p>To express the update frequency, John used the SDMX Content 
> Oriented Guidelines as described in the RDF Data Cube vocabulary 
> [[VOCAB-DATA-CUBE]]; and for the spatial and temporal coverage, URIs 
> from <a href="http://www.geonames.org/">Geonames</a> and the UK 
> government's <a 
> href="http://reference.data.gov.uk/id/interval">Interval dataset</a> 
> from data.gov.uk, respectively.</p>
>
>
> And it's not inconsistent with the advice on conneg IMO for 2 reasons:
>
> 1. Setting up conneg on the dataset URI directs you to the specific 
> distribution, which is fine.
>
> 2. Distributions and datasets are not disjoint.
>
If we say that people should use only conneg, then it is inconsistent. 
If we say they should use both, then we are recommending something that 
I don't think is generally considered a best practice, at least not yet.
>>
>>
>> 3. locale parameters
>> -- 
>
> I think the Why section is unnecessarily repetitive. A textual example 
> might also clarify things a little. I suggest:
>
> <p>Providing <a href="#locale_parameter">locale</a> parameters helps 
> humans and computer applications to work accurately with things like 
> dates, currencies and numbers that may look similar but have different 
> meanings in different locales. For example, the 'date' 4/7 can be read 
> as 7th of April or the 4th of July depending on where the data was 
> created. Similarly €2,000 is either two thousand Euros or an 
> over-precise representation of two Euros. Making the locale and 
> language explicit allows users to determine how readily they can work 
> with the data and may enable automated translation services.</p>
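For what it's worth, the 4/7 ambiguity is trivial to demonstrate (Python, illustrative only):

```python
# The same string parses to two different dates depending on the assumed locale.
from datetime import datetime

s = "4/7"
month_first = datetime.strptime(s, "%m/%d")  # US-style reading: April 7
day_first = datetime.strptime(s, "%d/%m")    # day-first reading: 4 July
print(month_first.strftime("%B %d"))  # April 07
print(day_first.strftime("%B %d"))    # July 04
```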
>
>
>
> My wording for the intended outcome:
>
> To enable humans and software agents to accurately interpret the 
> meaning of strings representing dates, times, currencies, numbers, etc.
>
>>
>> The human-readable example for the first three BPs is exactly the same.
>> Can we make the examples more specific (maybe include them in the doc
>> rather than link to one big external example)? The ttl in the
>> machine-readable example could be trimmed to just the bold parts.
>
> +1. All the data is in the HTML and TTL files, just highlight the 
> relevant bits by including those and those only in the main doc.
>
>
> Incidentally, I expect to set up conneg between those two files, yes?
>
Which two? The HTML and TTL have different content, or am I thinking of 
the wrong pair?
>
> 4A. Provide structural metadata
>
> I think the why section could be stronger:
>
> <p>Providing information about the internal structure of a 
> distribution is essential for others wishing to explore or query the 
> dataset. It also helps people to understand the meaning of the data.</p>
>
> My intended outcome wording:
>
> To enable humans to interpret the schema of a dataset and software 
> agents to automatically process distributions.
>
> NB, I removed the 2nd instance of the word schema in that sentence 
> which I think was a mistake?
>
>
>>
>>
>> 5. Licenses
>> -- 
>>
>> We say "the license of a dataset can be specified within the data". I
>> think we mean within the *metadata*.
>
> +1 Suggested rewording:
>
> <p>The presence of license information is essential for data consumers 
> to assess the usability of data. User agents may use the 
> presence/absence of license information as a trigger for inclusion or 
> exclusion of data presented to a potential consumer.</p>
>
>
>> The "Why" misuses the phrase "for example." User agent actions are not
>> an example of data consumer actions.
>> We say "Data license information can be provided as a link to a
>> human-readable license or as a link/embedded machine-readable license."
>> Since licensing info is part of metadata, and we tell people to provide
>> metadata for both humans and machines, we should also require licensing
>> info for both humans and machines.
>>
>>
>> 6. Provenance
>> -- 
>
> I think the first paragraph of the intro section can be removed and 
> the glossary link added to the 2nd, like:
>
> <p>The Web brings together business, engineering, and scientific 
> communities creating collaborative opportunities that were previously 
> unimaginable. The challenge in publishing data on the Web is providing 
> an appropriate level of detail about its origin. The <a 
> href="#data_producer">data producer</a> may not necessarily be the 
> data provider and so collecting and conveying this corresponding 
> metadata is particularly important. Without <a 
> href="#data_provenance">provenance</a>, consumers have no inherent way 
> to trust the integrity and credibility of the data being shared. Data 
> publishers in turn need to be aware of the needs of prospective 
> consumer communities to know how much provenance detail is 
> appropriate. </p>
>
>
>
>>
>> The "Why" is pretty sparse and essentially says the same thing as the
>> intended outcome. I think we could make it stronger. "Provenance is one
>> means by which consumers of a dataset judge its quality. Understanding
>> its origin and history helps one determine whether to trust the data and
>> provides important interpretive context."
>>
>
> +1
>
> My suggested wording for the intended outcome is:
>
> To enable humans to know the origin or history of the dataset and to 
> enable software agents to automatically process provenance information.
>
>
>> The example links to the metadata example page. It would be more helpful
>> to put the provenance-specific info into the BP doc itself.
>
> +1, as noted above.
>
>
>>
>>
>> 7. Quality
>> -- 
>
> Slight rewording of the intro paragraph:
>
> <p>The quality of a dataset can have a big impact on the quality of 
> applications that use it. As a consequence, the inclusion of <a 
> href="#data_quality">data quality</a> information in data publishing 
> and consumption pipelines is of primary importance. Usually, the 
> assessment of quality involves different kinds of quality dimensions, 
> each representing groups of characteristics that are relevant to 
> publishers and consumers. The Data Quality Vocabulary defines concepts 
> such as measures and metrics to assess the quality for each quality 
> dimension [[VOCAB-DQV]]. There are heuristics designed to fit specific 
> assessment situations that rely on quality indicators, namely, pieces 
> of data content, pieces of data meta-information, and human ratings 
> that give indications about the suitability of data for some intended 
> use.</p>
>
>
>
>>
>> We say "Data quality information will enable humans to know the quality
>> of the dataset and its distributions, and software agents to
>> automatically process quality information about the dataset and its
>> distributions." That's rather tautological. We could say something about
>> enabling humans to determine whether the dataset is suitable for their
>> purposes.
>
> Annette and I are in agreement here. I'd phrase the intended outcome as:
>
> To enable people and software to assess the quality and therefore 
> suitability of a dataset for their application.
>
>
>>
>> We probably should refer to DQV as a finished thing, as it will be soon.
>
> +1
>
> I suggest:
>
> <p>The machine-readable version of the dataset quality metadata may
> be provided using the Data Quality Vocabulary developed by the <abbr 
> title="Data on the Web Best Practices">DWBP</abbr> working group 
> [[VOCAB-DQV]]. </p>
>
>
>>
>> The human-readable example links to the metadata one.
>
> (which I think is deliberate?)
>
>>
>>
>> 8. Versioning
>> -- 
>
> Looking at the intro material I think I could probably find people to 
> argue that all three of those scenarios are simply corrections rather 
> than new versions. But then, as you say, there is no consensus :-)
>
> I would phrase the intended outcome as:
>
> To enable humans and software agents to easily determine which version 
> of a dataset they are working with.
I'm not sure I like the approach of non-sentences (verb phrases only) for 
these.
>
>>
>> Of the four implementation bullets, only the last is really a possible
>> approach. The first three belong in the intended outcome.
>
> Unusually, I disagree with Annette here. For me, intended outcomes are 
> short "this is what will be possible." The implementation steps are 
> how you make it so, which I think you have in this case.
>
We'll need to make the styles for these consistent. But my concern is 
not about style but about content: if a bullet states a "guideline", 
it is not describing a possible approach to implementing that guideline.
>
>>
>> The human-readable example links to the metadata one. The version
>> history there lists only 1.1, which is illogical. (1.0 must exist at
>> least.)
>>
>>
>> 9. Version history
>> -- 
>>
>> The human-readable example links to the metadata one. The version
>> history there lists only 1.1, which is illogical. (1.0 must exist at
>> least.) This example doesn't meet the requirements of the BP.
>>
>> Neither the ttl version nor the Memento example provides a full version
>> history, only a list of versions released. This BP is intended to be
>> about providing the details of what changed.
>
> Hmm... Not sure how we'd offer advice on providing machine readable 
> deltas. It's possible of course, but it's a bit of a stretch for our 
> current work and will be very context dependent. And there is more 
> info in Memento than I think is warranted.
>
> Sorry but I'm inclined to suggest that the second versioning BP is 
> dropped, but add into the first one something about at least including 
> free text description of what has changed between the various versions.
>
Don't throw the baby out with the bath water. We can toss the mention of 
machine-readable version history, but still advise people to include a 
human-readable one.
>>
>>
>> Intro to Identifiers
>> -- 
>>
>> Intro item 5 refers to an API, which could be confusing, since we talk
>> about APIs as web APIs elsewhere.
>
> OK, so how about this (shorter) alternative:
>
> De-referencing a URI triggers a computer program to run on a server 
> that may do something as simple as return a single, static file, or it 
> may carry out complex processing. Precisely what processing is carried 
> out, i.e. the software on the server, is completely independent of the 
> URI itself.
>
>
+1
>>
>>
>> 10. Persistent URIs as identifiers
>> -- 
>>
>> We say "This requires a different mindset to that used when creating a
>> Web site designed for humans to navigate their way through." When
>> creating a web site for humans to navigate, one should also consider
>> persistence, so that sentence is not strictly accurate.
>
> I agree. OK, delete that sentence so it's just:
>
> To be persistent, URIs must be designed as such. A lot has been 
> written on this topic, see, for example, the European Commission's 
> Study on Persistent URIs [PURI] which in turn links to many other 
> resources.
>
> I'll come back to the remainder of Annette's comments tomorrow as I 
> can see they need more energy than I can muster this evening.
>
> Phil.
>
>
>
>
>>
>> The example uses the city domain instead of the transport agency's
>> domain, which is not realistic for a large city. The agency domain is
>> likely to persist as long as the information it makes available is
>> relevant. Try Googling "transit agency" and see what comes up for domain
>> names. The issue depends on how stable the transit service is. For a
>> small town, the transit function might not be given over to a separate
>> agency, and the guidance would be right, but for a big city, where the
>> transit function is run by an independent agency, it's not realistic.
>>
>> The example is rather redundant. It is data.mycity..., and yet /dataset
>> also appears in the path. The path also contains /bus as well as
>> /bus-stops. It's unlikely that the agency has so many transit modes that
>> they need to be split between road and rail and water. The same info is
>> conveyed as well by the much shorter
>> http://data.mycitytransit.example.org/bus/stops
>>
>> We say "Ideally, the relevant Web site includes a description of the
>> process..." I think we mean a controlled scheme.
>>
>>
>> 11. Persistent URIs within datasets
>> -- 
>>
>> The word "affordances" is misused. Affordances are how we know what
>> something is intended to do, not what the thing does. Affordances do not
>> act on things, they inform.
>>
>> The intended outcome should be a free-standing piece of text. Starting
>> with "that one item" is confusing.
>>
>> Much of the implementation section is about minting new URIs, which is
>> the subject of the previous BP. It is off topic here. Everything from
>> "If you can't find an existing set of identifiers that meet your needs,
>> you'll need to create your own" down to the end of the example doesn't
>> belong in a BP that is about using other people's identifiers.
>>
>> The last paragraph of the example is almost exactly the same as the last
>> paragraph before the example.
>>
>>
>> 12. URIs for versions and series
>> -- 
>>
>> This BP is confusing two issues. One is the use of a shorter URI for the
>> latest version of a dataset while also assigning a version-specific URI
>> for it. The other issue is making a landing page for a collection of
>> datasets. The initial intent was the former.
>>
>> The examples in the Why aren't series or groups except for the first
>> item, yet they are introduced as examples of series or groups.
>>
>> How to Test says to check "that logical groups of datasets are also
>> identifiable." That is vague. It should say "that a URI is also provided
>> for the latest version or most recent real-time value."
>>
>> I don't think this applies to time series. What we're talking about here
>> is use of dates for version identifiers.
>>
>> The example is incomplete; it doesn't say what the latest version URI
>> would be.
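To make the first issue concrete, the pattern I have in mind looks like this (Python sketch; the base URI and version dates are hypothetical):

```python
# A short, stable URI for the latest version alongside dated, version-specific
# URIs. The base URI and the version dates below are hypothetical.
VERSIONS = {"bus-stops": ["2015-04-01", "2015-05-05"]}
BASE = "http://data.mycity.example.com/dataset"

def version_uri(dataset, version):
    """Version-specific URI for one dated release of a dataset."""
    return "%s/%s-%s" % (BASE, dataset, version)

def latest_uri(dataset):
    """URI the short, un-versioned dataset URI should redirect (e.g. 301) to."""
    return version_uri(dataset, max(VERSIONS[dataset]))  # ISO dates sort lexically

print(latest_uri("bus-stops"))
# http://data.mycity.example.com/dataset/bus-stops-2015-05-05
```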
>>
>

-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory

Received on Thursday, 21 April 2016 22:21:02 UTC