- From: Annette Greiner <amgreiner@lbl.gov>
- Date: Thu, 21 Apr 2016 15:20:32 -0700
- To: Phil Archer <phila@w3.org>, DWBP Public List <public-dwbp-wg@w3.org>
Good stuff, Phil. I think we still need to hash out our differences
over conneg. I addressed that in another note, though. See a few small
notes inline below.
-Annette

On 4/18/16 2:28 PM, Phil Archer wrote:
> Some comments below on Annette's comments before I do the native
> speaker review on the top half of the doc; I don't want to replicate
> Annette's work.
>
> On 15/04/2016 03:12, Annette Greiner wrote:
> [..]
>>
>> General issues
>> --
>>
>> Possible approaches to implementation should not include the word
>> "should". That implies normativeness. This is a general issue with
>> implementation sections. We say in the Audience section that "The
>> normative element of each best practice is the intended outcome."
>>
>> Subtitles should all be written in the same mode. (Mine were written
>> in imperative -- "do this, don't do that", but most are declarative
>> -- "this should be done".) I think imperative is better, because it
>> gets away from RFC2119 keywords, which we voted not to use. It
>> becomes a call to action, which is our goal, right?
>
> +1
>
>>
>> 1. provide metadata
>> --
>>
>> The intended outcome is "Human-readable metadata will enable humans
>> to understand the metadata and machine-readable metadata will
>> enable computer applications, notably user agents, to process the
>> metadata." This is tautological. Metadata is necessary because,
>> without it, the data will have no context or meaning.
>
> +1
>
> I'd write the intended outcome as simply:
>
> Humans and machines are able to understand the data.
>
>> Possible approach to implementation should not include the word
>> "should". Also, I disagree that "If multiple formats are published
>> separately, they should be served from the same URL using content
>> negotiation." Publishing multiple files is also reasonable, and
>> it's even what we used in all our examples about metadata.
>> (In BP2, the machine readable example gives the name of the
>> distribution as bus-stops-2015-05-05.csv; in BP4, the entire URI is
>> given, ending in .csv, etc.)
>
> I think BP21 (#conneg) gets it right. You assign a URI to the
> dataset and use conneg to return whatever is the most appropriate
> version. However, you *also* provide direct URIs for each version,
> that bypass the conneg. So you'd have
>
> http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops
>
> that returned either CSV or JSON, but if you want one of those
> specifically, then you go to
>
> http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops.csv
>
> etc.
>
>>
>> 2. descriptive metadata
>> --
>
> I would word the intended outcome as:
>
> Humans and machines can discover the dataset; humans can understand
> the nature of the data.
>
>> There is an inconsistency between the suggestion that one should
>> use content negotiation for different formats (csv vs. rdf) and
>> the . :mobility and :themes are referred to as URIs, but they are
>> not URIs. (I know DCAT did this, but I think it's a mistake; colons
>> are not legal in the first segment of a relative URI.)
>
> I see the point. I'm so used to seeing that notation from Turtle.
> I'd slightly reword that section and take out the colons as they
> refer specifically to Turtle representation. I end up with:
>
> <p>The example below shows how to use [[VOCAB-DCAT]] to provide the
> machine readable <strong>discovery</strong> metadata for the bus
> stops dataset (<code>bus-stops-2015-05-05</code>). The dataset has
> one CSV distribution (<code>bus-stops-2015-05-05.csv</code>) that is
> also described using DCAT. The dataset is classified under the SKOS
> concept represented by <code>mobility</code>.
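The BP21 arrangement described above -- one negotiated dataset URI plus direct per-format URIs that bypass conneg -- might be sketched server-side roughly as follows. The paths and media types here are illustrative only, not taken from the BP doc, and real conneg would also honour Accept q-values:

```python
# Sketch of BP21-style conneg: one dataset URI negotiated via the
# Accept header, plus per-format URIs that bypass negotiation.
# Paths and media types are illustrative only.

FORMATS = {
    "text/csv": ".csv",
    "application/json": ".json",
}

def resolve(path, accept="*/*"):
    """Return the distribution path to serve for a request."""
    # A format-specific URI (e.g. .../bus-stops.csv) bypasses conneg.
    for ext in FORMATS.values():
        if path.endswith(ext):
            return path
    # Otherwise negotiate: first listed Accept type we can serve wins.
    # (Real conneg also weighs q-values; omitted for brevity.)
    for media_type in accept.split(","):
        media_type = media_type.split(";")[0].strip()
        if media_type in FORMATS:
            return path + FORMATS[media_type]
    return path + ".csv"  # default distribution

base = "/public-transport/road/bus/dataset/bus-stops"
print(resolve(base, "application/json"))  # negotiated -> .json
print(resolve(base + ".csv"))             # direct URI, no negotiation
```

Either way the dataset URI and the distribution URIs can coexist, which is the point of Phil's two-reasons argument below.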
> This concept may be defined as part of a concept scheme
> (<code>:themes</code>).</p>
>
> <p>To express the update frequency, John used the SDMX Content
> Oriented Guidelines as described in the RDF Data Cube vocabulary
> [[VOCAB-DATA-CUBE]]; and for the spatial and temporal coverage, URIs
> from <a href="http://www.geonames.org/">Geonames</a> and the UK
> government's <a
> href="http://reference.data.gov.uk/id/interval">Interval dataset</a>
> from data.gov.uk, respectively.</p>
>
> And it's not inconsistent with the advice on conneg IMO for 2
> reasons:
>
> 1. Setting up conneg on the dataset URI directs you to the specific
> distribution, which is fine.
>
> 2. Distributions and datasets are not disjoint.

If we say that people should use only conneg, then it is
inconsistent. If we say they should use both, then we are
recommending something that I don't think is generally considered a
best practice, at least not yet.

>>
>> 3. locale parameters
>> --
>
> I think the Why section is unnecessarily repetitive. A textual
> example might also clarify things a little. I suggest:
>
> <p>Providing <a href="#locale_parameter">locale</a> parameters helps
> humans and computer applications to work accurately with things like
> dates, currencies and numbers that may look similar but have
> different meanings in different locales. For example, the 'date' 4/7
> can be read as 7th of April or the 4th of July depending on where
> the data was created. Similarly, €2,000 is either two thousand Euros
> or an over-precise representation of two Euros. Making the locale
> and language explicit allows users to determine how readily they can
> work with the data and may enable automated translation
> services.</p>
>
> My wording for the intended outcome:
>
> To enable humans and software agents accurately to interpret the
> meaning of strings representing dates, times, currencies and numbers
> etc.
>
>> The human-readable example for the first three BPs is exactly the
>> same.
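The 4/7 ambiguity in the proposed Why text above is easy to demonstrate concretely. In this small illustration the two format strings stand in for the conventions of two different locales; without a declared locale, neither parse can be called wrong:

```python
from datetime import datetime

# The same string "4/7/2015" parses to two different dates depending
# on which locale's convention you assume -- exactly the ambiguity
# the proposed Why text describes.
us = datetime.strptime("4/7/2015", "%m/%d/%Y")   # US convention: April 7th
uk = datetime.strptime("4/7/2015", "%d/%m/%Y")   # UK convention: 4th of July

print(us.date())  # 2015-04-07
print(uk.date())  # 2015-07-04
```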
>> Can we make the examples more specific (maybe include them in the
>> doc rather than link to one big external example)? The ttl in the
>> machine-readable example could be trimmed to just the bold parts.
>
> +1. All the data is in the HTML and TTL files, just highlight the
> relevant bits by including those and those only in the main doc.
>
> Incidentally, I expect to set up conneg between those two files, yes?

Which two? The html and ttl have different content, or am I thinking
of the wrong pair?

> 4A. Provide structural metadata
>
> I think the why section could be stronger:
>
> <p>Providing information about the internal structure of a
> distribution is essential for others wishing to explore or query the
> dataset. It also helps people to understand the meaning of the
> data.</p>
>
> My intended outcome wording:
>
> To enable humans to interpret the schema of a dataset and software
> agents to automatically process distributions.
>
> NB, I removed the 2nd instance of the word schema in that sentence,
> which I think was a mistake?
>
>>
>> 5. Licenses
>> --
>>
>> We say "the license of a dataset can be specified within the data".
>> I think we mean within the *metadata*.
>
> +1. Suggested rewording:
>
> <p>The presence of license information is essential for data
> consumers to assess the usability of data. User agents may use the
> presence/absence of license information as a trigger for inclusion
> or exclusion of data presented to a potential consumer.</p>
>
>> The "Why" misuses the phrase "for example." User agent actions are
>> not an example of data consumer actions.
>>
>> We say "Data license information can be provided as a link to a
>> human-readable license or as a link/embedded machine-readable
>> license." Since licensing info is part of metadata, and we tell
>> people to provide metadata for both humans and machines, we should
>> also require licensing info for both humans and machines.
>>
>> 6. Provenance
>> --
>
> I think the first paragraph of the intro section can be removed and
> the glossary link added to the 2nd, like:
>
> <p>The Web brings together business, engineering, and scientific
> communities creating collaborative opportunities that were
> previously unimaginable. The challenge in publishing data on the Web
> is providing an appropriate level of detail about its origin. The <a
> href="#data_producer">data producer</a> may not necessarily be the
> data provider and so collecting and conveying this corresponding
> metadata is particularly important. Without <a
> href="#data_provenance">provenance</a>, consumers have no inherent
> way to trust the integrity and credibility of the data being shared.
> Data publishers in turn need to be aware of the needs of prospective
> consumer communities to know how much provenance detail is
> appropriate.</p>
>
>> The "Why" is pretty sparse and essentially says the same thing as
>> the intended outcome. I think we could make it stronger.
>> "Provenance is one means by which consumers of a dataset judge its
>> quality. Understanding its origin and history helps one determine
>> whether to trust the data and provides important interpretive
>> context."
>
> +1
>
> My suggested wording for the intended outcome is:
>
> To enable humans to know the origin or history of the dataset and to
> enable software agents to automatically process provenance
> information.
>
>> The example links to the metadata example page. It would be more
>> helpful to put the provenance-specific info into the BP doc itself.
>
> +1, as noted above.
>
>>
>> 7. Quality
>> --
>
> Slight rewording of the intro paragraph:
>
> <p>The quality of a dataset can have a big impact on the quality of
> applications that use it. As a consequence, the inclusion of <a
> href="#data_quality">data quality</a> information in data publishing
> and consumption pipelines is of primary importance.
> Usually, the assessment of quality involves different kinds of
> quality dimensions, each representing groups of characteristics that
> are relevant to publishers and consumers. The Data Quality
> Vocabulary defines concepts such as measures and metrics to assess
> the quality for each quality dimension [[VOCAB-DQV]]. There are
> heuristics designed to fit specific assessment situations that rely
> on quality indicators, namely, pieces of data content, pieces of
> data meta-information, and human ratings that give indications about
> the suitability of data for some intended use.</p>
>
>> We say "Data quality information will enable humans to know the
>> quality of the dataset and its distributions, and software agents
>> to automatically process quality information about the dataset and
>> its distributions." That's rather tautological. We could say
>> something about enabling humans to determine whether the dataset is
>> suitable for their purposes.
>
> Annette and I are in agreement here. I'd phrase the intended outcome
> as:
>
> To enable people and software to assess the quality and therefore
> suitability of a dataset for their application.
>
>> We probably should refer to DQV as a finished thing, as it will be
>> soon.
>
> +1
>
> I suggest:
>
> <p>The machine readable version of the dataset quality metadata may
> be provided using the Data Quality Vocabulary developed by the <abbr
> title="Data on the Web Best Practices">DWBP</abbr> working group
> [[VOCAB-DQV]].</p>
>
>> The human-readable example links to the metadata one.
>
> (which I think is deliberate?)
>
>>
>> 8. Versioning
>> --
>
> Looking at the intro material I think I could probably find people
> to argue that all three of those scenarios are simply corrections
> rather than new versions.
> But then, as you say, there is no consensus :-)
>
> I would phrase the intended outcome as:
>
> To enable humans and software agents to easily determine which
> version of a dataset they are working with.

I'm not sure I like the approach of nonsentences (verb phrase only)
for these.

>>
>> Of the four implementation bullets, only the last is really a
>> possible approach. The first three belong in the intended outcome.
>
> Unusually, I disagree with Annette here. For me, intended outcomes
> are short "this is what will be possible." The implementation steps
> are how you make it so, which I think you have in this case.
> We'll need to make the styles for these consistent.

But my concern is not about style but rather about content. If we are
giving "guidelines", we are not giving a sample approach for how to
implement those guidelines.

>>
>> The human-readable example links to the metadata one. The version
>> history there lists only 1.1, which is illogical. (1.0 must exist
>> at least.)
>>
>> 9. Version history
>> --
>>
>> The human-readable example links to the metadata one. The version
>> history there lists only 1.1, which is illogical. (1.0 must exist
>> at least.) This example doesn't meet the requirements of the BP.
>>
>> Neither the ttl version nor the Memento example provides a full
>> version history, only a list of versions released. This BP is
>> intended to be about providing the details of what changed.
>
> Hmm... Not sure how we'd offer advice on providing machine readable
> deltas. It's possible of course, but it's a bit of a stretch for our
> current work and will be very context dependent. And there is more
> info in Memento than I think is warranted.
>
> Sorry, but I'm inclined to suggest that the second versioning BP is
> dropped, but add into the first one something about at least
> including a free text description of what has changed between the
> various versions.

Don't throw the baby out with the bath water.
We can toss the mention of machine-readable version history, but
still advise people to include a human-readable one.

>>
>> Intro to Identifiers
>> --
>>
>> Intro item 5 refers to an API, which could be confusing, since we
>> talk about APIs as web APIs elsewhere.
>
> OK, so how about this (shorter) alternative:
>
> De-referencing a URI triggers a computer program to run on a server
> that may do something as simple as return a single, static file, or
> it may carry out complex processing. Precisely what processing is
> carried out, i.e. the software on the server, is completely
> independent of the URI itself.

+1

>>
>> 10. Persistent URIs as identifiers
>> --
>>
>> We say "This requires a different mindset to that used when
>> creating a Web site designed for humans to navigate their way
>> through." When creating a web site for humans to navigate, one
>> should also consider persistence, so that sentence is not strictly
>> accurate.
>
> I agree. OK, delete that sentence so it's just:
>
> To be persistent, URIs must be designed as such. A lot has been
> written on this topic; see, for example, the European Commission's
> Study on Persistent URIs [PURI], which in turn links to many other
> resources.
>
> I'll come back to the remainder of Annette's comments tomorrow as I
> can see they need more energy than I can muster this evening.
>
> Phil.
>
>> The example uses the city domain instead of the transport agency's
>> domain, which is not realistic for a large city. The agency domain
>> is likely to persist as long as the information it makes available
>> is relevant. Try Googling "transit agency" and see what comes up
>> for domain names. The issue depends on how stable the transit
>> service is. For a small town, the transit function might not be
>> given over to a separate agency, and the guidance would be right,
>> but for a big city, where the transit function is run by an
>> independent agency, it's not realistic.
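Phil's point above that the processing behind a URI is completely independent of the URI itself is also what makes persistence practical: the persistent URI can stay fixed while the resource behind it moves. A toy sketch of that indirection (all paths hypothetical, not from the BP doc):

```python
# Toy illustration: a persistent URI keeps working across a server
# reorganisation via a redirect table, because what software answers
# a URI is independent of the URI itself. All paths are hypothetical.

REDIRECTS = {
    # persistent URI -> current location of the resource
    "/dataset/bus-stops": "/v2/transport/bus-stops",
}

def dereference(path):
    """Return (HTTP status, location) for a requested path."""
    if path in REDIRECTS:
        # Resource moved, but the old URI still resolves.
        return 301, REDIRECTS[path]
    return 200, path  # served directly

print(dereference("/dataset/bus-stops"))  # (301, '/v2/transport/bus-stops')
print(dereference("/about"))              # (200, '/about')
```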
>>
>> The example is rather redundant. It is data.mycity..., and yet
>> /dataset also appears in the path. The path also contains /bus as
>> well as /bus-stops. It's unlikely that the agency has so many
>> transit modes that they need to be split between road and rail and
>> water. The same info is conveyed as well by the much shorter
>> http://data.mycitytransit.example.org/bus/stops
>>
>> We say "Ideally, the relevant Web site includes a description of
>> the process..." I think we mean a controlled scheme.
>>
>> 11. Persistent URIs within datasets
>> --
>>
>> The word "affordances" is misused. Affordances are how we know what
>> something is intended to do, not what the thing does. Affordances
>> do not act on things; they inform.
>>
>> The intended outcome should be a free-standing piece of text.
>> Starting with "that one item" is confusing.
>>
>> Much of the implementation section is about minting new URIs, which
>> is the subject of the previous BP. It is off topic here. Everything
>> from "If you can't find an existing set of identifiers that meet
>> your needs, you'll need to create your own" down to the end of the
>> example doesn't belong in a BP that is about using other people's
>> identifiers.
>>
>> The last paragraph of the example is almost exactly the same as the
>> last paragraph before the example.
>>
>> 12. URIs for versions and series
>> --
>>
>> This BP is confusing two issues. One is the use of a shorter URI
>> for the latest version of a dataset while also assigning a
>> version-specific URI for it. The other issue is making a landing
>> page for a collection of datasets. The initial intent was the
>> former.
>>
>> The examples in the Why aren't series or groups except for the
>> first item, yet they are introduced as examples of series or
>> groups.
>>
>> How to Test says to check "that logical groups of datasets are also
>> identifiable." That is vague.
>> It should say "that a URI is also provided for the latest version
>> or most recent real-time value."
>>
>> I don't think this applies to time series. What we're talking about
>> here is use of dates for version identifiers.
>>
>> The example is incomplete; it doesn't say what the latest version
>> URI would be.

--
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
Received on Thursday, 21 April 2016 22:21:02 UTC