Re: partial review from Annette Greiner on 2016-04-22 (public-dwbp-wg@w3.org from April 2016)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Fri, 22 Apr 2016 14:46:47 -0700
To: Phil Archer <phila@w3.org>, DWBP Public List <public-dwbp-wg@w3.org>
Message-ID: <571A9BC7.9000000@lbl.gov>
Hi Phil, here are my comments on your comments on my comments. :)
-Annette

On 4/19/16 4:53 AM, Phil Archer wrote:
> OK, coming back to this...
>
>>
>> On 15/04/2016 03:12, Annette Greiner wrote:
> [..]
>
>>>
>>> 10. Persistent URIs as identifiers
>>> -- 
>
>>>
>>> The example uses the city domain instead of the transport agency's
>>> domain, which is not realistic for a large city. The agency domain is
>>> likely to persist as long as the information it makes available is
>>> relevant. Try Googling "transit agency" and see what comes up for
>>> domain names. The issue depends on how stable the transit service is. For a
>>> small town, the transit function might not be given over to a separate
>>> agency, and the guidance would be right, but for a big city, where the
>>> transit function is run by an independent agency, it's not realistic.
>
> I've no doubt that's the case, but we do need to get people away from 
> putting agency names in domain names if they are to persist. Transport 
> agencies are forever changing their name. UK gov provides an example 
> of domain names tied to function rather than agency names.
>
Hm, that doesn't seem to be an issue with transit agencies this side of 
the Atlantic, but in any case the larger point is valid. I think I will 
just assume we are talking about a relatively small city and that it's 
reasonable to use the city's domain for this.
>>>
>>> The example is rather redundant. It is data.mycity..., and yet /dataset
>>> also appears in the path. The path also contains /bus as well as
>>> /bus-stops. It's unlikely that the agency has so many transit modes
>>> that they need to be split between road and rail and water.
>
> True, but a government will. And a national science data service (like 
> Australia's ANDS) covers multiple disciplines.
>
A city government won't, and that's what we're talking about in our 
example. If we keep the word transport out of the domain, then I think 
that would logically go in the path, as you had it. How many road 
transit modes do you expect it to have? Maybe two or three? How many 
modes would it have total? Maybe ten tops? There's no benefit to adding 
an extra breakdown for the modes. It's generally better to keep these 
hierarchies relatively flat and keep URIs short, so that there is no 
issue of what directory something is in today. Extra segments in a URI 
just make it hard to remember and share.
>>> The same info is
>>> conveyed as well by the much shorter
>>> http://data.mycitytransit.example.org/bus/stops
>
> Hmm, maybe.
>
> http://data.mycity.example.org/id/bus/stops/01
> would be an ID for a bus stop, by which I mean the physical thing.
>
> http://data.mycity.example.org/doc/bus/stops/01
>
> would be the ID for a document about the bus stop.

> http://data.mycity.example.org/dataset/bus/stops/
>
> Is the ID for a dataset describing all the bus stops.

> So I'd say that, OK, we could possibly remove the /public-transport 
> path segment, and /bus/, but the shorter the URI, the more chance 
> there is of it being a problem when a new dataset comes along.
>
> The URIs from this BP are used throughout the doc so changing them 
> would mean changing them all. Search and replace could take care of a 
> lot of that but there may be awkward edge cases in the doc.
>
> I leave it to the editors to decide whether to use the shorter URIs.
>
What we have now is already inconsistent, so this is work that needs to 
be done either way.
I think we could go with something like this:
http://data.mycity.example.org/transport/ is a base for all the example URIs.
Probably a real link would need to identify the dataset somehow rather 
than just say that it's a dataset.
What do you think about this?
http://data.mycity.example.org/transport/timetables/bus/stops/
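
As a rough sketch of how a single base could keep the example URIs consistent (the helper name example_uri is hypothetical, not something in the document):

```python
from urllib.parse import urljoin

# Base proposed above for all the example URIs.
BASE = "http://data.mycity.example.org/transport/"


def example_uri(relative_path: str) -> str:
    """Join a relative path onto the shared base (hypothetical helper)."""
    return urljoin(BASE, relative_path)


print(example_uri("timetables/bus/stops/"))
```

Because the base ends in a slash, urljoin appends the relative path instead of replacing the last segment, so every example URI stays under /transport/.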

>>>
>>> We say "Ideally, the relevant Web site includes a description of the
>>> process..." I think we mean a controlled scheme.
>
> In context, I'd say the current text is OK:
>
> "Check that each dataset in question is identified using a URI that 
> has been assigned under a controlled process as set out in the 
> previous section. Ideally, the relevant Web site includes a 
> description of the process and a credible pledge of persistence should 
> the publisher no longer be able to maintain the URI space themselves."
>
The process for assigning a URI might be "the page editor suggests 
something, then the web site manager approves it." What I think you are 
talking about is a scheme.
>
>
>>>
>>>
>>> 11. Persistent URIs within datasets
>>> -- 
>>>
>>> The word "affordances" is misused. Affordances are how we know what
>>> something is intended to do, not what the thing does. Affordances do
>>> not act on things, they inform.
>
> That's what comes of me writing text about something that, as you 
> instantly notice, I don't know a lot about.
>
> We could remove the word affordance altogether like:
>
> "These ideas are at the heart of the 5 Stars of Linked Data where one 
> data point links to another, and of Hypermedia where links may be to 
> further data or to services that can act on or relate to the data in 
> some way."
>
+1
>>>
>>> The intended outcome should be a free-standing piece of text. Starting
>>> with "that one item" is confusing.
>
> OK, I have removed the word 'that' so it just reads:
>
> One data item can be related to others across the Web creating a 
> global information space accessible to humans and machines alike.
>
+1, as long as you put a comma after "Web" ;^)
>
>>>
>>> Much of the implementation section is about minting new URIs, which is
>>> the subject of the previous BP. It is off topic here. Everything from
>>> "If you can't find an existing set of identifiers that meet your needs,
>>> you'll need to create your own" down to the end of the example doesn't
>>> belong in a BP that is about using other people's identifiers.
>
> Hmm, I disagree a little. The first BP is about persistent URIs for 
> datasets, the second about persistent URIs within the data itself. It 
> talks about using other people's sets for obvious things and then goes 
> into aspects of URI design that are not covered by, or relevant to, the 
> previous BP. There's info in that second one that I wouldn't like to 
> see lost but, since I wrote it, I am too close to be a good judge.
>
I don't mind the sentence that says to mint your own persistent IDs if 
you can't find existing ones. I think the info in the example about URIs 
with /id in them should be left out, though, as they are confusing to 
people who aren't familiar with semweb, and we are talking about posting 
data rather than creating URIs for things. The example could mention a 
usage of an existing URI for one element and usage of a self-made URI 
for some other element.
>
>>>
>>> The last paragraph of the example is almost exactly the same as the
>>> last paragraph before the example.
>
> Correct. I have deleted it in my native speaker review copy.
>
>>>
>>>
>>> 12.  URIs for versions and series
>>> -- 
>
> My suggested rewording for the intended outcome:
>
> To enable references to a specific version of a dataset and to 
> concepts such as a 'dataset series' and 'the latest version'.
>
+1
>
>>>
>>> This BP is confusing two issues. One is the use of a shorter URI for
>>> the latest version of a dataset while also assigning a
>>> version-specific URI for it. The other issue is making a landing page
>>> for a collection of datasets. The initial intent was the former.
>
> I don't see any reference to landing pages for collections of 
> datasets. I do think that the example could be improved slightly 
> though, like this:
>
The URI for a 'dataset series' would be a landing page for the series, no?
For example, a series like 'houses built in MyCity by decade' would not 
have a 'current' value, so it wouldn't make sense to have 
http://mycity.example.com/housing/new-houses/ point to anything other 
than an index of the available decadal datasets.
> <p>Suppose that a new bus stop is created. To keep 
> <code>bus-stops-2015-05-05 </code> up to date, a new version of the 
> dataset (<code>bus-stops-2015-12-17</code>) is created. 
> <code>bus-stops-2015-12-17 </code> includes all the data from 
> <code>bus-stops-2015-05-05 </code> plus the data about the new bus 
> stop. The two versions can be identified by the following URIs: </p>
> <p><code>http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-05-05</code> 
> is the versioned URI of the first version of the dataset</p>
> <p><code>http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-12-17</code> 
> is the versioned URI of the updated version of the dataset</p>
> <p><code>http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops</code> 
> always resolves to the latest version, so it resolved to 
> <code>bus-stops-2015-05-05</code> <em>until</em> 17 December 2015 when 
> the server configuration was updated to point that URL to 
> <code>bus-stops-2015-12-17</code>.</p>
>
right on. +1
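
For what it's worth, the behaviour in that last paragraph could be sketched roughly like this. It is only an illustration, assuming versions are identified by an ISO date suffix as in the example; the function name and the list of known versions are made up, and a real server would more likely handle this with a redirect or rewrite rule.

```python
from datetime import date

# Base path taken from the example URIs quoted above.
BASE = "http://data.mycity.example.com/public-transport/road/bus/dataset/"


def resolve_latest(series, version_dates, on):
    """Return the versioned URI that the unversioned series URI resolves to on a given day.

    Hypothetical helper: version_dates is a list of ISO date strings,
    one per published version of the dataset.
    """
    # Only versions already published on the given day are candidates.
    published = [d for d in version_dates if date.fromisoformat(d) <= on]
    if not published:
        raise LookupError(f"no version of {series} published by {on}")
    newest = max(published, key=date.fromisoformat)
    return f"{BASE}{series}-{newest}"


versions = ["2015-05-05", "2015-12-17"]
print(resolve_latest("bus-stops", versions, date(2015, 12, 16)))
print(resolve_latest("bus-stops", versions, date(2015, 12, 17)))
```

The first call gives the bus-stops-2015-05-05 URI and the second gives bus-stops-2015-12-17, while each versioned URI stays valid on its own.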
>>>
>>> The examples in the "Why" section aren't series or groups except for the first
>>> item, yet they are introduced as examples of series or groups.
>
> True, I offer this as a better alternative:
>
> <ul>
>   <li>bus stops in My City (that change over time);</li>
>   <li>a list of elected officials in My City;</li>
>   <li>evolving versions of a document through to completion.</li>
> </ul>
>
> I suggest this sentence "<p>In different circumstances, it will be 
> appropriate to refer separately to each of these examples (and many 
> like them).</p>" is replaced with
>
> <p>In different circumstances, it will be appropriate to refer to the 
> current situation (the current set of bus stops, the current elected 
> officials etc.). In others, it may be appropriate to refer to the 
> situation as it exists/existed at a specific time.</p>
>
+1. I would just say existed rather than exists/existed, so the phrase 
contrasts clearly with the current situations.
>>>
>>> How to Test says to check "that logical groups of datasets are also
>>> identifiable." That is vague. It should say "that a URI is also
>>> provided for the latest version or most recent real-time value."
>
> I would phrase it as:
>
> Check that each version of a dataset has its own URI, and that there 
> is also a 'latest version' URI.
>
>
+1
>>>
>>> I don't think this applies to time series. What we're talking about
>>> here is use of dates for version identifiers.
>>>
>>> The example is incomplete; it doesn't say what the latest version URI
>>> would be.
>>>
>
> Yep, that's what I fixed above.
>
> OK, I think we're done with this round.
>
> Thanks again Annette - that kind of careful review is critical.
>
> Phil.
>

-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
Received on Friday, 22 April 2016 21:47:14 UTC