Re: Data Identification section (was Re: reviewing the BP doc)

On 06/08/2015 21:19, Annette Greiner wrote:
> Hi Phil,
> Thanks for responding to my comments.
> Re the question of how to handle the URL/URI split, I suggest we just use URI uniformly. In fact, the section in question appears to do just that. As for IRI, I don’t see that appearing anywhere in the latest published draft. The DOI issue I raised goes away if we remove the introduction to URIs/URLs/IRIs.

OK, I'll use Occam's Razor to edit that out and offer a pull request.

> Re keeping the implementation suggestions to info about publishing data, I don’t see why you don’t see how we can avoid talking about everything else. I don’t think our document would suffer at all from the removal of a bullet point about using 303 redirects for real-world objects, like Alice Brown. We are not talking about publishing people on the web.

Oh, I think we might be. If I want to publish data about people I need
an identifier for the person and another for the data about that
person, and the two need to be related in some way. A 303 redirect is
the usual way of doing that.
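
As a rough illustration of the 303 pattern, here is a minimal sketch. The /id/ and /doc/ path layout, the function name and the media type are all assumptions made up for this example, not anything taken from the BP document.

```python
# Hypothetical URI layout (an assumption for this sketch):
#   /id/<name>   identifies the thing itself, e.g. a person
#   /doc/<name>  identifies the data about that thing
def respond(path):
    """Return (status, headers) for a request path."""
    if path.startswith("/id/"):
        name = path[len("/id/"):]
        # A person cannot be sent over HTTP, so answer 303 See Other
        # and point at the document that describes them.
        return 303, {"Location": "/doc/" + name}
    if path.startswith("/doc/"):
        # The description is an ordinary Web document.
        return 200, {"Content-Type": "text/turtle"}
    return 404, {}

print(respond("/id/alice"))
print(respond("/doc/alice"))
```

The point of the redirect is that the two identifiers stay distinct but a client that starts from the person's identifier can still find the data.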

This is one of the hoary issues that the Spatial Data WG is going to 
need to address sooner or later. In the GIS world, they only have IDs 
for the digital object and have no identifier for the thing itself. That 
doesn't work on the Web.

That's why I think we do need to offer advice on identifiers: what 
exactly is being identified?

> I’m also noticing that other bullets bother me for other reasons. “Re-use existing identifiers” is one. I’m not sure what the intention is. Surely we don’t want to suggest that publishers use the same identifier for more than one data set.

No. The problem I have here is that I'm trying not to repeat the whole 
of earlier work on persistent URI design and just include the bullet points.

What I mean is, if you have an existing set of identifiers for things, 
maybe a set like Cenozoic, Mesozoic, Palaeozoic, Precambrian, then they 
should be reused in your URI scheme:

rather than

> “Link multiple representations” is another. I don’t see why we would
> recommend using the url rather than query string or content negotiation
> to indicate formats.

Conneg, yes; query string, no. That's because query strings are
typically tied to a specific technology (usually an SQL server) and
therefore change when the underlying DB is changed.

Path segments in URIs can be seen as parameters that can be translated 
into any number of different queries in any number of languages.
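
To make that concrete, here is a small sketch of one path being turned into two different back-end lookups. The /data/<dataset>/<year> layout, the table name and the file layout are hypothetical, chosen only for illustration.

```python
def parse_path(path):
    """Split a hypothetical /data/<dataset>/<year> path into parameters."""
    _, dataset, year = path.strip("/").split("/")
    return {"dataset": dataset, "year": int(year)}

def to_sql(params):
    # Parameterised query for a relational back end (table name is made up).
    return ("SELECT * FROM observations WHERE dataset = ? AND year = ?",
            (params["dataset"], params["year"]))

def to_file(params):
    # The same parameters mapped onto a flat-file store instead.
    return "{dataset}/{year}.csv".format(**params)

params = parse_path("/data/rainfall/2015")
print(to_sql(params))
print(to_file(params))
```

The URI stays the same whichever back end answers the request, which is the property a query string tied to one database tends to lose.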

> That rule disagrees directly with the last one, “avoid file extensions”.

Yeah - lack of context makes your statement true.

> If the intention is to remind publishers to make both representations
> available, that is in a different BP.


> I think I disagree with “Use a dedicated service (i.e., independent of
> the data originator)”, if I’m interpreting it correctly. If I publish
> data from Lawrence Berkeley National Laboratory, I think it is best
> practice for me to publish it on a server managed by the Laboratory.

So do I. But the domain name should not include the name Lawrence
Berkeley National Laboratory or the project you were working on at the
time, so that if the lab changes its name, or the project ends, or any
other such circumstance arises, the domain name persists and the whole
thing can be transferred to a new owner if need be.

> I do think it’s wise to use a reliable service, if that’s the idea.
> Why do we say to “avoid version numbers” for data? There is a BP just
> below that says to assign URIs to dataset versions.

Again, a précis too far. Separate IDs for separate versions *and* one
for the series as a whole *and* one that always points to the latest.
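
A minimal sketch of that three-way scheme follows. The dataset name, the path layout and the release dates are invented for the example; dated versions are used here only because ISO dates sort correctly as plain strings.

```python
# Hypothetical release dates for one dataset series.
VERSIONS = ["2015-01-01", "2015-04-01", "2015-07-01"]

def resolve(path):
    """Say what a /dataset/bus-stops[/...] path identifies, or None."""
    parts = path.strip("/").split("/")
    if parts[:2] != ["dataset", "bus-stops"]:
        return None
    if len(parts) == 2:
        # The series as a whole.
        return {"kind": "series", "versions": VERSIONS}
    if len(parts) == 3 and parts[2] == "latest":
        # Always points to the most recent version.
        return {"kind": "version", "version": max(VERSIONS)}
    if len(parts) == 3 and parts[2] in VERSIONS:
        # One specific, fixed version.
        return {"kind": "version", "version": parts[2]}
    return None

print(resolve("/dataset/bus-stops"))
print(resolve("/dataset/bus-stops/latest"))
print(resolve("/dataset/bus-stops/2015-04-01"))
```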

> Autoincrement is often useful in databases, so data identifiers can
> easily end up being auto incremented.

No. Because when you update the database, all those numbers change and 
your auto increments no longer apply. See Europeana for details of 
exactly that happening.

> I don’t see a problem with using them in URLs if they are the unique
> identifiers for data rows, though I agree that dates are better for
> identifying a dataset. Why do we say to avoid query strings? They are
> useful for requesting specific formats.  I understand the point in “cool
> URIs” about not tying a URL to a specific implementation (like .html or
> .php), but in the case of data in a specific format, it still makes
> sense. I think many of these ideas make more sense if considered in the
> context of assigning resource identifiers for things other than
> published data.

OK, I'm going to take a run at turning that generic intro material into 
discrete BPs.

The work I did on this (currently linked from the BP doc) is at and gives chapter and 
verse on all this. I'll put more effort into transcribing the relevant
bits into a form that the editors and WG may want to include in the BP doc.

Thanks for taking me to task on this, Annette.


> -Annette
> --
> Annette Greiner
> NERSC Data and Analytics Services
> Lawrence Berkeley National Laboratory
> 510-495-2935
> On Aug 6, 2015, at 8:57 AM, Phil Archer <> wrote:
>> Hi Annette,
>> You make several comments here, I want to reply to one particular set, hence the change in subject.
>> On 19/06/2015 03:03, Annette Greiner wrote:
>> [..]
>>> Data Identification
>>> The introductory text about URIs and URLs and IRIs is potentially confusing and not necessary for our audience to understand the BPs about identifiers.
>> I disagree (which is why I wrote it of course!)
>> The three terms *are* confusing and I was attempting to clear that up. My reason being that we do talk about URLs and URIs and they're not interchangeable. A few, a very few, will talk about IRIs. Anyone dipping a toe in reading a W3C spec these days will see that rare term and wonder what the heck it means.
>> Do you think it's worth me having another shot at explaining the differences or are you opposed to including any such explanation?
>>> Also, URLs are for the internet, not just the web.
>> That's not my understanding although I guess it's not an absolute distinction. To take an example of an Internet service that is not on the Web, Skype doesn't use URLs except to address servers, the actual data is not transmitted using HTTP.
>>> I also disagree with the representation of DOIs as something that cannot be looked up, though the question is not something I think we should make readers think about.
>> Hobby horse alert!
>> To look up doi:10.1103/PhysRevD.89.032002 you have to:
>> - strip the doi: scheme;
>> - choose a resolver service (that you have to already know about);
>> - append the remaining string to that base URL to get something like
>> - use HTTP to dereference it.
>> If you choose a different base URI you might get something very different ( for example ;-) )
>> My intention when I included that was to point out that other identifier schemes, DOIs being one of the best known, are not dereferenceable and not (natively) part of the Web.
>>> * I would like this section to limit itself to information that applies to publishing *data*.
>> It's about identifiers and identifiers are dumb strings, therefore I can't see how we can talk about identifiers that only apply to data and not everything else.
>>> The BP is about assigning persistent identifiers to datasets, but the possible approach to implementation is about much more than that.
>> Yes, but that's for the reason just given.
>>> The list items are also not consistent. (one shows use of extensions, another says not to do that).
>> Fair enough, yes, I'd need to expand that and tie it back to the multiple formats BP. I'd want to say something along the lines of:
>> Use an identifier like to link to the resource.
>> Only include the file extension if it refers to a specific representation of that resource, like
>> (btw, a feature of's server set up is that we don't need to include file extensions. A URL like actually returns a .php file (you can add the extension if you like) ). We make a lot of use of conneg.
>>> I worry that this will open up a holy war about how to implement a REST API.
>> OK, that we want to avoid and it's being dealt with in another thread. But I am prepared to defend the general principles here - it's what marks out the Web as a data platform and not a means of transmitting datasets that could just as easily be transported by sending a USB stick in the post.
>> Phil.
>> For tracker: this is issue-194
>> --
>> Phil Archer
>> W3C Data Activity Lead
>> +44 (0)7887 767755
>> @philarcher1


Phil Archer
W3C Data Activity Lead
+44 (0)7887 767755

Received on Friday, 7 August 2015 12:01:50 UTC