RE: Data Identification section (was Re: reviewing the BP doc)

On the topic of URL/URI/IRI, I think the current text is a bit out of scope.
In a way, BP10 is a 'best practice' for minting and maintaining URIs, not
for how to publish data on the web. And, frankly, I think the current
introduction in section 9.7 is very confusing.

As far as I see it, the aspects of identification that concern the data are
that publishers should (a) assign URIs to datasets and any bits of data that
people may want to access and (b) define a persistence policy for the URIs
and the data. For the specifics of how to mint and maintain URIs, a couple
of references to external documents could be included.

I would suggest not going into the differences and overlaps between URL and
URI -- that will only confuse people. I agree with Annette that it would be
sensible to just use 'URI' in the document; people can then look at external
references if they are interested in these other acronyms.
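
For what it is worth, the DOI lookup steps Phil lists in the quoted message
below can be written down in a few lines. This is only a rough sketch: the
function name doi_to_url and the resolver base https://resolver.example/ are
made up for illustration and do not refer to a real service, and the final
step (dereferencing the result over HTTP) is left out.

```python
from urllib.parse import quote


def doi_to_url(doi: str, resolver_base: str) -> str:
    """Turn a doi: identifier into an HTTP URL for a chosen resolver.

    resolver_base is an assumption supplied by the caller; the DOI itself
    says nothing about which resolver to use.
    """
    # Step 1: strip the "doi:" scheme if present (case-insensitive).
    if doi.lower().startswith("doi:"):
        doi = doi[len("doi:"):]
    # Steps 2 and 3: append the remaining string to the chosen base URL,
    # percent-encoding anything that is not safe in a URL path.
    return resolver_base.rstrip("/") + "/" + quote(doi, safe="/")


# Hypothetical resolver base, for illustration only.
url = doi_to_url("doi:10.1103/PhysRevD.89.032002", "https://resolver.example/")
print(url)
```

The resolver has to be passed in as a separate argument, which is the point
Phil makes: the same DOI maps to a different URL for every base you choose.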


> -----Original Message-----
> From: Annette Greiner
> Sent: 6 August 2015 22:20
> To: Phil Archer
> Cc: Data on the Web Best Practices Working Group <public-dwbp-
> Subject: Re: Data Identification section (was Re: reviewing the BP doc)
> Hi Phil,
> Thanks for responding to my comments.
> Re the question of how to handle the URL/URI split, I suggest we just use
> URI uniformly. In fact, the section in question appears to do just that. As
> for IRI, I don't see that appearing anywhere in the latest published draft.
> The DOI issue I raised goes away if we remove the introduction to
> URIs/URLs/IRIs.
> Re keeping the implementation suggestions to info about publishing data, I
> don't see why you don't see how we can avoid talking about everything else.
> I don't think our document would suffer at all from the removal of a
> point about using 303 redirects for real-world objects, like Alice Brown. We
> are not talking about publishing people on the web.
> I'm also noticing that other bullets bother me for other reasons. "Re-use
> existing identifiers" is one. I'm not sure what the intention is. Surely we
> don't want to suggest that publishers use the same identifier for more than
> one data set. "Link multiple representations" is another. I don't see why we
> would recommend using the URL rather than query string or content
> negotiation to indicate formats. That rule disagrees directly with the last
> one, "avoid file extensions". If the intention is to remind publishers to
> make representations available, that is in a different BP. I think I
> disagree with "Use a dedicated service (i.e., independent of the data
> originator)", if I'm interpreting it correctly. If I publish data from
> Lawrence Berkeley National Laboratory, I think it is best practice for me to
> publish it on a server run by the Laboratory. I do think it's wise to use a
> reliable service, if that's the idea. Why do we say to "avoid version
> numbers" for data? There is a BP below that says to assign URIs to dataset
> versions. Autoincrement is often useful in databases, so data identifiers
> can easily end up being auto-incremented. I don't see a problem with using
> them in URLs if they are the unique identifiers for data rows, though I
> agree that dates are better for identifying a dataset. Why do we say to
> avoid query strings? They are useful for requesting specific formats. I
> understand the point in "cool URIs" of not tying a URL to a specific
> implementation (like .html or .php), but in the case of data in a specific
> format, it still makes sense. I think many of these ideas make more sense if
> considered in the context of assigning resource identifiers for things other
> than published data.
> -Annette
> --
> Annette Greiner
> NERSC Data and Analytics Services
> Lawrence Berkeley National Laboratory
> 510-495-2935
> On Aug 6, 2015, at 8:57 AM, Phil Archer wrote:
> > Hi Annette,
> >
> > You make several comments here; I want to reply to one particular set,
> > hence the change in subject.
> >
> >
> > On 19/06/2015 03:03, Annette Greiner wrote:
> > [..]
> >
> >
> >> Data Identification
> >> The introductory text about URIs and URLs and IRIs is potentially
> >> confusing and not necessary for our audience to understand the BPs about
> >> identifiers.
> >
> > I disagree (which is why I wrote it of course!)
> >
> > The three terms *are* confusing and I was attempting to clear that up. The
> > reason being that we do talk about URLs and URIs and they're not
> > interchangeable. A few, a very few, will talk about IRIs. Anyone dipping a
> > toe in reading a W3C spec these days will see that rare term and wonder
> > what the heck it means.
> >
> > Do you think it's worth me having another shot at explaining the
> > differences or are you opposed to including any such explanation?
> >
> >
> >
> >
> >> Also, URLs are for the internet, not just the web.
> >
> > That's not my understanding although I guess it's not an absolute
> > distinction. To take an example of an Internet service that is not on the
> > Web: Skype doesn't use URLs except to address servers; the actual data is
> > not transmitted using HTTP.
> >
> >> I also disagree with the representation of DOIs as something that cannot
> >> be looked up, though the question is not something I think we should make
> >> readers think about.
> >
> > Hobby horse alert!
> >
> > To look up doi:10.1103/PhysRevD.89.032002 you have to:
> >
> > - strip the doi: scheme;
> >
> > - choose a resolver service (that you have to already know about);
> >
> > - append the remaining string to that base URL to get something like
> >
> >
> > - use HTTP to dereference it.
> >
> > If you choose a different base URI, you might get something very
> > different ( for
> > example ;-) )
> >
> > My intention when I included that was to point out that other identifier
> > schemes, DOIs being one of the best known, are not dereferenceable and
> > not (natively) part of the Web.
> >
> >
> >> * I would like this section to limit itself to information that applies
> >> to publishing *data*.
> >
> > It's about identifiers and identifiers are dumb strings, therefore I can't
> > see how we can talk about identifiers that only apply to data and not
> > everything else.
> >
> >
> >> The BP is about assigning persistent identifiers to datasets, but the
> >> approach to implementation is about much more than that.
> >
> > Yes, but that's for the reason just given.
> >
> >> The list items are also not consistent. (one shows use of extensions,
> >> another says not to do that).
> >
> > Fair enough, yes, I'd need to expand that and tie it back to the
> > formats BP. I'd want to say something along the lines of:
> >
> > Use an identifier like to link to a
> > resource.
> >
> > Only include the file extension if it refers to a specific
> > representation of that resource, like
> >
> >
> >
> > (btw, a feature of's server set up is that we don't need to use
> > file extensions. A URL like
> > psi/workshop/krems/report actually returns a .php file (you can add the
> > extension if you like) ). We make a lot of use of conneg.
> >
> >
> >> I worry that this will open up a holy war about how to implement a REST
> >> API.
> >
> > OK, that we want to avoid and it's being dealt with in another thread. But
> > I am prepared to defend the general principles here - it's what marks out
> > the Web as a data platform and not a means of transmitting datasets that
> > could just as easily be transported by sending a USB stick in the post.
> >
> > Phil.
> >
> >
> > For tracker: this is issue-194
> >
> >
> > --
> >
> >
> > Phil Archer
> > W3C Data Activity Lead
> >
> >
> >
> > +44 (0)7887 767755
> > @philarcher1

Received on Friday, 7 August 2015 12:17:28 UTC