Re: Data Identification section (was Re: reviewing the BP doc)

Hi Tomas,

Pls see below.

On 12/08/2015 14:48, wrote:
> Phil,
> Allow me to re-launch the proposal to have a separate document for URIs so that the topic can be properly addressed: shades of this question come back from time to time. The proposal has been ready for nearly a year:
>   Compact Uniform Resource Identifier (COMURI)
> The URI section in the DWBP document could contain an overview. These are the main points:
> * URI
> URI identify resources ('dumb strings')

With you so far.

> * Variant
> Resource may have several variants (representations)

Yep - language, format etc.

> * Dimension
> Variants dimensions are the types of representations, such as languages and formats

> * Direct identification
> Describe direct identification of variants for language and format


> * Mnemonic
> URIs should be compact and mnemonic: human and machine friendly


Sorry, I can't find where but I'm pretty sure I've gone on record before 
now in not agreeing with you on this.

True - URIs should be no longer than necessary, but saying they should 
always be short is dangerous and leads to ephemeral identifiers.

In the 1970s, would probably have referred to 
the infamous European wine lake, or milk lake etc. Now it would be 
something to do with INSPIRE.

Better to have

and, for something as inherently complex as geospatial data:{theme}/{class}/{inspireNamespaceId}/{inspireLocalId}[/{inspireVersionId}]

(this from

Of course that's a much longer string and in some circumstances is 
therefore more awkward to work with but by using path segments to 
effectively classify the identified thing, it's more likely to persist.
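
A minimal sketch of that path-segment pattern, in Python. The base `http://example.org/id` and the function name are hypothetical, chosen only for illustration; the segment order is the only part taken from the template above.

```python
from urllib.parse import quote

# Hypothetical base; the real one depends on the publisher.
BASE = "http://example.org/id"


def build_uri(theme, cls, namespace_id, local_id, version_id=None):
    """Join the template segments, appending the optional version last."""
    segments = [theme, cls, namespace_id, local_id]
    if version_id is not None:
        segments.append(version_id)
    # Percent-encode each segment so a stray "/" cannot add a path level.
    return BASE + "/" + "/".join(quote(s, safe="") for s in segments)
```

Each segment classifies the thing a little further, which is what makes the longer string more likely to stay stable.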

> * ?R?
> URI, URL, URN, IRI. Just use URI everywhere and add something like:
>    "In this specification, the term URI is used for the identification schemes: URI, URL, URN and IRI ..."
> This is in line with the recommendation in RFC3986
>    " ... Future specifications and related documentation should use the general term "URI" rather than the more restrictive terms "URL" and "URN" ..."

But we *want* to be restrictive. We're only talking about HTTP URIs, 
we're not talking about URNs, or even URLs. Hence I think we need to say 
something, no?
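
To make the restriction concrete, here is a rough sketch in Python of what "HTTP URIs only" could mean as a check. The function name is hypothetical, and treating https as in scope alongside http is my assumption, not something the group has resolved.

```python
from urllib.parse import urlsplit


def is_http_uri(identifier):
    """True only for absolute http(s) URIs that name a host."""
    parts = urlsplit(identifier)
    # URNs, DOIs and other schemes fail the scheme test;
    # relative references fail the host test.
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Under that check a URN or a bare doi: string is out of scope, which is the point being made above.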

In terms of the long history of your paper, it has been reviewed and 
discussed many times.

For example

At TPAC last year it was resolved [1] that "all of our best practices 
will be in one document, because it is easier for readers/implementers 
to understand and easier for editors/contributors to manage"

That followed a discussion on the COMURI paper.

See closed action items and the threads they link to:

My suggestion, which is the same as my suggestion to the authors of the 
enrichment paper and is consistent with the resolutions already passed, 
is that we incorporate your text within the BP doc.

So, for example,

The dumb strings aspect. I put that in the intro to the section on Data 
Identification - maybe you can suggest something better? Either in the 
intro, as I have suggested, or in a BP.

All the stuff about variants and dimensions, that goes in the BP about 
multiple formats I think. I started writing a suggestion for the example 
section of that one last night but stopped so as not to confuse the 
changes I'm suggesting for Data Identification. I was thinking of using 
the Ordnance Survey data. We can take any location but, for the sake of 
argument, let's take my favourite part of the world:

That data is available in HTML, JSON, XML and RDF/XML. Disappointingly, 
it doesn't include labels in both English and Welsh but you can't have 
everything. is in English and Chinese 
(.html.en and .html.zh-hans) but that's not data in the sense we mean of 

Oh and I'm *really* hoping we don't need to say anything about 
URIs for real world objects cf. data about them. Can we word the BP on 
multiple formats by saying that

provides data about the Welsh county of Pembrokeshire without saying that is an identifier 
for the place itself?

but we may have to face this (if not here, it's looming in the spatial 
data WG).




> Regards
> Tomas
> -----Original Message-----
> From: Phil Archer []
> Sent: Wednesday, August 12, 2015 1:51 PM
> To: 'Data on the Web Best Practices Working Group'
> Subject: Re: Data Identification section (was Re: reviewing the BP doc)
> I've had a go at making changes based on this conversation. See
> I have:
> 1. Reduced the intro text. I find it hard to make no mention at all of
> what we mean by URI. If the new first paragraph is still unacceptable,
> then, OK, it will have to go, but we - like a lot of people - use the
> term URI when what we actually mean is HTTP IRI and, in a formal
> document, I feel we should have a way of saying "we know this is a
> simplification."
> I have retained the numbered list of points about URIs. I know it's more
> tutorial than specification but, well, IMO it needs saying. Point 3 is
> particularly relevant to what comes later on multiple formats.
> 2. For the BP "Use persistent URIs to identify datasets" I've removed
> the bulleted list entirely and I'm suggesting replacing it with a
> modified version of a table from the URI persistence study I did a
> couple of years ago. It shows a number of documents on the topic.
> Better? Worse? Too much?
> 3. I added in a whole new BP on URIs as identifiers within datasets.
> This includes examples of URIs that end in existing IDs.
> There's a close correlation between the data ID stuff and the BP about
> providing data in multiple formats - I'll write a separate mail about
> that to keep things manageable.
> Phil.
> On 07/08/2015 13:16, Makx Dekkers wrote:
>> On the topic of URL/URI/IRI, I think the current text is a bit out of scope.
>> In a way, BP10 is a 'best practice' for minting and maintaining URIs, not
>> for how to publish data on the web. And, frankly, I think the current
>> introduction in section 9.7 is very confusing.
>> As far as I see it, the aspects of identification that concern the data are
>> that publishers should (a) assign URIs to datasets and any bits of data that
>> people may want to access and (b) define a persistence policy for the URIs
>> and the data. For the specifics of how to mint and maintain URIs, a couple
>> of references to external documents could be included.
>> I would suggest not to go into the differences and overlaps between URL and
>> URI -- that will only confuse people. I agree with Annette that it would be
>> sensible to just use 'URI' in the document; people can then look at external
>> references if they are interested in these other acronyms.
>> Makx.
>>> -----Original Message-----
>>> From: Annette Greiner []
>>> Sent: 6 August 2015 22:20
>>> To: Phil Archer <>
>>> Cc: Data on the Web Best Practices Working Group <public-dwbp-
>>> Subject: Re: Data Identification section (was Re: reviewing the BP doc)
>>> Hi Phil,
>>> Thanks for responding to my comments.
>>> Re the question of how to handle the URL/URI split, I suggest we just use
>> URI
>>> uniformly. In fact, the section in question appears to do just that. As
>> for IRI, I
>>> don't see that appearing anywhere in the latest published draft. The DOI
>>> issue I raised goes away if we remove the introduction to URIs/URLs/IRIs.
>>> Re keeping the implementation suggestions to info about publishing data, I
>>> don't see why you don't see how we can avoid talking about everything
>> else.
>>> I don't think our document would suffer at all from the removal of a
>> bullet
>>> point about using 303 redirects for real-world objects, like Alice Brown.
>> We
>>> are not talking about publishing people on the web.
>>> I'm also noticing that other bullets bother me for other reasons. "Re-use
>>> existing identifiers" is one. I'm not sure what the intention is. Surely
>> we don't
>>> want to suggest that publishers use the same identifier for more than one
>>> data set. "Link multiple representations" is another. I don't see why we
>>> would recommend using the url rather than query string or content
>>> negotiation to indicate formats. That rule disagrees directly with the
>> last one,
>>> "avoid file extensions".  If the intention is to remind publishers to make
>> both
>>> representations available, that is in a different BP. I think I disagree
>> with "Use
>>> a dedicated service (i.e., independent of the data originator)", if I'm
>>> interpreting it correctly. If I publish data from Lawrence Berkeley
>> National
>>> Laboratory, I think it is best practice for me to publish it on a server
>> managed
>>> by the Laboratory. I do think it's wise to use a reliable service, if
>> that's the
>>> idea. Why do we say to "avoid version numbers" for data? There is a BP
>> just
>>> below that says to assign URIs to dataset versions. Autoincrement is often
>>> useful in databases, so data identifiers can easily end up being auto
>>> incremented. I don't see a problem with using them in URLs if they are the
>>> unique identifiers for data rows, though I agree that dates are better for
>>> identifying a dataset. Why do we say to avoid query strings? They are
>> useful
>>> for requesting specific formats.  I understand the point in "cool URIs"
>> about
>>> not tying a URL to a specific implementation (like .html or .php), but in
>> the
>>> case of data in a specific format, it still makes sense. I think many of
>> these
>>> ideas make more sense if considered in the context of assigning resource
>>> identifiers for things other than published data.
>>> -Annette
>>> --
>>> Annette Greiner
>>> NERSC Data and Analytics Services
>>> Lawrence Berkeley National Laboratory
>>> 510-495-2935
>>> On Aug 6, 2015, at 8:57 AM, Phil Archer <> wrote:
>>>> Hi Annette,
>>>> You make several comments here, I want to reply to one particular set,
>>> hence the change in subject.
>>>> On 19/06/2015 03:03, Annette Greiner wrote:
>>>> [..]
>>>>> Data Identification
>>>>> The introductory text about URIs and URLs and IRIs is potentially
>>> confusing and not necessary for our audience to understand the BPs about
>>> identifiers.
>>>> I disagree (which is why I wrote it of course!)
>>>> The three terms *are* confusing and I was attempting to clear that up.
>> My
>>> reason being that we do talk about URLs and URIs and they're not
>>> interchangeable. A few, a very few, will talk about IRIs. Anyone dipping a
>> toe
>>> in reading a W3C spec these days will see that rare term and wonder what
>>> the heck it means.
>>>> Do you think it's worth me having another shot at explaining the
>>> differences or are you opposed to including any such explanation?
>>>> Also, URLs are for the internet, not just the web.
>>>> That's not my understanding although I guess it's not an absolute
>>> distinction. To take an example of an Internet service that is not on the
>> Web,
>>> Skype doesn't use URLs except to address servers, the actual data is not
>>> transmitted using HTTP.
>>>> I also disagree with the representation of DOIs as something that cannot
>> be
>>> looked up, though the question is not something I think we should make
>>> readers think about.
>>>> Hobby horse alert!
>>>> To look up doi:10.1103/PhysRevD.89.032002 you have to:
>>>> - strip the doi: scheme;
>>>> - choose a resolver service (that you have to already know about);
>>>> - append the remaining string to that base URL to get something like
>>>> - use HTTP to dereference it.
>>>> If you choose a different base URI you might get something very
>>>> different ( for
>>>> example ;-) )
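
The steps listed above can be sketched in Python as follows. The default resolver base `https://doi.org/` is an assumption for illustration (the caller has to know about some resolver, which is exactly the problem described), and the function name is hypothetical. The final step, the HTTP dereference itself, is left out so the sketch needs no network access.

```python
from urllib.parse import quote


def doi_to_http_uri(doi, resolver="https://doi.org/"):
    """Strip the doi: scheme and append the rest to a resolver base."""
    prefix = "doi:"
    name = doi[len(prefix):] if doi.lower().startswith(prefix) else doi
    # Keep "/" literal; percent-encode anything unsafe in a path.
    return resolver + quote(name, safe="/")
```

Passing a different `resolver` value gives a different HTTP URI for the same DOI, so what comes back depends on a choice the identifier itself does not record.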
>>>> My intention when I included that was to point out that other identifier
>>> schemes, DOIs being one of the best known, are not dereferenceable and
>>> not (natively) part of the Web.
>>>>> * I would like this section to limit itself to information that applies
>> to
>>> publishing *data*.
>>>> It's about identifiers and identifiers are dumb strings, therefore I
>> can't see
>>> how we can talk about identifiers that only apply to data and not
>> everything
>>> else.
>>>> The BP is about assigning persistent identifiers to datasets, but the
>> possible
>>> approach to implementation is about much more than that.
>>>> Yes, but that's for the reason just given.
>>>> The list items are also not consistent. (one shows use of extensions,
>>> another says not to do that).
>>>> Fair enough, yes, I'd need to expand that and tie it back to the
>> multiple
>>> formats BP. I'd want to say something along the lines of:
>>>> Use an identifier like to link to
>> the
>>> resource.
>>>> Only include the file extension if it refers to a specific
>>>> representation of that resource, like
>>>> (btw, a feature of's server set up is that we don't need to
>> include
>>> file extensions. A URL like
>>> psi/workshop/krems/report actually returns a .php file (you can add the
>>> extension if you like) ). We make a lot of use of conneg.
>>>> I worry that this will open up a holy war about how to implement a REST
>> API.
>>>> OK, that we want to avoid and it's being dealt with in another thread.
>> But I
>>> am prepared to defend the general principles here - it's what marks out
>> the
>>> Web as a data platform and not a means of transmitting datasets that could
>>> just as easily be transported by sending a USB stick in the post.
>>>> Phil.
>>>> For tracker: this is issue-194
>>>> --
>>>> Phil Archer
>>>> W3C Data Activity Lead
>>>> +44 (0)7887 767755
>>>> @philarcher1


Phil Archer
W3C Data Activity Lead
+44 (0)7887 767755

Received on Wednesday, 12 August 2015 14:57:00 UTC