RE: Data Identification section (was Re: reviewing the BP doc) from Manuel.CARRASCO-BENITEZ@ec.europa.eu on 2015-08-12 (public-dwbp-wg@w3.org from August 2015)

From: <Manuel.CARRASCO-BENITEZ@ec.europa.eu>
Date: Wed, 12 Aug 2015 13:48:43 +0000
To: <phila@w3.org>, <public-dwbp-wg@w3.org>
Message-ID: <39DB516E46C0E842A2CFFF1BBB7412F16F30EAA0@S-DC-ESTF03-B.net1.cec.eu.int>
Phil,

Allow me to relaunch the proposal to have a separate document for URIs so that they can be properly addressed: shades of this question come back from time to time. The proposal has been ready for nearly a year:

 Compact Uniform Resource Identifier (COMURI)
 http://dragoman.org/comuri

The URI section in the DWBP document could contain an overview. These are the main points:

* URI
URIs identify resources ('dumb strings')

* Variant
A resource may have several variants (representations)

* Dimension
Variant dimensions are the types of representation, such as language and format

* Direct identification
Describe direct identification of variants for language and format
  http://dragoman.org/comuri.html#dvar

* Mnemonic
URIs should be compact and mnemonic: human and machine friendly

* ?R?
URI, URL, URN, IRI. Just use URI everywhere and add something like:

  "In this specification, the term URI is used for the identification schemes: URI, URL, URN and IRI ..."

This is in line with the recommendation in RFC 3986
https://tools.ietf.org/html/rfc3986#section-1.1.3

  " ... Future specifications and related documentation should use the general term "URI" rather than the more restrictive terms "URL" and "URN" ..."


Regards
Tomas


-----Original Message-----
From: Phil Archer [mailto:phila@w3.org] 
Sent: Wednesday, August 12, 2015 1:51 PM
To: 'Data on the Web Best Practices Working Group'
Subject: Re: Data Identification section (was Re: reviewing the BP doc)

I've had a go at making changes based on this conversation. See 
http://philarcher1.github.io/dwbp/bp.html#DataIdentification

I have:

1. Reduced the intro text. I find it hard to make no mention at all of 
what we mean by URI. If the new first paragraph is still unacceptable, 
then, OK, it will have to go, but we - like a lot of people - use the 
term URI when what we actually mean is HTTP IRI and, in a formal 
document, I feel we should have a way of saying "we know this is a 
simplification."

I have retained the numbered list of points about URIs. I know it's more 
tutorial than specification but, well, IMO it needs saying. Point 3 is 
particularly relevant to what comes later on multiple formats.

2. For the BP "Use persistent URIs to identify datasets" I've removed 
the bulleted list entirely and I'm suggesting replacing it with a 
modified version of a table from the URI persistence study I did a 
couple of years ago. It shows a number of documents on the topic.

Better? Worse? Too much?

3. I added in a whole new BP on URIs as identifiers within datasets. 
This includes examples of URIs that end in existing IDs.

There's a close correlation between the data ID stuff and the BP about 
providing data in multiple formats - I'll write a separate mail about 
that to keep things manageable.

Phil.


On 07/08/2015 13:16, Makx Dekkers wrote:
> On the topic of URL/URI/IRI, I think the current text is a bit out of scope.
> In a way, BP10 is a 'best practice' for minting and maintaining URIs, not
> for how to publish data on the web. And, frankly, I think the current
> introduction in section 9.7 is very confusing.
>
> As far as I see it, the aspects of identification that concern the data are
> that publishers should (a) assign URIs to datasets and any bits of data that
> people may want to access and (b) define a persistence policy for the URIs
> and the data. For the specifics of how to mint and maintain URIs, a couple
> of references to external documents could be included.
>
> I would suggest not to go into the differences and overlaps between URL and
> URI -- that will only confuse people. I agree with Annette that it would be
> sensible to just use 'URI' in the document; people can then look at external
> references if they are interested in these other acronyms.
>
> Makx.
>
>
>> -----Original Message-----
>> From: Annette Greiner [mailto:amgreiner@lbl.gov]
>> Sent: 6 August 2015 22:20
>> To: Phil Archer <phila@w3.org>
>> Cc: Data on the Web Best Practices Working Group <public-dwbp-
>> wg@w3.org>
>> Subject: Re: Data Identification section (was Re: reviewing the BP doc)
>>
>> Hi Phil,
>> Thanks for responding to my comments.
>> Re the question of how to handle the URL/URI split, I suggest we just use
> URI
>> uniformly. In fact, the section in question appears to do just that. As
> for IRI, I
>> don't see that appearing anywhere in the latest published draft. The DOI
>> issue I raised goes away if we remove the introduction to URIs/URLs/IRIs.
>> Re keeping the implementation suggestions to info about publishing data, I
>> don't see why you don't see how we can avoid talking about everything
> else.
>> I don't think our document would suffer at all from the removal of a
> bullet
>> point about using 303 redirects for real-world objects, like Alice Brown.
> We
>> are not talking about publishing people on the web.
>>
>> I'm also noticing that other bullets bother me for other reasons. "Re-use
>> existing identifiers" is one. I'm not sure what the intention is. Surely
> we don't
>> want to suggest that publishers use the same identifier for more than one
>> data set. "Link multiple representations" is another. I don't see why we
>> would recommend using the url rather than query string or content
>> negotiation to indicate formats. That rule disagrees directly with the
> last one,
>> "avoid file extensions".  If the intention is to remind publishers to make
> both
>> representations available, that is in a different BP. I think I disagree
> with "Use
>> a dedicated service (i.e., independent of the data originator)", if I'm
>> interpreting it correctly. If I publish data from Lawrence Berkeley
> National
>> Laboratory, I think it is best practice for me to publish it on a server
> managed
>> by the Laboratory. I do think it's wise to use a reliable service, if
> that's the
>> idea. Why do we say to "avoid version numbers" for data? There is a BP
> just
>> below that says to assign URIs to dataset versions. Autoincrement is often
>> useful in databases, so data identifiers can easily end up being auto
>> incremented. I don't see a problem with using them in URLs if they are the
>> unique identifiers for data rows, though I agree that dates are better for
>> identifying a dataset. Why do we say to avoid query strings? They are
> useful
>> for requesting specific formats.  I understand the point in "cool URIs"
> about
>> not tying a URL to a specific implementation (like .html or .php), but in
> the
>> case of data in a specific format, it still makes sense. I think many of
> these
>> ideas make more sense if considered in the context of assigning resource
>> identifiers for things other than published data.
>> -Annette
>> --
>> Annette Greiner
>> NERSC Data and Analytics Services
>> Lawrence Berkeley National Laboratory
>> 510-495-2935
>>
>> On Aug 6, 2015, at 8:57 AM, Phil Archer <phila@w3.org> wrote:
>>
>>> Hi Annette,
>>>
>>> You make several comments here, I want to reply to one particular set,
>> hence the change in subject.
>>>
>>>
>>> On 19/06/2015 03:03, Annette Greiner wrote:
>>> [..]
>>>
>>>
>>>> Data Identification
>>>> The introductory text about URIs and URLs and IRIs is potentially
>> confusing and not necessary for our audience to understand the BPs about
>> identifiers.
>>>
>>> I disagree (which is why I wrote it of course!)
>>>
>>> The three terms *are* confusing and I was attempting to clear that up.
> My
>> reason being that we do talk about URLs and URIs and they're not
>> interchangeable. A few, a very few, will talk about IRIs. Anyone dipping a
> toe
>> in reading a W3C spec these days will see that rare term and wonder what
>> the heck it means.
>>>
>>> Do you think it's worth me having another shot at explaining the
>> differences or are you opposed to including any such explanation?
>>>
>>>
>>>
>>>
>>> Also, URLs are for the internet, not just the web.
>>>
>>> That's not my understanding although I guess it's not an absolute
>> distinction. To take an example of an Internet service that is not on the
> Web,
>> Skype doesn't use URLs except to address servers, the actual data is not
>> transmitted using HTTP.
>>>
>>> I also disagree with the representation of DOIs as something that cannot
> be
>> looked up, though the question is not something I think we should make
>> readers think about.
>>>
>>> Hobby horse alert!
>>>
>>> To look up doi:10.1103/PhysRevD.89.032002 you have to:
>>>
>>> - strip the doi: scheme;
>>>
>>> - choose a resolver service (that you have to already know about);
>>>
>>> - append the remaining string to that base URL to get something like
>>> http://dx.doi.org/10.1103/PhysRevD.89.032002
>>>
>>> - use HTTP to dereference it.
>>>
>>> If you choose a different base URI, you might get something very
>>> different (http://philarcher.org/10.1103/PhysRevD.89.032002 for
>>> example ;-) )
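
The look-up steps above can be sketched in a few lines of Python. This is only an illustration, not official DOI tooling: the function name doi_to_http_uri and its resolver_base parameter are hypothetical, while the DOI and the two base URIs are the ones quoted in the message. The final step, dereferencing over HTTP, is left out because it is an ordinary GET request on the resulting URI.

```python
# Hypothetical helper: turn a "doi:" identifier into an HTTP URI by
# following the steps listed in the message.
DEFAULT_RESOLVER = "http://dx.doi.org/"  # resolver quoted in the message


def doi_to_http_uri(doi, resolver_base=DEFAULT_RESOLVER):
    # Step 1: strip the "doi:" scheme.
    scheme, sep, name = doi.partition(":")
    if sep != ":" or scheme.lower() != "doi" or not name:
        raise ValueError("expected an identifier of the form doi:<name>")

    # Step 2: the resolver has to be known in advance; it is a parameter
    # here precisely because the identifier itself does not name one.
    if not resolver_base.endswith("/"):
        resolver_base += "/"

    # Step 3: append the remaining string to the resolver's base URI.
    return resolver_base + name


if __name__ == "__main__":
    doi = "doi:10.1103/PhysRevD.89.032002"
    print(doi_to_http_uri(doi))
    # A different base gives a different (and probably unrelated) URI.
    print(doi_to_http_uri(doi, "http://philarcher.org"))
```

The resolver being an argument rather than something derived from the identifier is the point being made: the same DOI maps to different HTTP URIs depending on which base is chosen.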
>>>
>>> My intention when I included that was to point out that other identifier
>> schemes, DOIs being one of the best known, are not dereferenceable and
>> not (natively) part of the Web.
>>>
>>>
>>>> * I would like this section to limit itself to information that applies
> to
>> publishing *data*.
>>>
>>> It's about identifiers and identifiers are dumb strings, therefore I
> can't see
>> how we can talk about identifiers that only apply to data and not
> everything
>> else.
>>>
>>>
>>> The BP is about assigning persistent identifiers to datasets, but the
> possible
>> approach to implementation is about much more than that.
>>>
>>> Yes, but that's for the reason just given.
>>>
>>> The list items are also not consistent. (one shows use of extensions,
>> another says not to do that).
>>>
>>> Fair enough, yes, I'd need to expand that and tie it back to the
> multiple
>> formats BP. I'd want to say something along the lines of:
>>>
>>> Use an identifier like http://data.example.org/doc/foo/bar to link to
> the
>> resource.
>>>
>>> Only include the file extension if it refers to a specific
>>> representation of that resource, like
>>> http://data.example.org/doc/foo/bar.rdf
>>> http://data.example.org/doc/foo/bar.html
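
As a rough sketch of that convention, the Python below derives representation-specific URIs from the extension-free one and maps them back again. The function names and the media-type-to-extension table are assumptions made for illustration; only the example.org URIs and the .rdf and .html extensions come from the message.

```python
# Assumed mapping from media type to file extension (illustrative only).
EXTENSIONS = {
    "application/rdf+xml": ".rdf",
    "text/html": ".html",
}


def representation_uri(resource_uri, media_type):
    """Return the URI of one specific representation of the resource."""
    return resource_uri + EXTENSIONS[media_type]


def resource_uri_of(uri):
    """Return the extension-free URI that identifies the resource itself."""
    for ext in EXTENSIONS.values():
        if uri.endswith(ext):
            return uri[: -len(ext)]
    return uri  # already the generic URI


if __name__ == "__main__":
    generic = "http://data.example.org/doc/foo/bar"
    for media_type in EXTENSIONS:
        print(media_type, "->", representation_uri(generic, media_type))
```

Links to the resource would use the generic URI; the extended forms would be used only when a particular format is meant.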
>>>
>>> (btw, a feature of w3.org's server setup is that we don't need to include
>>> file extensions. A URL like
>>> http://www.w3.org/2013/share-psi/workshop/krems/report actually returns a
>>> .php file (you can add the extension if you like).) We make a lot of use
>>> of conneg.
>>>
>>>
>>> I worry that this will open up a holy war about how to implement a REST
> API.
>>>
>>> OK, that we want to avoid and it's being dealt with in another thread.
> But I
>> am prepared to defend the general principles here - it's what marks out
> the
>> Web as a data platform and not a means of transmitting datasets that could
>> just as easily be transported by sending a USB stick in the post.
>>>
>>> Phil.
>>>
>>>
>>> For tracker: this is issue-194
>>>
>>>
>>> --
>>>
>>>
>>> Phil Archer
>>> W3C Data Activity Lead
>>> http://www.w3.org/2013/data/
>>>
>>> http://philarcher.org
>>> +44 (0)7887 767755
>>> @philarcher1
>
>
>
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
Received on Wednesday, 12 August 2015 13:49:17 UTC