
RE: Data Identification section (was Re: reviewing the BP doc)

From: <Manuel.CARRASCO-BENITEZ@ec.europa.eu>
Date: Thu, 13 Aug 2015 08:37:39 +0000
To: <phila@w3.org>, <public-dwbp-wg@w3.org>
Message-ID: <39DB516E46C0E842A2CFFF1BBB7412F16F30ED81@S-DC-ESTF03-B.net1.cec.eu.int>
Phil,

* URI length
Fine with "... URIs should be no longer than necessary ..."; your examples are cases where longer URIs are justifiable. But this is a BP, and readers should expect specific guidelines. For example:

 - Opaque        http://example.com/1234
 - Mnemonic      http://example.com/2015-08-13
 - Hierarchical  http://example.eu/myth/lakes/wine
 - Parametric    http://location.data.gov.uk/so/{theme}/{class}/{inspireNamespaceId}/{inspireLocalId}[/{inspireVersionId}]

Also, there should be guidelines on avoiding unnecessary file extensions:
 - Best variant          http://example.com/1234
 - The PDF variant   http://example.com/1234.pdf

Path segments should be used only to express a hierarchy.
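
As a rough illustration of the extension guideline above, the following Python sketch maps a format-specific URL back to the extension-free one. The function name generic_url and the FORMAT_EXTENSIONS set are assumptions made for this example; they are not part of any specification.

```python
from posixpath import splitext
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of format extensions, chosen only for this example.
FORMAT_EXTENSIONS = {".pdf", ".html", ".rdf", ".json", ".xml"}


def generic_url(url: str) -> str:
    """Return the URL with a trailing format extension removed from its path."""
    parts = urlsplit(url)
    stem, ext = splitext(parts.path)
    if ext.lower() in FORMAT_EXTENSIONS:
        # Keep scheme, host, query and fragment; only the path changes.
        parts = parts._replace(path=stem)
    return urlunsplit(parts)


print(generic_url("http://example.com/1234.pdf"))
print(generic_url("http://example.com/1234"))
```

Both calls yield the same extension-free URL, which would be the one to publish; the ".pdf" form would then only name one specific variant.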


* URL
"HTTP URIs" are URLs: they contain a location. Hence: just use URL and forget URI.


* Paper
If "all of our best practices will be in one document ..." is a dogma, the discussion is closed. One can just expand the appropriate sections in the BP document.


Data identification and data should be specified separately: you have been fully aware since before the group started that my main interest was in data identification; if the consensus is that this is out of scope, fine. But stating that it has been reviewed and discussed many times is over-stretching a few posts that were answered:

https://lists.w3.org/Archives/Public/public-dwbp-wg/2014Oct/0051.html
https://lists.w3.org/Archives/Public/public-dwbp-wg/2014Oct/0054.html
https://lists.w3.org/Archives/Public/public-dwbp-wg/2014Oct/0063.html

I might be over-aware that data is a hard field, so I focused on the simpler data identification: for a while I shadowed Long-Term Archive and Notary Services (LTANS)
  https://tools.ietf.org/html/rfc4810


* Ordnance Survey
One should go for a URL without "doc" or "id". Widely used services should have shorter dedicated domains.

  http://data.ordnancesurvey.co.uk/7000000000025490

This is a URL (a URI with a location mechanism) that identifies a resource that might be abstract or physical, whatever the author decides. Resources could have several variants and metadata.

- Data about the place : as an abstract variant or metadata
- Place itself                 : a physical variant

RFC-3986 states:
 https://tools.ietf.org/html/rfc3986

 "A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an *abstract* or *physical* resource."

As far as I am aware, variants of the same resource can be of different natures (abstract or physical).

<half_joke>
My teacher did not expect that theological studies would be applicable to informatics (which did not exist at the time):
 https://en.wikipedia.org/wiki/Miaphysite
</half_joke>

COMURI proposes a mechanism for direct metadata request, fully respecting the existing standards:
 http://dragoman.org/comuri.html#uri-metadata

  http://example.com/foo? # "URI metadata request" using the "empty string"
  file:///foo.comuri.html # "metadata file" using a comuri metadata file
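
A minimal Python sketch of how a server might recognise the first form, assuming that "empty string" means a "?" delimiter followed by an empty query component. That reading, and the helper name is_metadata_request, are assumptions for illustration; the COMURI document is the authority on the actual rules.

```python
from urllib.parse import urlsplit


def is_metadata_request(uri: str) -> bool:
    """True if the URI has a '?' delimiter but an empty query string."""
    without_fragment = uri.split("#", 1)[0]
    # urlsplit() reports query == "" both with and without the "?",
    # so the delimiter itself has to be checked on the raw string.
    return "?" in without_fragment and urlsplit(without_fragment).query == ""


print(is_metadata_request("http://example.com/foo?"))
print(is_metadata_request("http://example.com/foo"))
```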

Regards
Tomas
________________________________________
From: Phil Archer [phila@w3.org]
Sent: 12 August 2015 16:56
To: CARRASCO BENITEZ Manuel (DGT); public-dwbp-wg@w3.org
Subject: Re: Data Identification section (was Re: reviewing the BP doc)

Hi Tomas,

Pls see below.

On 12/08/2015 14:48, Manuel.CARRASCO-BENITEZ@ec.europa.eu wrote:
> Phil,
>
> Allow me to re-launch the proposal to have a separate document for URIs so that they can be properly addressed: shades of this question come back from time to time. The proposal has been ready for nearly a year:
>
>   Compact Uniform Resource Identifier (COMURI)
>   http://dragoman.org/comuri
>
> The URI section in the DWBP document could contain an overview. These are the main points:
>
> * URI
> URIs identify resources ('dumb strings')

With you so far.

>
> * Variant
> A resource may have several variants (representations)

Yep - language, format etc.

>
> * Dimension
> Variant dimensions are the types of representations, such as languages and formats



>
> * Direct identification
> Describe direct identification of variants for language and format
>    http://dragoman.org/comuri.html#dvar

OK.

>
> * Mnemonic
> URIs should be compact and mnemonic: human and machine friendly

Not OK IMO.

Sorry, I can't find where but I'm pretty sure I've gone on record before
now in not agreeing with you on this.

True - URIs should be no longer than necessary, but saying they should
always be short is dangerous and leads to ephemeral identifiers.

In the 1970s, http://example.eu/lakes would probably have referred to
the infamous European wine lake, or milk lake etc. Now it would be
something to do with INSPIRE.

Better to have

http://example.eu/myth/lakes/wine

and, for something as inherently complex as geospatial data:

http://location.data.gov.uk/so/{theme}/{class}/{inspireNamespaceId}/{inspireLocalId}[/{inspireVersionId}]

(this from
https://github.com/UKGovLD/URI-patterns-location/blob/master/URI%20for%20Location.md)


Of course that's a much longer string and in some circumstances is
therefore more awkward to work with but by using path segments to
effectively classify the identified thing, it's more likely to persist.
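
To make the pattern above concrete, here is a minimal Python sketch that fills in such a template. The square brackets marking an optional segment are the notation used in the pattern itself (they are not RFC 6570 syntax), and the expand helper and all sample values are hypothetical.

```python
import re
from urllib.parse import quote

PATTERN = ("http://location.data.gov.uk/so/{theme}/{class}/"
           "{inspireNamespaceId}/{inspireLocalId}[/{inspireVersionId}]")


def expand(pattern: str, values: dict) -> str:
    """Fill {name} placeholders; drop a [...] group if one of its names is missing."""
    def fill(text: str) -> str:
        # Percent-encode each value so it stays within a single path segment.
        return re.sub(r"\{(\w+)\}",
                      lambda m: quote(str(values[m.group(1)]), safe=""),
                      text)

    def optional(match: re.Match) -> str:
        body = match.group(1)
        names = re.findall(r"\{(\w+)\}", body)
        return fill(body) if all(name in values for name in names) else ""

    # Resolve the optional groups first, then the required placeholders.
    return fill(re.sub(r"\[([^\]]*)\]", optional, pattern))


# Placeholder values, invented purely for this example.
sample = {"theme": "theme-x", "class": "ClassY",
          "inspireNamespaceId": "ns-1", "inspireLocalId": "local-42"}
print(expand(PATTERN, sample))
print(expand(PATTERN, {**sample, "inspireVersionId": "v3"}))
```

A missing required value raises KeyError, which seems preferable to silently minting a shorter identifier.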


>
> * ?R?
> URI, URL, URN, IRI. Just use URI everywhere and add something like:
>
>    "In this specification, the term URI is used for the identification schemes: URI, URL, URN and IRI ..."
>
> This is in line with the recommendation in RFC3986
> https://tools.ietf.org/html/rfc3986#section-1.1.3
>
>    " ... Future specifications and related documentation should use the general term "URI" rather than the more restrictive terms "URL" and "URN" ..."

But we *want* to be restrictive. We're only talking about HTTP URIs,
we're not talking about URNs, or even URLs. Hence I think we need to say
something, no?

In terms of the long history of your paper, it has been reviewed and
discussed many times.

For example

At TPAC last year it was resolved [1] that "all of our best practices
will be in one document, because it is easier for readers/implementers
to understand and easier for editors/contributors to manage"

That followed a discussion on the COMURI paper.

See closed action items and the threads they link to:
http://www.w3.org/2013/dwbp/track/actions/79
http://www.w3.org/2013/dwbp/track/actions/81
http://www.w3.org/2013/dwbp/track/actions/126

My suggestion, which is the same as my suggestion to the authors of the
enrichment paper and is consistent with the resolutions already passed,
is that we incorporate your text within the BP doc.

So, for example,

The dumb strings aspect. I put that in the intro to the section on Data
Identification - maybe you can suggest something better? Either in the
intro, as I have suggested, or in a BP.

All the stuff about variants and dimensions, that goes in the BP about
multiple formats I think. I started writing a suggestion for the example
section of that one last night but stopped so as not to confuse the
changes I'm suggesting for Data Identification. I was thinking of using
the Ordnance Survey data. We can take any location but, for the sake of
argument, let's take my favourite part of the world:

http://data.ordnancesurvey.co.uk/doc/7000000000025490

That data is available in HTML, JSON, XML and RDF/XML. Disappointingly,
it doesn't include labels in both English and Welsh but you can't have
everything. http://www.w3.org/2015/ceo-ld/ is in English and Chinese
(.html.en and .html.zh-hans) but that's not data in the sense we mean of
course.
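
A minimal Python sketch of how a client could ask for each of those formats from the one URL using the HTTP Accept header. Whether the server actually negotiates on this URL is an assumption here; the media types are the standard registered ones, and the MEDIA_TYPES table and request_for helper are hypothetical names for this example. No request is sent.

```python
from urllib.request import Request

DOC_URL = "http://data.ordnancesurvey.co.uk/doc/7000000000025490"

# Standard media types for the four formats mentioned above.
MEDIA_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "xml": "application/xml",
    "rdf": "application/rdf+xml",
}


def request_for(fmt: str, url: str = DOC_URL) -> Request:
    """Build (but do not send) a request that asks for one format."""
    return Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


req = request_for("rdf")
print(req.full_url, req.get_header("Accept"))
```

Passing the result to urllib.request.urlopen() would perform the actual lookup; the URL stays the same whichever format is requested.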

Oh, and I'm *really* hoping we don't need to say anything about
URIs for real world objects cf. data about them. Can we word the BP on
multiple formats by saying that

http://data.ordnancesurvey.co.uk/doc/7000000000025490

provides data about the Welsh county of Pembrokeshire without saying that

http://data.ordnancesurvey.co.uk/id/7000000000025490 is an identifier
for the place itself?

but we may have to face this (if not here, it's looming in the spatial
data WG).

WDYT?

Phil.

[1] https://www.w3.org/2013/meeting/dwbp/2014-10-31#resolution_6




>
>
> Regards
> Tomas
>
>
> -----Original Message-----
> From: Phil Archer [mailto:phila@w3.org]
> Sent: Wednesday, August 12, 2015 1:51 PM
> To: 'Data on the Web Best Practices Working Group'
> Subject: Re: Data Identification section (was Re: reviewing the BP doc)
>
> I've had a go at making changes based on this conversation. See
> http://philarcher1.github.io/dwbp/bp.html#DataIdentification
>
> I have:
>
> 1. Reduced the intro text. I find it hard to make no mention at all of
> what we mean by URI. If the new first paragraph is still unacceptable,
> then, OK, it will have to go, but we - like a lot of people - use the
> term URI when what we actually mean is HTTP IRI and, in a formal
> document, I feel we should have a way of saying "we know this is a
> simplification."
>
> I have retained the numbered list of points about URIs. I know it's more
> tutorial than specification but, well, IMO it needs saying. Point 3 is
> particularly relevant to what comes later on multiple formats.
>
> 2. For the BP "Use persistent URIs to identify datasets" I've removed
> the bulleted list entirely and I'm suggesting replacing it with a
> modified version of a table from the URI persistence study I did a
> couple of years ago. It shows a number of documents on the topic.
>
> Better? Worse? Too much?
>
> 3. I added in a whole new BP on URIs as identifiers within datasets.
> This includes examples of URIs that end in existing IDs.
>
> There's a close correlation between the data ID stuff and the BP about
> providing data in multiple formats - I'll write a separate mail about
> that to keep things manageable.
>
> Phil.
>
>
> On 07/08/2015 13:16, Makx Dekkers wrote:
>> On the topic of URL/URI/IRI, I think the current text is a bit out of scope.
>> In a way, BP10 is a 'best practice' for minting and maintaining URIs, not
>> for how to publish data on the web. And, frankly, I think the current
>> introduction in section 9.7 is very confusing.
>>
>> As far as I see it, the aspects of identification that concern the data are
>> that publishers should (a) assign URIs to datasets and any bits of data that
>> people may want to access and (b) define a persistence policy for the URIs
>> and the data. For the specifics of how to mint and maintain URIs, a couple
>> of references to external documents could be included.
>>
>> I would suggest not to go into the differences and overlaps between URL and
>> URI -- that will only confuse people. I agree with Annette that it would be
>> sensible to just use 'URI' in the document; people can then look at external
>> references if they are interested in these other acronyms.
>>
>> Makx.
>>
>>
>>> -----Original Message-----
>>> From: Annette Greiner [mailto:amgreiner@lbl.gov]
>>> Sent: 6 August 2015 22:20
>>> To: Phil Archer <phila@w3.org>
>>> Cc: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
>>> Subject: Re: Data Identification section (was Re: reviewing the BP doc)
>>>
>>> Hi Phil,
>>> Thanks for responding to my comments.
>>> Re the question of how to handle the URL/URI split, I suggest we just use URI
>>> uniformly. In fact, the section in question appears to do just that. As for IRI, I
>>> don't see that appearing anywhere in the latest published draft. The DOI
>>> issue I raised goes away if we remove the introduction to URIs/URLs/IRIs.
>>> Re keeping the implementation suggestions to info about publishing data, I
>>> don't see why you don't see how we can avoid talking about everything else.
>>> I don't think our document would suffer at all from the removal of a bullet
>>> point about using 303 redirects for real-world objects, like Alice Brown. We
>>> are not talking about publishing people on the web.
>>>
>>> I'm also noticing that other bullets bother me for other reasons. "Re-use
>>> existing identifiers" is one. I'm not sure what the intention is. Surely we don't
>>> want to suggest that publishers use the same identifier for more than one
>>> data set. "Link multiple representations" is another. I don't see why we
>>> would recommend using the url rather than query string or content
>>> negotiation to indicate formats. That rule disagrees directly with the last one,
>>> "avoid file extensions".  If the intention is to remind publishers to make both
>>> representations available, that is in a different BP. I think I disagree with "Use
>>> a dedicated service (i.e., independent of the data originator)", if I'm
>>> interpreting it correctly. If I publish data from Lawrence Berkeley National
>>> Laboratory, I think it is best practice for me to publish it on a server managed
>>> by the Laboratory. I do think it's wise to use a reliable service, if that's the
>>> idea. Why do we say to "avoid version numbers" for data? There is a BP just
>>> below that says to assign URIs to dataset versions. Autoincrement is often
>>> useful in databases, so data identifiers can easily end up being auto
>>> incremented. I don't see a problem with using them in URLs if they are the
>>> unique identifiers for data rows, though I agree that dates are better for
>>> identifying a dataset. Why do we say to avoid query strings? They are useful
>>> for requesting specific formats.  I understand the point in "cool URIs" about
>>> not tying a URL to a specific implementation (like .html or .php), but in the
>>> case of data in a specific format, it still makes sense. I think many of these
>>> ideas make more sense if considered in the context of assigning resource
>>> identifiers for things other than published data.
>>> -Annette
>>> --
>>> Annette Greiner
>>> NERSC Data and Analytics Services
>>> Lawrence Berkeley National Laboratory
>>> 510-495-2935
>>>
>>> On Aug 6, 2015, at 8:57 AM, Phil Archer <phila@w3.org> wrote:
>>>
>>>> Hi Annette,
>>>>
>>>> You make several comments here, I want to reply to one particular set,
>>>> hence the change in subject.
>>>>
>>>>
>>>> On 19/06/2015 03:03, Annette Greiner wrote:
>>>> [..]
>>>>
>>>>
>>>>> Data Identification
>>>>> The introductory text about URIs and URLs and IRIs is potentially
>>>>> confusing and not necessary for our audience to understand the BPs about
>>>>> identifiers.
>>>>
>>>> I disagree (which is why I wrote it of course!)
>>>>
>>>> The three terms *are* confusing and I was attempting to clear that up. My
>>>> reason being that we do talk about URLs and URIs and they're not
>>>> interchangeable. A few, a very few, will talk about IRIs. Anyone dipping a toe
>>>> in reading a W3C spec these days will see that rare term and wonder what
>>>> the heck it means.
>>>>
>>>> Do you think it's worth me having another shot at explaining the
>>>> differences or are you opposed to including any such explanation?
>>>>
>>>>
>>>>
>>>>
>>>> Also, URLs are for the internet, not just the web.
>>>>
>>>> That's not my understanding although I guess it's not an absolute
>>>> distinction. To take an example of an Internet service that is not on the Web,
>>>> Skype doesn't use URLs except to address servers, the actual data is not
>>>> transmitted using HTTP.
>>>>
>>>> I also disagree with the representation of DOIs as something that cannot be
>>>> looked up, though the question is not something I think we should make
>>>> readers think about.
>>>>
>>>> Hobby horse alert!
>>>>
>>>> To look up doi:10.1103/PhysRevD.89.032002 you have to:
>>>>
>>>> - strip the doi: scheme;
>>>>
>>>> - choose a resolver service (that you have to already know about);
>>>>
>>>> - append the remaining string to that base URL to get something like
>>>> http://dx.doi.org/10.1103/PhysRevD.89.032002
>>>>
>>>> - use HTTP to dereference it.
>>>>
>>>> If you choose a different base URI you might get something very
>>>> different (http://philarcher.org/10.1103/PhysRevD.89.032002 for
>>>> example ;-) )
>>>>
>>>> My intention when I included that was to point out that other identifier
>>>> schemes, DOIs being one of the best known, are not dereferenceable and
>>>> not (natively) part of the Web.
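
The lookup steps listed above can be sketched in a few lines of Python. The helper name doi_to_url is hypothetical; the default resolver is the dx.doi.org base URL quoted in the steps, and the final HTTP dereference is left to the caller so that nothing is fetched here.

```python
def doi_to_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    """Turn a doi: identifier into an HTTP URL for a chosen resolver."""
    prefix = "doi:"
    if not doi.lower().startswith(prefix):
        raise ValueError("expected an identifier starting with 'doi:'")
    # Step 1: strip the scheme. Steps 2 and 3: append the rest to the base URL.
    return resolver.rstrip("/") + "/" + doi[len(prefix):]


# Step 4 (dereferencing over HTTP) would be urllib.request.urlopen(url).
print(doi_to_url("doi:10.1103/PhysRevD.89.032002"))
print(doi_to_url("doi:10.1103/PhysRevD.89.032002", "http://philarcher.org"))
```

The second call shows the point being made: the result depends entirely on which base URL the caller already knows about.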
>>>>
>>>>
>>>>> * I would like this section to limit itself to information that applies to
>>>>> publishing *data*.
>>>>
>>>> It's about identifiers and identifiers are dumb strings, therefore I can't see
>>>> how we can talk about identifiers that only apply to data and not everything
>>>> else.
>>>>
>>>>
>>>> The BP is about assigning persistent identifiers to datasets, but the possible
>>>> approach to implementation is about much more than that.
>>>>
>>>> Yes, but that's for the reason just given.
>>>>
>>>> The list items are also not consistent. (one shows use of extensions,
>>>> another says not to do that).
>>>>
>>>> Fair enough, yes, I'd need to expand that and tie it back to the multiple
>>>> formats BP. I'd want to say something along the lines of:
>>>>
>>>> Use an identifier like http://data.example.org/doc/foo/bar to link to the
>>>> resource.
>>>>
>>>> Only include the file extension if it refers to a specific
>>>> representation of that resource, like
>>>> http://data.example.org/doc/foo/bar.rdf
>>>> http://data.example.org/doc/foo/bar.html
>>>>
>>>> (btw, a feature of w3.org's server set up is that we don't need to include
>>>> file extensions. A URL like http://www.w3.org/2013/share-psi/workshop/krems/report
>>>> actually returns a .php file (you can add the extension if you like).) We make
>>>> a lot of use of conneg.
>>>>
>>>>
>>>> I worry that this will open up a holy war about how to implement a REST API.
>>>>
>>>> OK, that we want to avoid and it's being dealt with in another thread. But I
>>>> am prepared to defend the general principles here - it's what marks out the
>>>> Web as a data platform and not a means of transmitting datasets that could
>>>> just as easily be transported by sending a USB stick in the post.
>>>>
>>>> Phil.
>>>>
>>>>
>>>> For tracker: this is issue-194
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> Phil Archer
>>>> W3C Data Activity Lead
>>>> http://www.w3.org/2013/data/
>>>>
>>>> http://philarcher.org
>>>> +44 (0)7887 767755
>>>> @philarcher1
>>
>>
>>
>>
>

--


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
Received on Thursday, 13 August 2015 08:38:21 UTC
