Re: [urn] fragment identifiers from Juha Hakala on 2011-03-10 (uri@w3.org from March 2011)

From: Juha Hakala <juha.hakala@helsinki.fi>
Date: Thu, 10 Mar 2011 14:28:48 +0200
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
CC: Peter Saint-Andre <stpeter@stpeter.im>, "uri@w3.org" <uri@w3.org>, urn@ietf.org
Message-ID: <4D78C400.8060308@helsinki.fi>

Hello Martin; all,

A few comments below.

Martin J. Dürst wrote:
> Hello Peter,
> 
> I have cross-posted to the URI list, because I think it's important to 
> get input from more experts. People on the URI list, this is about what 
> to do (or not to do) about fragment identifiers in URNs, raised in the 
> context of an update of RFC 2141.

For the URN community this issue is important because there are 
initiatives which are eager to use fragment identifiers. I have heard 
rumours that some are already using them. A typical use case would be a 
very complex object, such as a structured research data set, within 
which many kinds of data should be separately described, identified and 
retrieved.
> 
> On 2011/03/10 13:30, Peter Saint-Andre wrote:
>> <hat type='individual'/>
>>
>> On 3/9/11 2:11 AM, "Martin J. Dürst" wrote:
>>>
>>> On 2011/03/09 13:51, Peter Saint-Andre wrote:
> 
>>> Anyway, from a higher-up view, RFC2141bis is defining the "urn:" URI
>>> scheme, and URI scheme definitions in general are supposed to say
>>> nothing (or just a little in some exceptional cases) on fragment
>>> identifiers. The reason for this is that fragment identifiers are
>>> defined per MIME Media Type, not per URI scheme.
>>>
>>> So if I have something like "urn:foo:bar:baz#here", then the urn spec
>>> only has to say what "urn:foo:bar:baz" is supposed to mean, the meaning
>>> of "here" is defined by whatever format I might get back when resolving
>>> "urn:foo:bar:baz". If I have a browser that resolves (some) urns (I
>>> don't know one, but there should be some), this is what already happens,
>>> and it shouldn't and won't change. RFC2141bis doesn't have to say
>>> anything for this to work.
>>>
>>> In case RFC2141bis tries to do anything else than the above, that would
>>> be a very bad idea, and should be fixed quickly.
>>
>> Here is what RFC 3986 says:
>>
>>     The semantics of a fragment identifier are defined by the set of
>>     representations that might result from a retrieval action on the
>>     primary resource.  The fragment's format and resolution is therefore
>>     dependent on the media type [RFC2046] of a potentially retrieved
>>     representation, even though such a retrieval is only performed if the
>>     URI is dereferenced.  If no such representation exists, then the
>>     semantics of the fragment are considered unknown and are effectively
>>     unconstrained.  Fragment identifier semantics are independent of the
>>     URI scheme and thus cannot be redefined by scheme specifications.
>>
>> As far as I can see, the semantics of fragment identifiers in URNs would
>> not be defined by media types because URNs are not generally resolved
>> for the purpose of retrieving a representation.
> 
> "not generally" and "not" are not the same. Even for http: URIs, it's 
> true that they are not always resolved. So in that sense, if I use
> http://never_any_server_here.sw.it.aoyama.ac.jp/one/two/three
> with some fragment identifier (I'm in control of sw.it.aoyama.ac.jp and 
> make sure that there never is a server at 
> never_any_server_here.sw.it.aoyama.ac.jp), then I'm indeed unconstrained.
> 
> On the other hand, for quite a few URNs, it would make a lot of sense to 
> resolve them. Let's say I have set up some proxy or use some dedicated 
> browser that helps me resolve some URNs. Then the paragraph from RFC 
> 3986 that you cite above clearly applies.

Persistent identifiers will be used for multiple purposes, and by the 
time we assign e.g. a URN to a resource, we have no idea which 
resolution services will be needed in the (distant) future. The lifetime 
of a PID may be centuries; applications and the functionality they offer 
will change many times during such a period. And eventually even the 
copyright protection of a document will expire ;-).

Retrieving a representation is one of the key resolution services supplied 
already. But there does not need to be a 1:1 relation between a URN (or 
any other persistent identifier) and the URI (URL/URLs) it maps to via a 
resolution service.

For example, consider:

DOI: 10.1016/B978-0-240-81330-1.00007-5

This is a real Digital Object Identifier based on the ISBN of Tomlinson 
Holman's Sound for film and television (3rd ed.), but please note that 
this DOI does not identify the entire book, but just a chapter within 
it. The final section of the DOI suffix (00007-5) signifies the second 
chapter of the book. Each chapter has its own DOI, and they will most 
likely be available for purchase as individual files, so the URIs these 
DOIs resolve to will not have <fragment>s in them. But if the above 
"extended ISBN" were expressed as a URN, we might come up with something like:

URN:ISBN:978-0-240-81330-1#00007-5

if this were the way in which identifiers for book chapters were 
expressed according to the ISBN standard and in the ISBN namespace. This 
URN would then resolve to the same PDF file as the DOI above, either in 
the same digital library or in some other digital asset management 
system.
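
To make the syntax concrete, here is a minimal sketch of how such a URN 
could be split into its namespace identifier, namespace-specific string 
and <fragment>. The function name split_urn is hypothetical, and the 
sketch only illustrates the string syntax; it assumes nothing about what 
the ISBN standard or RFC2141bis will eventually allow.

```python
def split_urn(urn):
    """Split a URN into (namespace id, namespace-specific string, fragment).

    Illustrative only: the fragment is taken to be everything after the
    first '#', as in generic URI syntax.
    """
    base, _, fragment = urn.partition("#")
    parts = base.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn":
        raise ValueError("not a URN: " + urn)
    _, nid, nss = parts
    # The "urn" prefix and the namespace identifier are case-insensitive;
    # the namespace-specific string is left untouched.
    return nid.lower(), nss, fragment


nid, nss, fragment = split_urn("URN:ISBN:978-0-240-81330-1#00007-5")
print(nid, nss, fragment)
```

With the imaginary URN above, the book would be identified by the 
namespace-specific string and the chapter by the fragment.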

>> Therefore, in the
>> context of URNs, the semantics of the fragment would be considered
>> unknown and would be effectively unconstrained (at least from the
>> perspective of the 'urn:' URI scheme).
> 
> Non sequitur.
> 
>> 2141bis seems to imply that the semantics of the fragment identifier
>> could be constrained by the definition of a particular URN namespace
>> (despite the fact that they are not constrained by the 'urn:' URI scheme
>> itself).

Yes; some namespaces / identifier systems will not allow usage of 
<fragment> since the syntax of the identifier does not support such a 
thing. For instance, the example shown above

URN:ISBN:978-0-240-81330-1#00007-5, or ISBN string

ISBN 978-0-240-81330-1#00007-5

is imaginary, since the ISBN standard does not actually support this. DOI 
does, and one might also construct national bibliography numbers (NBNs) 
and consequently URNs which consist of an ISBN and a fragment identifier. 
Thus the DOI namespace (if one is registered in the future) and the NBN 
namespace should support <fragment>, if we are to give a free hand to 
people using these identifiers in the URN context.

> That would make at least some limited sense, if we could sort namespaces 
> by whether they (maybe only occasionally) allow resolution, or whether 
> they are absolutely and terminally never ever going to be used for 
> resolution. 

Based on what I have said before, I don't think that resolution is the 
crucial factor here. And if I am wrong and it is, then any namespace may 
allow resolution at some point in the future when the requirements of 
the user community change.

> But the last sentence from the paragraph you cite says:
> 
>                    Fragment identifier semantics are independent of the
>    URI scheme and thus cannot be redefined by scheme specifications.
> 
> This not only means that the URN spec (which is just the definition of 
> the 'urn:' URI scheme) cannot redefine fragment identifier semantics, it 
> also seems to imply that scheme specifications (including the URN spec) 
> cannot delegate such semantics to some subspaces of the scheme.

Yes.
> 
>> I'm not sure what the use cases are here, but perhaps folks on
>> the list could explain a bit more what they mean by reusing an
>> identifier scheme that designates objects of such complexity that it is
>> necessary to reference parts of the objects via fragment identifiers.

I can give one practical example from my own library.

Like many other national libraries, we digitise old books. The outcome 
of the process is a METS container, within which the full text of the 
book is stored in structured XML (METS/ALTO). The structure expresses 
chapters, and some information objects such as images.

Each chapter currently has its own URN:NBN, so in addition to being able 
to provide a persistent link to the title page of the book, such links 
can also be made to the chapters and other component parts of the book. 
We believe that some users will find such functionality useful (and they 
will also be happy when the URNs are still functional many years 
from now, unlike many URIs that were thought to be cool).

If usage of <fragment> is allowed in RFC2141bis and within the NBN 
namespace, we might change the current policy and assign just one 
URN:NBN to the book itself, and then fragment identifiers based on the 
NBN to the chapters and other component parts of the book. Our URN 
resolver would be able to map these URN:NBNs to the correct component 
parts within the METS container (or any other container standard we will 
rely on in the future).
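
As a rough illustration of that mapping, the sketch below looks up a 
URN:NBN with an optional <fragment> in a table of containers. Everything 
in it is an assumption made for the example: the URN 
urn:nbn:fi:example-0001, the file name, the fragment names and the 
component IDs are invented, and our actual resolver is not implemented 
this way.

```python
# Hypothetical resolver table: one URN:NBN per digitised book, with
# fragment names mapped to component IDs inside its METS container.
CONTAINERS = {
    "urn:nbn:fi:example-0001": {
        "container": "example-0001.mets.xml",
        "parts": {"ch1": "DIV_CHAPTER_1", "img3": "DIV_IMAGE_3"},
    },
}


def resolve(urn):
    """Return (container file, component ID or None) for a URN:NBN."""
    base, _, fragment = urn.partition("#")
    pieces = base.split(":", 2)
    if len(pieces) != 3 or pieces[0].lower() != "urn":
        raise ValueError("not a URN: " + urn)
    # Normalise only the case-insensitive "urn" prefix and namespace id.
    key = ":".join(["urn", pieces[1].lower(), pieces[2]])
    entry = CONTAINERS.get(key)
    if entry is None:
        raise KeyError("unknown URN: " + base)
    if not fragment:
        return entry["container"], None  # the book as a whole
    if fragment not in entry["parts"]:
        raise KeyError("unknown component: " + fragment)
    return entry["container"], entry["parts"][fragment]


print(resolve("URN:NBN:fi:example-0001#ch1"))
```

If the container standard changes, only the table would need to change; 
the URN and its fragments would stay the same.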

> I'm looking forward to hear from other people on this list, but 
> essentially even if there are very complex objects, there are always 
> different ways to identify components than using a '#'.

True - in our case, the National Library of Finland can continue the 
current policy and assign an NBN to each component part. Nevertheless, 
it may be a good idea to allow a choice between two different approaches. 
In some cases, using <fragment> can be more convenient than assigning 
individual identifiers. Research data sets come to mind; perhaps 
somebody from that community can describe the requirements?

Best regards,

Juha
> 
> Regards,   Martin.
> 

-- 

  Juha Hakala
  Senior advisor, standardisation and IT

  The National Library of Finland
  P.O.Box 15 (Unioninkatu 36, room 503), FIN-00014 Helsinki University
  Email juha.hakala@helsinki.fi, tel +358 50 382 7678
Received on Thursday, 10 March 2011 13:30:06 UTC