Re: A better solution for legacy IDs? from Tom Morris on 2011-12-14 (public-lld@w3.org from December 2011)

From: Tom Morris <tfmorris@gmail.com>
Date: Wed, 14 Dec 2011 11:43:27 -0500
To: Karen Coyle <kcoyle@kcoyle.net>
Cc: Adrian Pohl <pohl@hbz-nrw.de>, public-lld <public-lld@w3.org>
Message-ID: <CAE9vqEHmgYtg4y8frvioiycmnsU0OTYC0o4DdQoFAOU9vB9O4Q@mail.gmail.com>
OK, so based on the discussion I understand we're talking about a schema, not
an actual registry (graph).  It all boils down to where the namespace for
the identifier is encoded (namespace in the generic sense, not the RDF or XML
sense).  Possible options mentioned include:

1. Namespace is implicit in the definition of a property which is a
subproperty of dc:identifier (e.g. bibo:ISBN)
2. Namespace is encoded in a private datatype which is used with the literal
value of dc:identifier
3. Namespace is encoded in the URI path
  3a. info: URIs
  3b. URNs
  3c. dereferenceable (i.e. HTTP) URIs
4. Namespace is encoded in a special string notation (e.g. (CaSfIAP)1234567)

Both #1 and #2 require some type of registry of known namespaces.
Registries are a pain to manage (the "scaling" problem that Karen
mentioned).  #3 has a predefined hierarchical namespace which allows for
distributed assignment, but you need to find a spot to hook into the
hierarchy. Anything that requires non-standard string parsing (such as #4)
seems like a really bad idea.
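
To make the parsing concern concrete, here is a minimal Python sketch of
what every consumer of #4 would have to carry around. It assumes the
notation is always a parenthesized institution code followed directly by
the local record id; that grammar, and the name parse_legacy_id, are my
own guesses for illustration, not a published specification.

```python
import re

# Assumed shape of the option #4 notation: "(" + institution code + ")" + local id.
# This pattern is a guess for illustration, not a published grammar.
LEGACY_ID = re.compile(r"^\((?P<org>[^()]+)\)(?P<local>\S+)$")


def parse_legacy_id(value):
    """Split a string like "(CaSfIAP)1234567" into (org code, local id)."""
    match = LEGACY_ID.match(value.strip())
    if match is None:
        raise ValueError(f"not in (ORG)ID form: {value!r}")
    return match.group("org"), match.group("local")


print(parse_legacy_id("(CaSfIAP)1234567"))  # ('CaSfIAP', '1234567')
```

Any variation in the notation (spaces, nested parentheses, a different
delimiter) would need its own rule, whereas a URI can be handed to any
generic URI parser.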

There's a lot of overlap between info URIs [1] and URNs [2] in that
info:lccn/ could also be encoded as urn:nbn:us:lc or some such.  The
justification for info URIs [3] doesn't seem particularly compelling (or
accurate) and since URNs seem to have more momentum, let's focus on
them.

Antoine's
http://www.europeana.eu/portal/record/92056/BD9D5C6C6B02248F187238E9D7CC09EAF17BEA59.html
is also known as http://www.dlib.si/details/URN:NBN:SI:snd-56N8AX71 which
demonstrates that some people are using URNs for this type of thing already.

Karen's example suggested the use of a code from the MARC Code List for
Organizations [4] and there's also the ISIL registry [5].  Some
registration authorities such as the British Library are using the same
codes for both MARC and ISIL (which certainly makes sense).

I'd suggest taking advantage of these existing registries and getting
them hooked into the URN namespace.

Some possible examples:
  - urn:nbn:us:marc:casfiap or urn:isil:us:marc:casfiap for the Prelinger
Archive
  - urn:isil:uk:uklobm-1234 for IDs assigned by the British Museum
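
A rough Python sketch of how such URNs might be assembled from an
organization code and a local record id. Everything here is hypothetical:
the "urn:nbn:us:marc" prefix is just the illustrative form above and is not
a registered namespace, the ":" separator before the local id is an
assumption (the ISIL example above uses "-" instead), and legacy_id_to_urn
is a made-up helper name.

```python
from urllib.parse import quote


def legacy_id_to_urn(org_code, local_id, prefix="urn:nbn:us:marc", sep=":"):
    """Build a URN string from an organization code and a local record id.

    The prefix and separator are assumptions for illustration only;
    no such URN namespace is registered.
    """
    # Lowercase the organization code, as in the examples above, and
    # percent-encode any character outside the unreserved set.
    org = quote(org_code.lower(), safe="")
    local = quote(local_id, safe="")
    return f"{prefix}:{org}{sep}{local}"


print(legacy_id_to_urn("CaSfIAP", "1234567"))
# urn:nbn:us:marc:casfiap:1234567
```

Whatever the final syntax, the point is that the institution code comes
from an existing registry rather than a newly invented one.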

The trick is to find an amenable registration authority at the right spot
in the hierarchy and make the necessary justifications to them.

Tom

p.s. Universal identifiers (as suggested by Stuart) would be awesome.  This
XKCD pretty much sums up how these efforts usually evolve though
http://xkcd.com/927/

[1]
http://info-uri.info/registry/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc
[2] http://www.iana.org/assignments/urn-namespaces/urn-namespaces.xml
[3] http://info-uri.info/docs/misc/faq.html#use_urn
[4] http://www.loc.gov/marc/organizations/org-search.php
[5] http://biblstandard.dk/isil/

On Tue, Dec 13, 2011 at 9:30 AM, Karen Coyle <kcoyle@kcoyle.net> wrote:

> Adrian, thank you for providing a clearer statement of the problem!
>
> What remains, though, is how to scale the solution. For example, presuming
> that libraries begin to export some bibliographic data in RDF, it will be
> necessary to provide an identifier for each resource. One easy solution is
> to use the existing database record id. The practice in US libraries is to
> prepend the standard institutional identifier to the database record id to
> create a unique identifier:
>  (CaSfIAP)1234567
> That could be simply a value of dcterms:identifier since there is a known
> practice, but I'm not sure this is enough information for this identifier
> as it travels outside of "library space." Eventually the MARC institution
> codes will probably each have a URI, but until then.... Plus there are many
> other such identifiers that don't have a standard practice like we have in
> libraries.
>
> kc
>
>
> Quoting Adrian Pohl <pohl@hbz-nrw.de>:
>
>  For clarification: As I understand it there are two options for dealing
>> with legacy identifiers in RDF:
>> 1.) One could provide subproperties of dcterms:identifier like bibo:isbn
>> or europeana:localIdentifier.
>> 2.) One could mint URIs for each identifier in a specific namespace for
>> each identifier scheme.
>>
>> (BTW, combining these approaches in one triple using dcterms:identifier
>> or any of its subproperties isn't possible as rdfs:range of
>> dcterms:identifier is rdfs:Literal.)
>>
>> In the culturegraph project we tend toward the first approach, though no
>> vocabulary for the identifiers used has been published yet. I see a problem
>> in the second approach (minting URIs for legacy identifiers). The question
>> is: How would they be used? There are two ways of using these which both
>> might make sense (in the example I use info-URIs, one could also use HTTP
>> URIs):
>>
>> a) as identifiers for the identifier (e.g. ISBN) like in:
>> <http://lobid.org/resource/HT002948556>
>> bibo:isbn10 <info:0915145537> , <info:0915145529> .
>> b) as identifiers for the bibliographic resource like in:
>> <http://lobid.org/resource/HT002948556>
>> owl:sameAs <info:0915145537> , <info:0915145529> .
>>
>> I've already seen both approaches popping up in discussions (see [1]).
>> Regardless of the question whether owl:sameAs is the right property to use
>> it is clear that a problem might result if different people use the same
>> URI in different ways.
>>
>> I think the approach of the Open Citations project [2] combines both
>> approaches in a sensible way, i.e. journals are named by urn:issn URIs and
>> the RDF describing a journal looks like this (see [3] for the turtle
>> file):
>>
>> <urn:issn:1556-4681> a <http://purl.org/spar/fabio/Journal> ;
>>    prism:issn "1556-4681".
>>
>> Adrian
>>
>> [1] http://answers.semanticweb.com/questions/3572/xsd-or-vocabulary
>>
>> [2] http://opencitations.net/
>>
>> [3] http://opencitations.net/doc/?uri=urn%3Aissn%3A1556-4681&format=ttl
>>
>>  On 11.12.2011 at 19:50, in message
>> <20111211105005.14406nxoc3xqte8t@kcoyle.net>,
>> Karen Coyle <kcoyle@kcoyle.net> wrote:
>>
>>> I keep running into the same problem in different projects: we've got
>>> a bunch of legacy identifiers, like ISBNs, PMIDs, OCLC numbers, etc.
>>> It's important to carry them in the linked data that we are creating,
>>> but the maintenance agencies haven't provided them with URIs. That
>>> means we need to keep the base identifier string along with something
>>> that, well, identifies the identifier. I know that BIBO has BIBO:ISBN,
>>> etc., but it's just not going to work to create a separate property
>>> for each one of these, the number of them is too large.
>>>
>>> Has anyone developed and published a good "legacy identifier graph"
>>> that we could adopt? If not, would someone like to propose one?
>>>
>>> Thanks,
>>> kc
>>>
>>
>>
>>
>>
>>
>
>
> --
> Karen Coyle
> kcoyle@kcoyle.net http://kcoyle.net
> ph: 1-510-540-7596
> m: 1-510-435-8234
> skype: kcoylenet
>
>
>
Received on Wednesday, 14 December 2011 16:43:59 UTC