Re: Author identifier co-reference Was: Re: Comments on "The OAI2LOD Server: ..." from Hugh Glaser on 2008-04-28 (semantic-web@w3.org from April 2008)

From: Hugh Glaser <hg@ecs.soton.ac.uk>
Date: Mon, 28 Apr 2008 01:30:55 +0100
To: Tim Berners-Lee <timbl@w3.org>, Les Carr <lac@ecs.soton.ac.uk>
CC: "Bruce D'Arcus" <bdarcus@gmail.com>, Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>, "bernhard.schandl@univie.ac.at" <bernhard.schandl@univie.ac.at>, SW-forum Web <semantic-web@w3.org>, MacKenzie Smith <kenzie@mit.edu>, Ian Millard <icm@ecs.soton.ac.uk>
Message-ID: <C43AD94F.2440F%hg@ecs.soton.ac.uk>
Since you ask :-)
I'll try to be brief and stick to abstraction.
The latest description can be found in Afraz' forthcoming paper at IRSW2008:
http://eprints.ecs.soton.ac.uk/15614/

Yes, the crunch comes in deciding to build with authorities or without.
Building with an assumption of authority is very different from without, and
unlikely to work well in the absence of an authority. On the other hand, a
system that does not expect or require authority can be built in such a way
as to benefit from any authorities available.
I choose the latter approach.


On 27/04/2008 23:05, "Tim Berners-Lee" <timbl@w3.org> wrote:

>
>
> On 2008-04-27, at 15:58, lac wrote:
>
>>
>>
>> On Sun, 27 Apr 2008 15:41:19 -0400, "Bruce D'Arcus"
>> <bdarcus@gmail.com>
>> wrote:
>>> I wouldn't be so quick to diplomatically brush this aside. The
>>> library
>>> world is finally taking steps into the semantic web [1], but there's
>>> still a lot of work to do here, and making name authorities suitable
>>> for the 21st century has to be a big one.
>>
>> Diplomatically acknowledge, by all means. But the crux of Hugh's
>> work is
>> that the idea of a single point of authority for names is completely
>> outdated and should be scrapped with extreme prejudice. But currently
>> digital librarians, repository implementers and funders are
>> promoting the
>> idea of a unified "author name authority" for the whole of Europe.
>
>
> I gave a talk a while ago at Crossref.org's annual meeting - these are
> the DOI folks, one of the players in the game to be Great One
> Denominator for works.  A question they asked was along the lines of:
> "We are desperately in need of identifiers for people.  But while we
> have the authority to mint identifiers for books, who are we to mint
> identifiers for people?".   In the  centralized model, indeed you have
> to go to be the one denominator or you don't denominate at all.  I
> tried to explain that it would be really useful for me if my publisher
> had a URI for me, and hung off it the things it knows about me, and it
> would not at all conflict with the fact that I have other identifiers.
>
> One way of looking at this in fact is that in a scale-free web you
> expect, and in fact optimize for, a situation where there are some
> major players which operate in a centralized fashion, but they are not
> unique, and you just  integrate them with everything else. Then when
> you want a name, you have a choice of whether to go for a rather
> costly bureaucratic EU-wide (or even ask the UN) name,  or use one
> from your local university, or just mint one yourself.  The existence
> of the various forms of name are useful in different ways. So so long
> as the unified "author name authority" isn't averse to there being
> others, it is fine.
>
> In practice, what is your feeling from CRS -- that in this case each
> repository should do its own cleaning up and present its own
> identifiers for authors, and then build a co-reference management
> system to connect those authors to authors in other repositories
> (including a Europe-wide one if someone wants to make one, but also
> including my RDF id in my FOAF file)?
> Or should each repository source possibly multiple ids for authors,
> and let an external system, maybe the authors themselves, clean up by
> using a co-reference service (or just a FOAF file) to generate the "I
> am this person" and "I am not this person" links?
My simple answer is that at least all of the above should be facilitated in
a single system.
The important thing in all this is that it is up to the SW
application/agent/whatever to decide what is a valid source of equivalence
statements (EQS), according to its trust/provenance/budget/context...
So, for example, our School might publish data about you at a resolvable
URI, and in doing so it should be able to assert equivalences that others
might use. These might be found for example in the same KB, a distinguished
place on the same domain, or in a link(s) in the RDF obtained by resolution.
(Having been a fellow-traveller in the hypertext world, I avoid the first
option, which is a bit like embedding such links in the same KB, so I expect
it to be in an external KB. As we only allow Linked Data, and URIs are
opaque, we avoid the second option, and use the third one.)
Whatever the placement, I call our version of the abstraction a CRS -
Consistent Reference Service, for sort of historical reasons.

There are some complexities, but basically a CRS is a service where you can
give it a URI, and it tells you other URIs that it thinks are equivalent.
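
As a minimal sketch of that abstraction (illustrative only, not the actual
CRS implementation): the class name, the method names and the example.org
style URIs below are all hypothetical, and grouping URIs into transitive
"bundles" is an assumption made for the sketch.

```python
class CRS:
    """Toy consistent reference service: holds URI equivalences, no other data."""

    def __init__(self):
        # URI -> set of URIs believed equivalent (one shared set per bundle)
        self._bundle = {}

    def assert_equivalent(self, a, b):
        """Record that this CRS believes `a` and `b` name the same thing."""
        bundle_a = self._bundle.get(a, {a})
        bundle_b = self._bundle.get(b, {b})
        if bundle_a is bundle_b:
            return  # already known to be equivalent
        merged = bundle_a | bundle_b
        for uri in merged:
            self._bundle[uri] = merged

    def equivalents(self, uri):
        """Return the other URIs this CRS thinks are equivalent to `uri`."""
        return set(self._bundle.get(uri, {uri})) - {uri}


# Hypothetical URIs, for illustration only.
crs = CRS()
crs.assert_equivalent("http://school.example/person/1",
                      "http://eprints.example/author/77")
crs.assert_equivalent("http://eprints.example/author/77",
                      "http://eu.example/id/42")
print(sorted(crs.equivalents("http://school.example/person/1")))
```

A URI the service has never seen simply comes back with no equivalents,
so nobody has to change the URIs they already use.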

With this component defined, everything else falls into place.
Note the important thing is that no resource has to re-index or re-key their data
by changing URIs, unless they want to.
Continuing the example in a temporal fashion, as one of the problems with an
authority is that it is not robust to structural changes in the data, which are
often brought on by temporal changes...
My School may be working quite hard to help me, and so publishes EQS against
a lot of other sources, including eprints archives elsewhere. It might even
provide a facility for me to interact with it (or give it my foaf file and
believe my sameAs assertions), and use the results.
My University brings up their own repository, but decides it currently has
little interest in establishing EQS. So no CRS.
Then the funding sources (eg EU) decide they want an authority, so they
bring up a CRS, and require my University to tell them about you. So they
bring up a CRS with an EQ for EU/University.
Our School is a bit slow in responding, but fortunately ISI is not, and
provides a service to link your EU URI with the School URI (and some
others). Unfortunately, they want to charge users money for it (it cost them
a lot to provide a high-quality CRS), so some agents won't pay for it.
Someone else provides one, but it is not very good, so agents won't use it
either. Fortunately, our School starts up the service, and adds the EU URI
to its CRS (and of course people will trust the CRS linked from the
resolvable URI).
Oxford finds it needs a URI for you, but does not feel the need to mint a
new one. It looks at this University and finds their URI for you, but
wonders if that is the one to choose. In theory it does not matter which it
chooses, as they will eventually be linked in the CRS world, but the
systems work better if there is general agreement (less need to consult
CRSes), and so it decides to use the EU one.
Finally we hit the clash of worlds.
NSF had the same idea as the EU: they were going to be the authority, and
told you what URI you were to use, and MIT decided to accept it. And neither
wants to give in. Fortunately, they don't have to. W3C decides it needs to
mediate, and brings up a CRS linking you in the EU to you in the US.
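
The scenario can be sketched from the agent's side as follows. This is only
an illustration under stated assumptions: the CRS names, the URIs and the
function `equivalents_via` are hypothetical, each CRS is reduced to a list
of URI pairs, and equivalence is treated as symmetric and transitive. The
point is that the agent, not the CRS, picks which sources it trusts.

```python
from collections import deque


def equivalents_via(uri, crses, trusted):
    """Closure of `uri` over the equivalence pairs of the trusted CRSes only."""
    neighbours = {}
    for name in trusted:
        for a, b in crses.get(name, []):
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    seen = {uri}
    queue = deque([uri])
    while queue:  # breadth-first walk across the trusted equivalences
        current = queue.popleft()
        for other in neighbours.get(current, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen - {uri}


# Hypothetical CRS contents, one list of URI pairs per service.
CRSES = {
    "school": [("http://school.example/person/1", "http://eu.example/id/42")],
    "w3c": [("http://eu.example/id/42", "http://nsf.example/id/7")],
    "isi": [("http://eu.example/id/42", "http://isi.example/author/9")],
}

me = "http://school.example/person/1"
# An agent that will not pay for the ISI service leaves it out.
print(sorted(equivalents_via(me, CRSES, trusted=["school", "w3c"])))
```

An agent with a different budget or trust policy passes a different
`trusted` list and gets a different, equally usable, view.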

Note that nothing except the original sites has anything other than URI
equivalences, so no data is being copied. A CRS is simply a straightforward
KB about URI equivalence.
And it does not have to be comprehensive and symmetric with the other
repositories of URIs it knows about.
What if a CRS is failing to respond in time because of overloading?
OK, the agent's knowledge of EQS may be damaged, but it can still proceed
quite well. And if it is interacting with systems that have recorded some
"canonical URIs" (another aspect of the CRSes I have not mentioned), there
is no loss.
Mirroring of CRSes is fine.
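
A sketch of that degradation, assuming each CRS is reachable through some
lookup callable that may time out (the function names and URIs here are
hypothetical, and the failure is simulated rather than a real network call):

```python
def gather_equivalents(uri, lookups):
    """Ask every CRS lookup for equivalents of `uri`, skipping any that fail."""
    found = set()
    unavailable = []
    for name, lookup in lookups.items():
        try:
            found |= set(lookup(uri))
        except (TimeoutError, ConnectionError):
            # The agent knows less, but carries on with what it has.
            unavailable.append(name)
    found.discard(uri)
    return found, unavailable


def school_crs(uri):
    # Stand-in for a CRS that answers promptly.
    return {"http://school.example/person/1", "http://eu.example/id/42"}


def overloaded_crs(uri):
    # Stand-in for a CRS that is too busy to answer.
    raise TimeoutError("CRS did not answer in time")


found, unavailable = gather_equivalents(
    "http://school.example/person/1",
    {"school": school_crs, "eu": overloaded_crs},
)
print(sorted(found), unavailable)
```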

So now I should answer your question about what are my feelings about such a
system!
It is possible to build it, and we have a SW application that uses it to
embrace EQS over our more than 20 triplestores to do considerable network
analysis (COPs) in a unified way. Without an explicit authority. But we have
not yet challenged it to use foreign (network-cost) CRSes. And scaling such
an application beyond 100 URIs for a single entity will get difficult at the
moment (hence the canonical URI, which can act as a transient authority).
But in an open system such as described, service providers can always step
into the market place to fill needs such as this.
For example, I would certainly expect someone to bring up a CRS of harvested
sameAs data from foaf files, perhaps while providing the foaf reverse index.
Another issue is about finding the RDF about URIs in CRSes. I live in the LD
world, so this can be done simply by resolving them. For finding it in other
places, search engines are the technology, but exactly what they do is the
subject of a future thread.

Very finally, note that nothing that has been said is type-specific. We have
considered the people problem, but as CRSes do not need to know anything
about type, the contents could just as easily be papers or institutions.

Sorry this has not been brief, but it is a complex problem, which one hopes
has a simple solution.
>
> Tim
>
> PS: (Yes, there is a huge difference of philosophy between the
> decentralized and the centralized mentalities, that the goal of a
> large organization can be regarded by others as a bug.  Or, as Simon
> put it, that "one man's sealing is another man's flaw".)
>
>
>>
>> ---
>> Les
>
>
Received on Monday, 28 April 2008 00:32:33 UTC