Re: Comments on "The OAI2LOD Server: Exposing OAI-PMH Metadata as Linked Data" from Hugh Glaser on 2008-04-27 (semantic-web@w3.org from April 2008)

From: Hugh Glaser <hg@ecs.soton.ac.uk>
Date: Sun, 27 Apr 2008 17:40:30 +0100
To: Tim Berners-Lee <timbl@w3.org>, Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>, "bernhard.schandl@univie.ac.at" <bernhard.schandl@univie.ac.at>
CC: SW-forum Web <semantic-web@w3.org>, MacKenzie Smith <kenzie@MIT.EDU>, Les Carr <lac@ecs.soton.ac.uk>, Ian Millard <icm@ecs.soton.ac.uk>
Message-ID: <C43A6B0E.243E1%hg@ecs.soton.ac.uk>
Great stuff.

Detailed comments below, but I think it may be fair to say that so far the
OAI community has been more concerned with creating and publishing the OAs,
rather than facilitating their complex use by sophisticated open agents such
as Semantic Web applications.
This is for very good reasons, of course, since there is a political and
social agenda in getting the data freed and building a community doing it. I
see a strong parallel with the LOD Project.
It is therefore timely for both communities to study the next stage (such
as OAI-ORE), where we can get leverage from the sources built.

The OAI is a rich source of metadata (and data) of the sort we need, and
we've been using eprints.ecs.soton.ac.uk as one of our sources for a while
now, so here is the rkbexplorer.com take:
(In what follows I may have got some of my OAI stuff wrong, and I am not so
up on dspace and fedora - sorry.)

On 27/04/2008 01:51, "Tim Berners-Lee" <timbl@w3.org> wrote:

>
>
> Bernhardt and Bernhardt,
>
> I saw your article chumped on the SWIG IRC channel.
> I had been looking for almost exactly what you have produced, to get
> into dspace and eprints systems.
>
> 1. Is it not practical to make a general gateway which, by including
> the whole URI of the OAI endpoint in the URI in the linked data
> mapping, I could use the gateway to access LOD about any OAI resource
> in the world?
>
> I wonder whether it is the fact that you have to cache most of the
> site. Why is that, for speed, or because you can't get all the links
> you want by asking the OAI server, and so you have to have a copy of
> the data as a graph?  Could those aspects of the data which can be got
> from an OAI fetch be proxied at LOD request time, and not cached
> permanently, to save memory?
>
> One interesting issue is the fact that the instance of OAI2LOD needs
> to be started with some background data. That makes an automatic
> gateway difficult, unless there is some way of extracting the data
> from the OAI server itself.
Clearly we would like to use OAI archives as LOD sites, by asking them to
publish as LD.
Since the eprints team coincidentally happens to also be at Southampton, if
I knew exactly how to spec that, I'm sure I could get them to. But until we
have some really good spec (which rich ontology [dc is not enough - see
below], what URIs?...) that will apply in general, it is hard to feel I can
ask them to put in the work.

So the LD worker has to do the mediation.
This means taking the metadata from the OA, and processing it.
A script such as OAI2LOD is required. But what from?
OAI-PMH specifies dc with optional extensions.
From the point of view of a rich LD site, this is disappointing. We really
need to get at all that good metadata in the OA site in a standard way.
As many LD people agree (I think?), normal dc output is difficult to use, as
it lacks the detailed structure and identifiers that are bread and butter to
LD.
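
For reference, a minimal sketch of how a harvester asks an OAI-PMH endpoint for that plain dc record. The verb and parameter names (GetRecord, identifier, metadataPrefix, oai_dc) come from the OAI-PMH protocol; the base URL and identifier are made-up placeholders, not a real repository.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL for one item.

    oai_dc is the unqualified Dublin Core format that every OAI-PMH
    repository must support; richer formats are optional extensions.
    """
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint and identifier, for illustration only.
url = get_record_url("http://repo.example.org/oai",
                     "oai:repo.example.org:item-1")
```

What comes back for oai_dc is flat dc elements with string values, which is why a richer export is attractive for LD purposes.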
Our own approach was to ask the eprints team to output their archive in yet
another form (after bibtex, endnote, refer, ...), which simply gave all
metadata as xml. And they obliged. So it is possible (via URI) to acquire
this for single papers or aggregates, (including their identifiers for
people, publications, etc) and run a script to convert to RDF against our
ontology. This output type is supported in eprints3. So we can provide
eprintsXML2RDF.php, which is analogous to OAI2LOD, and we now harvest from
other archives.
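
To illustrate the kind of conversion described above, here is a rough sketch that turns an XML metadata record into N-Triples. It is not eprintsXML2RDF.php: the record layout (record, title, creator, id attributes) is invented for the example, and Dublin Core and FOAF terms stand in for the ontology actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real eprints XML export is not reproduced.
SAMPLE = """<record id="http://example.org/eprint/1">
  <title>An Example Paper</title>
  <creator id="http://example.org/person/7">A. Author</creator>
</record>"""

DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"


def escape(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def record_to_ntriples(xml_text):
    """Convert one record into a list of N-Triples lines."""
    root = ET.fromstring(xml_text)
    paper = root.get("id")
    lines = []
    title = root.findtext("title")
    if title is not None:
        lines.append(f'<{paper}> <{DC}title> "{escape(title)}" .')
    for creator in root.findall("creator"):
        person = creator.get("id")
        # The point of the richer export: authors carry identifiers,
        # so the triple can link to a URI instead of a bare string.
        lines.append(f"<{paper}> <{DC}creator> <{person}> .")
        name = escape(creator.text or "")
        lines.append(f'<{person}> <{FOAF}name> "{name}" .')
    return lines


triples = record_to_ntriples(SAMPLE)
```

The difference from plain dc output is the second triple per creator: the person is a resource with its own URI, which later linkage can build on.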
And yes, we build a cache. So for example the publications listed by the
person http://id.ecs.soton.ac.uk/person/2686 can also be found at
http://southampton.rkbexplorer.com/id/person-02686
Should we? Probably not. But I think we are still exploring how to do these
things. Once we really know, it should be possible to push back to a
wrapper, and then back to the OAI providers for them to publish as LD (do
you agree, B&B?).
Also, we need to do a lot of processing over the OA entries to establish the
linkage. When we run our co-reference identification tools, by following
person-02686 above, you will find that even the original, carefully-curated
eprints site has 8 extra string versions of this author (before he joined
Southampton). And then we have identified 5 other LD sites with this author
in them, totalling over 70 URIs.
I suspect we would have to cache the data to do this level of processing,
but we have not yet done the experiments.
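
One way to picture why the whole dataset is needed: grouping URIs into sets that denote the same person requires seeing all the asserted equivalences at once. The sketch below uses a plain union-find over pairs of URIs. It is an illustration only and makes no claim about how the co-reference tools mentioned above work internally; the example.org URI is a placeholder.

```python
def coreference_bundles(pairs):
    """Group URIs into bundles, given pairs asserted to co-refer.

    Uses union-find: URIs connected through any chain of pairs end up
    in the same bundle.
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    bundles = {}
    for uri in list(parent):
        bundles.setdefault(find(uri), set()).add(uri)
    return list(bundles.values())


# The first pair uses the two URIs quoted above; the second is made up.
PAIRS = [
    ("http://id.ecs.soton.ac.uk/person/2686",
     "http://southampton.rkbexplorer.com/id/person-02686"),
    ("http://southampton.rkbexplorer.com/id/person-02686",
     "http://example.org/author/hypothetical-1"),
]
bundles = coreference_bundles(PAIRS)
```

Because equivalence is transitive, a pair discovered in one source can merge bundles built from other sources, which is hard to do when each record is only proxied at request time.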

>
> 2. Assuming now that you do have to run a separate OAI2LOD instance
> for each OAI endpoint, do you think it would be a good idea to make the
> convention that the URI
>
>         oai:lcoa1.loc.gov:loc.gdc/gcfr.0018_0163
>
> is served from a server at a DNS  ("oai" dot (the DNS name in the OAI
> URI))? Like
>
>
> http://oai.lcoa1.loc.gov/resources/item/oai:lcoa1.loc.gov:loc.gdc/gcfr.
>
> or even maybe like
>
>         http://oai.lcoa1.loc.gov/item/loc.gdc/gcfr.
>
> One could build into clients a mapping redirection, or in the short
> term configure a generic proxy to do the redirection and configure
> existing browsers to use that proxy for the oai: scheme.  It would
> only happen when following an oai: link, as after that the client
> would be in the world of http: names.
I'm sure some convention like this would be good.
But we should not lose sight of the fact there are a lot of other things to
be identified in an OAI repository. Having URIs for people, in particular,
is crucial. As far as I can see, the ability to uniquely identify an author
and editor has not been a strong issue in the OAI community, and we need to
encourage it.
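
The convention suggested in point 2 could be sketched as a simple rewrite from an oai: identifier to an http: URI. This assumes identifiers of the form oai:<DNS name>:<local part>; the function name and the example identifier are hypothetical, and neither URI pattern is an agreed standard.

```python
def oai_to_http(oai_id, short=True):
    """Map an oai: identifier to an http: URI on an "oai." host.

    short=True  gives http://oai.<host>/item/<local part>
    short=False gives http://oai.<host>/resources/item/<whole oai id>
    """
    # Split into at most three parts so colons in the local part survive.
    scheme, host, local = oai_id.split(":", 2)
    if scheme != "oai" or not host or not local:
        raise ValueError(f"not an oai identifier: {oai_id!r}")
    base = f"http://oai.{host}"
    if short:
        return f"{base}/item/{local}"
    return f"{base}/resources/item/{oai_id}"


# Hypothetical identifier, for illustration only.
example = oai_to_http("oai:repo.example.org:set/item-1")
```

A proxy or client configured with such a rule would only need it for the first hop; everything it dereferences afterwards is already an http: name.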
>
>
> 3. The use of "sameAs" to link the same work in different
> repositories.  Is that really what you mean? It allows any properties
> of one URI to be associated to the other URI.  So you can't have any
> properties about the work which only apply to that repository, like
> curation, persistence, etc
> I have created a sameWorkAs to get around this problem, in the generic
> resource ontology
> http://www.w3.org/2006/gen/ont#sameWorkAs
> SameWorkAs should allow one to transfer properties of the generic
> resource, like copyright holder, author, genre.  But language,
> curator, byte length, delivery format, etc., which vary repository by
> repository, would not transfer across sameWorkAs.
Thanks Tim, I was not aware of this (sorry!). This is clearly a great step
in the right direction - and I can see how we can use our CRS architecture
to generate sameWorkAs.
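
A small sketch of the distinction as described in point 3: under sameAs every property would carry over, whereas under sameWorkAs only properties of the generic work do. The property names and the GENERIC set below are illustrative labels taken from the examples in the text, not terms from the ontology itself.

```python
# Properties of the work itself, per the examples given in point 3.
GENERIC = {"copyrightHolder", "author", "genre"}


def merge_via_same_work_as(desc_a, desc_b):
    """Share generic properties between two descriptions of one work.

    Each description maps a property name to a set of values.
    Repository-specific properties (curator, language, ...) stay put.
    """
    shared = {}
    for desc in (desc_a, desc_b):
        for prop, values in desc.items():
            if prop in GENERIC:
                shared.setdefault(prop, set()).update(values)

    def enrich(desc):
        out = {prop: set(values) for prop, values in desc.items()}
        for prop, values in shared.items():
            out.setdefault(prop, set()).update(values)
        return out

    return enrich(desc_a), enrich(desc_b)


copy_a = {"author": {"A. Author"}, "curator": {"Repository A"}}
copy_b = {"genre": {"article"}, "curator": {"Repository B"}}
merged_a, merged_b = merge_via_same_work_as(copy_a, copy_b)
```

After the merge both descriptions know the author and genre, but each keeps only its own curator, which is exactly what a blanket sameAs would not preserve.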
>
> The TAG discussed this issue recently.
>
> I'm on a plane or I would be tempted to try out OAI2LOD directly.
> (MacKenzie, have you tried this on MIT Dspace?)
>
> Tim
>
>
As I said at the start, the OAI community seem to be going through a similar
stage to the LOD community, investigating interoperability, harvesting,
applications, aggregators, etc. (For the latest update see the recent Open
Repositories 2008, http://or08.ecs.soton.ac.uk/ .) Were the LOD community
more mature, we could simply suggest they publish in the right form for us,
and then use our stuff! Or maybe they are ahead of us, and we should use
theirs? Fortunately there are people who are in both communities, as I
believe that many of the same problems are being solved here*, and we must
work to ensure that we learn from each other.

Thank you if you got this far.

Hugh

--
Hugh Glaser,  Reader
              Dependable Systems & Software Engineering
              School of Electronics and Computer Science,
              University of Southampton,
              Southampton SO17 1BJ
Work: +44 (0)23 8059 3670, Fax: +44 (0)23 8059 3045
Mobile: +44 (0)78 9422 3822, Home: +44 (0)23 8061 5652
http://www.ecs.soton.ac.uk/~hg/


* Our work on the co-reference identification and storage (CRS) was
originally inspired by a thought experiment with Les Carr on what would
happen when all these institutions had their own repositories. The need was
not apparent, so we have deployed it with the LD. I suspect the need is now
becoming apparent.
Received on Sunday, 27 April 2008 16:42:08 UTC