Re: Comments on "The OAI2LOD Server: Exposing OAI-PMH Metadata as Linked Data" from Bernhard Haslhofer on 2008-04-28 (semantic-web@w3.org from April 2008)

From: Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>
Date: Mon, 28 Apr 2008 22:26:45 +0200
To: Tim Berners-Lee <timbl@w3.org>
Cc: bernhard.schandl@univie.ac.at, SW-forum Web <semantic-web@w3.org>, MacKenzie Smith <kenzie@MIT.EDU>
Message-Id: <A174A5E4-A303-440B-915B-573D17D542D0@univie.ac.at>
Hi Tim,

thanks for your response. It is great to get some comments on our  
work :-)

On Apr 27, 2008, at 2:51 AM, Tim Berners-Lee wrote:
> Bernhardt and Bernhardt,
>
> I saw your article chumped on the SWIG IRC channel.
> I had been looking for almost exactly what you have produced, to  
> get into dspace and eprints systems.
>
> 1. Is it not practical to make a general gateway which, by  
> including the whole URI of the OAI endpoint in the URI in the  
> linked data mapping, I could use the gateway to access LOD about  
> any OAI resource in the world?
>
> I wonder whether it is the fact that you have to cache most of the  
> site. Why is that, for speed, or because you can't get all the  
> links you want by asking the OAI server, and so you have to have  
> a copy of the data as a graph?  Could those aspects of the data  
> which can be got from an OAI fetch be proxied at LOD request time,  
> and not cached permanently, to save memory?
>

We use the caching approach for two main reasons:

1.) We also want to provide selective access to metadata via SPARQL.
With OAI-PMH you can only fetch individual records or a whole list of
records and then apply selection criteria on the client side,
which is not an optimal solution. Of course, OAI-PMH was never meant
to be a protocol that supports structured queries; for that, the DL
community has other protocols. But the fact is that many small-
and medium-size institutions have OAI-PMH and will most likely
never provide any DL query protocol. Therefore we decided to take
what is there and simply cache the metadata.

2.) for linking data, we must analyse the source data set (i.e. the  
metadata coming from an OAI-PMH provider) and the target data set.  
For each linked target data source we must fetch the source data set  
at least once from the OAI-PMH data provider - so we already have the  
data at the client side and can also keep them stored. Furthermore,  
we must extend existing OAI-PMH metadata with links to other data -  
and we must store these links somewhere.

But you are right, a simple and probably also more scalable solution  
would be a gateway approach where a component exposes "proper" URIs  
for each item, translates the incoming HTTP requests to OAI-PMH  
specific HTTP requests, and simply uses some stylesheet to transform  
the data into RDF.

e.g.: http://oai.lcoa1.loc.gov/resources/item/oai:lcoa1.loc.gov:loc.gdc/gcfr
-> http://memory.loc.gov/cgi-bin/oai2_0?verb=GetRecord&identifier=oai:lcoa1.loc.gov:loc.gdc/gcfr.0018_0163&metadataPrefix=oai_dc
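To make the gateway idea concrete, here is a minimal Python sketch of the two steps: rewriting an incoming item path into an OAI-PMH GetRecord request, and turning the Dublin Core elements of the response into N-Triples. It is not how OAI2LOD works (OAI2LOD caches). The endpoint and path prefix are simply taken from the example above, the function names are made up for illustration, and the ElementTree walk stands in for the stylesheet.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote, urlencode

# Assumptions for illustration only, copied from the example URLs above.
ITEM_PREFIX = "/resources/item/"
OAI_ENDPOINT = "http://memory.loc.gov/cgi-bin/oai2_0"
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_oai_request(path, endpoint=OAI_ENDPOINT, metadata_prefix="oai_dc"):
    """Translate a linked-data item path into an OAI-PMH GetRecord URL."""
    if not path.startswith(ITEM_PREFIX):
        raise ValueError("not an item path: " + path)
    identifier = unquote(path[len(ITEM_PREFIX):])
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        },
        safe=":/",  # keep the oai: identifier readable in the query string
    )
    return endpoint + "?" + query


def dc_to_ntriples(xml_text, subject_uri):
    """Emit one N-Triples line per non-empty Dublin Core element."""
    root = ET.fromstring(xml_text)
    lines = []
    for el in root.iter():
        if el.tag.startswith("{" + DC_NS + "}") and el.text and el.text.strip():
            prop = DC_NS + el.tag.split("}", 1)[1]
            literal = (
                el.text.strip()
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            lines.append('<%s> <%s> "%s" .' % (subject_uri, prop, literal))
    return lines
```

A gateway built this way keeps no state, but it also cannot answer SPARQL queries or hold additional links, which is the trade-off described above.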

However, the idea of OAI2LOD is also to show that HTTP, URI, RDF (and  
SPARQL) can cover most of the functionality the OAI-PMH provides (+  
also provide means for structuring queries). So if anybody can  
convince DL solution providers to adopt these technologies directly  
(by installing something like D2RQ on top of their RDB), there  
wouldn't be any need for gateways or replicating solutions anymore -   
but that might take some time :-)


> One interesting issue is the fact that the instance of OAI2LOD  
> needs to be started with some background data. That makes an  
> automatic gateway difficult, unless there is some way of extracting  
> the data from the OAI server itself.

The current version (0.2) is more a demo than a production solution:
after startup it simply fetches and caches the data in memory.
Future versions will have some (hopefully scalable) triple store in
the backend.

>
> 2. Assuming now that you do have to run a separate OAI2LOD instance  
> for each OAI endpoint, do you think it would be a good idea to make  
> the convention that the URI
>
> 	oai:lcoa1.loc.gov:loc.gdc/gcfr.0018_0163
>
> is served from a server at a DNS  ("oai" dot (the DNS name in the  
> OAI URI))? Like
>
> 	http://oai.lcoa1.loc.gov/resources/item/oai:lcoa1.loc.gov:loc.gdc/gcfr.
>
> or even maybe like
>
> 	http://oai.lcoa1.loc.gov/item/loc.gdc/gcfr.

That is actually the goal - institutions that are willing to expose  
their data as linked data should install an OAI2LOD instance in their  
own system environment and redirect/rewrite URLs so they fit the  
scheme you describe above.
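The convention discussed here could be sketched as a small mapping function. This is only an illustration in Python of the two URI shapes proposed above; the function name and the `short` flag are hypothetical, and it assumes identifiers of the form oai:<DNS name>:<local part>.

```python
def oai_to_http(oai_id, short=False):
    """Map an oai: identifier to an HTTP URI on the host "oai." + DNS name."""
    parts = oai_id.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError("not an oai: identifier: " + oai_id)
    _, domain, local = parts
    host = "oai." + domain
    if short:
        # Shorter variant: only the local part appears in the path.
        return "http://%s/item/%s" % (host, local)
    # Longer variant: the full oai: identifier appears in the path.
    return "http://%s/resources/item/%s" % (host, oai_id)
```

A client or proxy could apply such a mapping once, when following an oai: link; everything after that would be plain http: names.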

>
> One could build into clients a mapping redirection, or in the short  
> term configure a generic proxy to do the redirection and configure  
> existing browsers to use that proxy for the oai: scheme.  It would  
> only happen when following an oai: link, as after that the client  
> would be in the world of http: names.
>
>
> 3. The use of "sameAs" to link the same work in different  
> repositories.  Is that really what you mean? It allows any  
> properties of one URI to be associated to the other URI.  So you  
> can't have any properties about the work which only apply to that  
> repository, like curation, persistence, etc
> I have created a sameWorkAs to get around this problem, in the  
> generic resource ontology
> http://www.w3.org/2006/gen/ont#sameWorkAs
> SameWorkAs should allow one to transfer properties of the generic  
> resource, like copyright holder, author, genre.  But not language,  
> curator, byte length, delivery format, etc, which vary repository  
> by repository would not transfer across sameWorkAs.
>

With version 0.2 of OAI2LOD you can in fact configure which "linking  
property" you would like to use for a certain linking rule. In the  
demos I set up - one links LOC data with DBPedia, the other Austrian  
National Library Data with DBPedia - we actually use rdfs:seeAlso  
properties because we haven't found any works that are the "sameAs"  
or the "sameWorkAs" other works.
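As a rough illustration of what a linking rule with a configurable property amounts to, here is a Python sketch. It does not reproduce the OAI2LOD configuration format or its matching logic: the function name is hypothetical, the exact case-insensitive label comparison is a stand-in for whatever similarity measure a real rule would use, and the sample data in the usage note is made up.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def link_records(source, target, link_property=RDFS_SEE_ALSO):
    """Link source URIs to target URIs whose labels match.

    source and target map a resource URI to a label (e.g. a title).
    Returns N-Triples lines using the configured linking property.
    """
    # Index the target data set by normalised label.
    index = {}
    for uri, label in target.items():
        index.setdefault(label.strip().lower(), []).append(uri)

    triples = []
    for uri, label in source.items():
        for match in index.get(label.strip().lower(), []):
            triples.append("<%s> <%s> <%s> ." % (uri, link_property, match))
    return triples
```

For example, `link_records({"http://example.org/item/1": "Vienna"}, {"http://dbpedia.org/resource/Vienna": "vienna"})` yields a single rdfs:seeAlso triple; passing a different `link_property` switches the rule to owl:sameAs or sameWorkAs without touching the matching step.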

The motivation for using "sameAs" was that two repositories might
maintain metadata for the same books, for instance. But of
course, these books would in fact not be the same books but rather
the same works. From the library community I know the FRBR model,
which provides a set of entities distinguishing between "work",
"expression", "manifestation", and "item"....


> The TAG discussed this issue recently.
>
> I'm on a plane or I would be tempted to try out OAI2LOD directly.
> (MacKenzie, have you tried this on MIT Dspace?)

There are also two running demos at
http://www.mediaspaces.info/tools/oai2lod/

If you try it out directly, please let us know any further comments,  
suggestions, etc.

>
> Tim


Best,
Bernhard

-- 
_______________________________________________________
Research Group Multimedia Information Systems
Department of Distributed and Multimedia Systems
Faculty of Computer Science
University of Vienna

Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
E-Mail: bernhard.haslhofer@univie.ac.at
WWW: http://www.cs.univie.ac.at/bernhard.haslhofer
Received on Monday, 28 April 2008 21:33:28 UTC