- From: Hugh Glaser <hg@ecs.soton.ac.uk>
- Date: Sun, 27 Apr 2008 17:40:30 +0100
- To: Tim Berners-Lee <timbl@w3.org>, Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>, "bernhard.schandl@univie.ac.at" <bernhard.schandl@univie.ac.at>
- CC: SW-forum Web <semantic-web@w3.org>, MacKenzie Smith <kenzie@MIT.EDU>, Les Carr <lac@ecs.soton.ac.uk>, Ian Millard <icm@ecs.soton.ac.uk>
Great stuff.

Detailed comments below, but I think it may be fair to say that so far the OAI community has been more concerned with creating and publishing the OAs than with facilitating their complex use by sophisticated open agents such as Semantic Web applications. This is for very good reasons, of course, since there is a political and social agenda in getting the data freed and building a community doing it. I see a strong parallel with the LOD Project. It is therefore timely for both communities to study the next stage (such as OAI-ORE), where we can get leverage from the sources built.

The OAI is a rich source of metadata (and data) of the sort we need, and we've been using eprints.ecs.soton.ac.uk as one of our sources for a while now, so here is the rkbexplorer.com take:

(In what follows I may have got some of my OAI stuff wrong, and I am not so up on dspace and fedora - sorry.)

On 27/04/2008 01:51, "Tim Berners-Lee" <timbl@w3.org> wrote:

> Bernhardt and Bernhardt,
>
> I saw your article chumped on the SWIG IRC channel.
> I had been looking for almost exactly what you have produced, to get
> into dspace and eprints systems.
>
> 1. Is it not practical to make a general gateway which, by including
> the whole URI of the OAI endpoint in the URI in the linked data
> mapping, I could use the gateway to access LOD about any OAI resource
> in the world?
>
> I wonder whether it is the fact that you have to cache most of the
> site. Why is that, for speed, or because you can't get all the links
> you want by asking the OAI server, and so you have to have a copy of
> the data as a graph? Could those aspects of the data which can be got
> from an OAI fetch be proxied at LOD request time, and not cached
> permanently, to save memory?
>
> One interesting issue is the fact that the instance of OAI2LOD needs
> to be started with some background data.
> That makes an automatic
> gateway difficult, unless there is some way of extracting the data
> from the OAI server itself.

Clearly we would like to use OAI archives as LOD sites, by asking them to publish as LD. Since the eprints team coincidentally happens to also be at Southampton, if I knew exactly how to spec that, I'm sure I could get them to. But until we have some really good spec (which rich ontology [dc is not enough - see below], what URIs?...) that will apply in general, it is hard to feel I can ask them to put in the work.

So the LD worker has to do the mediation. This means taking the metadata from the OA, and processing it. A script such as OAI2LOD is required. But what from? OAI-PMH specifies dc with optional extensions. From the point of view of a rich LD site, this is disappointing. We really need to get at all that good metadata in the OA site in a standard way. As many LD people agree (I think?), normal dc output is difficult to use, as it lacks the detailed structure and identifiers that are bread and butter to LD.

Our own approach was to ask the eprints team to output their archive in yet another form (after bibtex, endnote, refer, ...), which simply gave all metadata as xml. And they obliged. So it is possible (via URI) to acquire this for single papers or aggregates (including their identifiers for people, publications, etc) and run a script to convert to RDF against our ontology. This output type is supported in eprints3. So we can provide eprintsXML2RDF.php, which is analogous to OAI2LOD, and we now harvest from other archives.

And yes, we build a cache. So for example the publications listed by the person
http://id.ecs.soton.ac.uk/person/2686
can also be found at
http://southampton.rkbexplorer.com/id/person-02686

Should we? Probably not. But I think we are still exploring how to do these things.
Once we really know, it should be possible to push back to a wrapper, and then back to the OAI providers for them to publish as LD (do you agree, B&B?).

Also, we need to do a lot of processing over the OA entries to establish the linkage. If you follow person-02686 above, you will find that once we run our co-reference identification tools, even the original, carefully-curated eprints site has 8 extra string versions of this author (from before he joined Southampton). And then we have identified 5 other LD sites with this author in them, totalling over 70 URIs. I suspect we would have to cache the data to do this level of processing, but we have not yet done the experiments.

> 2. Assuming now that you do have to run a separate OAI2LOD instance
> for each OAI endpoint, do you think it would be a good idea to make the
> convention that the URI
>
> oai:lcoa1.loc.gov:loc.gdc/gcfr.0018_0163
>
> is served from a server at a DNS ("oai" dot (the DNS name in the OAI
> URI))? Like
>
> http://oai.lcoa1.loc.gov/resources/item/oai:lcoa1.loc.gov:loc.gdc/gcfr.
>
> or even maybe like
>
> http://oai.lcoa1.loc.gov/item/loc.gdc/gcfr.
>
> One could build into clients a mapping redirection, or in the short
> term configure a generic proxy to do the redirection and configure
> existing browsers to use that proxy for the oai: scheme. It would
> only happen when following an oai: link, as after that the client
> would be in the world of http: names.

I'm sure some convention like this would be good. But we should not lose sight of the fact there are a lot of other things to be identified in an OAI repository. Having URIs for people, in particular, is crucial. As far as I can see, the ability to uniquely identify an author and editor has not been a strong issue in the OAI community, and we need to encourage it.

> 3. The use of "sameAs" to link the same work in different
> repositories. Is that really what you mean? It allows any properties
> of one URI to be associated to the other URI.
> So you can't have any
> properties about the work which only apply to that repository, like
> curation, persistence, etc
> I have created a sameWorkAs to get around this problem, in the generic
> resource ontology
> http://www.w3.org/2006/gen/ont#sameWorkAs
> SameWorkAs should allow one to transfer properties of the generic
> resource, like copyright holder, author, genre. But language,
> curator, byte length, delivery format, etc, which vary repository by
> repository, would not transfer across sameWorkAs.

Thanks Tim, I was not aware of this (sorry!). This is clearly a great step in the right direction - and I can see how we can use our CRS architecture to generate sameWorkAs.

> The TAG discussed this issue recently.
>
> I'm on a plane or I would be tempted to try out OAI2LOD directly.
> (MacKenzie, have you tried this on MIT Dspace?)
>
> Tim

As I said at the start, the OAI community seems to be going through a similar stage to the LOD community, investigating interoperability, harvesting, applications, aggregators, etc. (For the latest update see the recent Open Repositories 2008, http://or08.ecs.soton.ac.uk/ .) Were the LOD community more mature, we could simply suggest they publish in the right form for us, and then use our stuff! Or maybe they are ahead of us, and we should use theirs?

Fortunately there are people who are in both communities, as I believe that many of the same problems are being solved here*, and we must work to ensure that we learn from each other.

Thank you if you got this far.
Hugh

--
Hugh Glaser, Reader
Dependable Systems & Software Engineering
School of Electronics and Computer Science,
University of Southampton,
Southampton SO17 1BJ
Work: +44 (0)23 8059 3670, Fax: +44 (0)23 8059 3045
Mobile: +44 (0)78 9422 3822, Home: +44 (0)23 8061 5652
http://www.ecs.soton.ac.uk/~hg/

* Our work on the co-reference identification and storage (CRS) was originally inspired by a thought experiment with Les Carr on what would happen when all these institutions had their own repositories. The need was not apparent, so we have deployed it with the LD. I suspect the need is now becoming apparent.
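
As an illustration of the URI convention raised in point 2 of the quoted message (serving an oai: identifier from a host named "oai." plus the DNS name inside the identifier), here is a minimal Python sketch. The function name oai_to_http and the path_prefix parameter are hypothetical, and the /item/ path follows only the second of the two URL shapes suggested there; as far as I know, neither OAI-PMH nor OAI2LOD defines this mapping.

```python
from urllib.parse import quote


def oai_to_http(oai_id: str, path_prefix: str = "item") -> str:
    """Map 'oai:<host>:<local-id>' to 'http://oai.<host>/<prefix>/<local-id>'.

    Hypothetical helper: this only sketches the convention proposed in the
    thread; it is not part of any OAI specification or tool.
    """
    scheme, sep, rest = oai_id.partition(":")
    if scheme != "oai" or not sep:
        raise ValueError(f"not an oai: identifier: {oai_id!r}")

    # The first ':' after the scheme separates the repository host
    # from the repository-local identifier.
    host, sep, local_id = rest.partition(":")
    if not host or not sep or not local_id:
        raise ValueError(f"expected oai:<host>:<local-id>, got {oai_id!r}")

    # Percent-encode the local part but keep '/' as a path separator.
    return f"http://oai.{host}/{path_prefix}/{quote(local_id, safe='/')}"


print(oai_to_http("oai:lcoa1.loc.gov:loc.gdc/gcfr.0018_0163"))
```

A client or generic proxy could apply such a function only when it first follows an oai: link; after that redirection everything is an ordinary http: name.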
Received on Sunday, 27 April 2008 16:42:08 UTC