[INFO] Channel view for “#simile” opened.
=== Highest connection count: 57 (56 clients)
-->| YOU have joined #simile
=-= Topic for #simile is “simile pi teleconf - em to be 10 min late :(”
=-= Topic for #simile was set by em on Fri Aug 01 2003 08:55:45 GMT-0700 (PDT)
-->| marbut_ (marbut@192.6.19.190) has joined #simile
|<-- marbut has left irc.w3.org (Connection reset by peer)
em dialing
marbut_ http://www.oclc.org/research/projects/rdf_interop/index.shtm
=-= em has changed the topic to “simile pi teleconf”
marbut_ http://wip.dublincore.org/source.html
marbut_ http://wip.dublincore.org:8080/interop/searchServlet
marbut_ KS: working on two things. One is a memorandum of understanding for support of the project; the other is reading about DOI to talk to John Ericsson about genesis
em http://www.w3.org/2002/04/12-amico/
marbut_ em: I'm still progressing on the sample data - see previous URL
marbut_ I've not heard back from the edutella folks nor the CIDOC folks
marbut_ There's a small collection of AMICO data available though
em http://sh.webhire.com/servlet/av/jd?ai=631&ji=1274969&sn=I
marbut_ As regards the hire, we are online, and off the w3c home site there is a pointer to the W3C position
marbut_ Mark: Next item - staged demonstrators - any feedback?
marbut_ KS: The result is we'll want to follow on with the demo development.
marbut_ Mark: So what's the best way to do the persistent store bit?
marbut_ KS: You can start with Jena and add stuff on it, or start with genesis, which has a slightly different API.
marbut_ There are some limits on the complexity of the graphs in genesis; we need to do some more work on a higher-level
marbut_ object API. We are working on this, but you need to figure out if the higher-level objects here are satisfactory.
marbut_ But what I anticipate you'll want to do is start by using Jena as a back end. That's how I anticipate it going.
marbut_ But in between, we'll try to make things compatible; this helps with the APIs
marbut_ we have an alpha-level implementation of the first level of genesis abstraction: how distribution is done, differences between local and remote
marbut_ searches. But as I understood it, distribution is not so important to the first demo,
marbut_ so I was planning on reserving the ability to do distribution, not implementing distribution right now, although I do have
marbut_ an implementation.
marbut_ em: I think it's a good idea; it's a small, accomplishable demonstrator, we can use it to tease out the team interaction,
marbut_ and it gives us some idea to compare Jena and Genesis. If I understand your diagrams, then some of the query / inference layers
marbut_ could be in the persistent store.
marbut_ In the OCLC project we did this by emacs; doing it with editors might be interesting, but this seems scoped so we can have an early end date,
marbut_ and I was hoping for before Christmas.
marbut_ mark: I'm hoping to do this before the hires are in place.
marbut_ em: let me offer some lessons learnt from the OCLC project
marbut_ when we asked for the data, we didn't ask if we could publish it, or make it available to others
marbut_ we need to make it clear that we want to make the data available, for other implementations,
marbut_ also there was a tremendous amount of data management that had to go on
marbut_ e.g. the XML was invalid; we tried to get diverse datasets, but we still had to do data cleanup, so we need to think about this also
marbut_ the other thing was picking your data. The focus was on diversity of datasets, but since the datasets were so small the specific overlaps
marbut_ were quite hard to tease out, so while the theory is good, trying to integrate small collections of diverse data was hard
marbut_ because in practice no-one is going to search that stuff. We need to get complementary collections that
marbut_ do have some overlap. I think the type of collections we are looking at are going to be better.
marbut_ The other thing we got burned on was performance. The way we did inference was more along the lines of OR-ing,
marbut_ but the performance was very poor. For example, imagine that rss.title is a subproperty of dc.title:
marbut_ say you want to search on dc.title="computers"; then you search for all the resources where dc.title="computers" or rss.title="computers",
marbut_ so it was done at the query level, not below, e.g. forward vs backward chaining.
marbut_ The problem was that with 1000 records and 4 or 5 subproperty relations, the performance became very slow, taking 6 or 7 seconds per response.
marbut_ so the last thing we learned was that this was a compelling example: even with the delays, even with just subproperty / equality relationships,
marbut_ it was compelling for groups trying to integrate data from lots of collections.
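The query-level inference em describes above (expanding a dc.title search into an OR over its subproperties, rather than chaining below the query) might be sketched as follows. This is a guess at the shape of the technique, not OCLC's code; the property names and triple layout are assumptions.

```python
# subproperty relation: child property -> parent property (assumed names)
SUBPROPERTY_OF = {"rss:title": "dc:title"}

# triples as (subject, predicate, object)
TRIPLES = [
    ("urn:doc1", "dc:title", "computers"),
    ("urn:doc2", "rss:title", "computers"),
    ("urn:doc3", "dc:creator", "computers"),
]

def subproperties_of(prop):
    """Return prop plus every property (transitively) declared beneath it."""
    props = {prop}
    changed = True
    while changed:  # fixed point over the subPropertyOf hierarchy
        changed = False
        for child, parent in SUBPROPERTY_OF.items():
            if parent in props and child not in props:
                props.add(child)
                changed = True
    return props

def search(prop, value):
    """Query-level inference: match against prop OR any of its subproperties."""
    wanted = subproperties_of(prop)
    return sorted(s for s, p, o in TRIPLES if p in wanted and o == value)

print(search("dc:title", "computers"))  # doc1 via dc:title, doc2 via rss:title
```

Each added subproperty widens the OR at query time, which is consistent with the slowdown reported above once 4 or 5 subproperty relations were in play.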
marbut_ mark: does it use a specific query tool in Jena?
marbut_ em: no, it doesn't use RDQL. Before OCLC started to use Jena, it had a toolkit called EOR that was similar;
marbut_ we had some fancy backend table representations for managing large scale triple stores
marbut_ e.g. s-p-o; the later one took Sergey Melnik's work, so we had routines that could work with a model or with a backend relational
marbut_ data store, and created an API that worked with the database, generating SQL queries to run over the database.
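A minimal sketch of the backend style described above: a single s-p-o table in a relational store, behind an API that translates triple patterns into SQL. This is illustrative only (SQLite, and the schema and function names are assumptions), not the EOR or Jena implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")  # s-p-o table

def add(s, p, o):
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (s, p, o))

def query(s=None, p=None, o=None):
    """Translate a triple pattern (None = wildcard) into a SQL query."""
    clauses, params = [], []
    for col, val in (("s", s), ("p", p), ("o", o)):
        if val is not None:
            clauses.append(f"{col} = ?")
            params.append(val)
    sql = "SELECT s, p, o FROM triples"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchall()

add("urn:doc1", "dc:title", "computers")
add("urn:doc1", "dc:creator", "em")
print(query(p="dc:title"))  # all dc:title triples
```

The same `query` API could sit over an in-memory model instead of the database, which is the "model or backend relational store" split mentioned above.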
marbut_ em: I think lots of things were slowing this down.
marbut_ ks: I'm not sure how we can avoid doing ORs
marbut_ em: I have some suggestions, but the project was focussed on getting something up
marbut_ it got a lot of interest, but it didn't move forward at OCLC
marbut_ one other lesson learnt, that gets back to genesis: there are 2 ways of viewing this. One of the areas we were exploring after that
marbut_ was adding the inference at data ingestion time, so you cache the inferences
marbut_ ks: that's the approach that haystack uses
marbut_ but it makes it harder to make on-the-fly changes to equivalence
marbut_ doing it even Adenine-style means you have to do a batch update
marbut_ em: yes, tradeoffs either way - for the applications that oclc was dealing with, not seeing realtime results for
marbut_ changing the mapping wasn't important, but of course you create a lot more data
marbut_ in this 3-month pilot, the majority of the time was spent on data massaging
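The ingestion-time alternative discussed above (forward chaining: materialize and cache the inferred triples when data arrives, so queries need no ORs) might look like this sketch. Property names are assumptions; note the tradeoff from the exchange above: changing SUBPROPERTY_OF afterwards requires a batch re-ingest.

```python
# subproperty relation: child property -> parent property (assumed names)
SUBPROPERTY_OF = {"rss:title": "dc:title"}
STORE = set()

def ingest(s, p, o):
    """Forward chaining: materialize every superproperty at insert time."""
    STORE.add((s, p, o))
    while p in SUBPROPERTY_OF:
        p = SUBPROPERTY_OF[p]
        STORE.add((s, p, o))  # cache the entailed triple alongside the original

def search(prop, value):
    """No OR expansion needed: inferred triples are already in the store."""
    return sorted(s for s, p, o in STORE if p == prop and o == value)

ingest("urn:doc2", "rss:title", "computers")
print(search("dc:title", "computers"))  # finds doc2 via the cached inference
```

Queries stay as simple single-predicate lookups, at the cost of extra stored data and batch updates when the mapping changes, which matches the tradeoff em and ks describe.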
marbut_ ks: I think the best way to do this would be to have built-in support for contains
marbut_ ks: keyword search has been done though; it's the inference that causes the problem, but I'm not sure if I can think of a good way to do inference
marbut_ em: yes, but that's why it may be important. When we see it working, we may think of optimizations. It will tease out how
marbut_ to merge controlled vocabularies and how to merge indices. So this is a useful scoped project to do this.
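One reading of ks's "built-in support for contains" above is pushing substring matching down into the store itself, e.g. a SQL LIKE over the object column, rather than filtering in application code. This is a guess at the idea, not anything the project actually built; the data and names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("urn:doc1", "dc:title", "History of Computers"),
    ("urn:doc2", "dc:title", "Gardening"),
])

def contains(p, keyword):
    """Keyword search handled inside the database via LIKE, not in app code."""
    sql = "SELECT s FROM triples WHERE p = ? AND o LIKE ?"
    return [row[0] for row in conn.execute(sql, (p, f"%{keyword}%"))]

print(contains("dc:title", "Computers"))  # matches doc1 only
```

Combining this with the subproperty expansion sketched earlier is where the performance questions from the OCLC experience would resurface.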
|<-- marbut_ has left irc.w3.org (Client exited)