[INFO] Channel view for “#simile” opened.
=== Highest connection count: 57 (56 clients)
-->| YOU have joined #simile
=-= Topic for #simile is “simile pi teleconf - em to be 10 min late :(”
=-= Topic for #simile was set by em on Fri Aug 01 2003 08:55:45 GMT-0700 (PDT)
-->| marbut_ (marbut@192.6.19.190) has joined #simile
|<-- marbut has left irc.w3.org (Connection reset by peer)
em dialing
marbut_ http://www.oclc.org/research/projects/rdf_interop/index.shtm
=-= em has changed the topic to “simile pi teleconf”
marbut_ http://wip.dublincore.org/source.html
marbut_ http://wip.dublincore.org:8080/interop/searchServlet
marbut_ KS: working on two things. One is a memorandum of understanding for support of the project; the other is reading about DOI to talk to John Ericsson about genesis
em http://www.w3.org/2002/04/12-amico/
marbut_ em: I'm still progressing on the sample data - see previous URL
marbut_ I've not heard back from the edutella folks nor the CIDOC folks
marbut_ There's a small collection of AMICO data available though
em http://sh.webhire.com/servlet/av/jd?ai=631&ji=1274969&sn=I
marbut_ As regards the hire, we are online, and off the w3c home site there is a pointer to the W3C position
marbut_ Mark: Next item - staged demonstrators - any feedback?
marbut_ KS: The result is we'll want to follow on with the demo development.
marbut_ Mark: So what's the best way to do the persistent store bit?
marbut_ KS: You can start with Jena and add stuff on it, or start with genesis, which has a slightly different API.
marbut_ There are some limits on the complexity of the graphs in genesis; we need to do some more work on a higher-level
marbut_ object API. We are working on this, but you need to figure out if the higher-level objects here are satisfactory.
marbut_ But what I anticipate you'll want to do is start by using Jena as a back end. That's how I anticipate it going.
marbut_ But in between, we'll try to make things compatible; this helps with the APIs
marbut_ we have an alpha-level implementation of the first level of genesis abstraction: how distribution is done, differences between local and remote
marbut_ searches. But as I understood it, distribution is not so important to the first demo,
marbut_ so I was planning on reserving the ability to do distribution, not implementing distribution right now, although I do have
marbut_ an implementation.
marbut_ em: I think it's a good idea; it's a small, accomplishable demonstrator, we can use it to tease out the team interaction,
marbut_ and it gives us some idea to compare Jena and Genesis. If I understand your diagrams, then some of the query / inference layers
marbut_ could be in the persistent store.
marbut_ In the OCLC project we did this by emacs; doing it with editors might be interesting, but this seems scoped so we can have an early end date,
marbut_ and I was hoping for before Christmas.
marbut_ mark: I'm hoping to do this before the hires are in place.
marbut_ em: let me offer some lessons learnt from the OCLC project
marbut_ when we asked for the data, we didn't ask if we could publish it, or make it available to others
marbut_ we need to make it clear that we want to make the data available, for other implementations,
marbut_ also there was a tremendous amount of data management that had to go on
marbut_ e.g. the XML was invalid; we tried to get diverse datasets, but we still had to do data cleanup, so we need to think about this also
marbut_ the other thing was picking your data. The focus was on diversity of datasets, but since the datasets were so small the specific overlaps
marbut_ were quite hard to tease out, so while the theory is good, trying to integrate small collections of diverse data was hard
marbut_ because in practice no-one is going to search that stuff. We need to get complementary collections that
marbut_ do have some overlap. I think the type of collections we are looking at are going to be better.
marbut_ The other thing we got burned on was performance. The way we did inference was more along the lines of OR-ing,
marbut_ but the performance was very poor. For example, imagine that rss.title is a subproperty of dc.title:
marbut_ say you want to search on dc.title="computers"; then you search for all the resources where dc.title="computers" or rss.title="computers",
marbut_ so it was done at the query level, not below, e.g. forward vs backward chaining.
marbut_ The problem was that with 1000 records and 4 or 5 subproperty relations, the performance became very slow, taking 6 or 7 seconds per response.
marbut_ so the last thing we learned was that this was a compelling example: even with the delays, even with just subproperty / equality relationships,
marbut_ it was compelling for groups trying to integrate data from lots of collections.
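The query-level inference em describes above (expanding a dc.title search into an OR over its subproperties, rather than chaining below the query) might be sketched as follows. This is a guess at the shape of the technique, not OCLC's code; the property names and triple layout are assumptions.

```python
# subproperty relation: child property -> parent property (assumed names)
SUBPROPERTY_OF = {"rss:title": "dc:title"}

# triples as (subject, predicate, object)
TRIPLES = [
    ("urn:doc1", "dc:title", "computers"),
    ("urn:doc2", "rss:title", "computers"),
    ("urn:doc3", "dc:creator", "computers"),
]

def subproperties_of(prop):
    """Return prop plus every property (transitively) declared beneath it."""
    props = {prop}
    changed = True
    while changed:  # fixed point over the subPropertyOf hierarchy
        changed = False
        for child, parent in SUBPROPERTY_OF.items():
            if parent in props and child not in props:
                props.add(child)
                changed = True
    return props

def search(prop, value):
    """Query-level inference: match against prop OR any of its subproperties."""
    wanted = subproperties_of(prop)
    return sorted(s for s, p, o in TRIPLES if p in wanted and o == value)

print(search("dc:title", "computers"))  # doc1 via dc:title, doc2 via rss:title
```

Each added subproperty widens the OR at query time, which is consistent with the slowdown reported above once 4 or 5 subproperty relations were in play.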
marbut_ mark: does it use a specific query tool in Jena?
marbut_ em: no, it doesn't use RDQL. Before OCLC started to use Jena, it had a toolkit called EOR that was similar;
marbut_ we had some fancy backend table representations for managing large scale triple stores
marbut_ e.g. s-p-o; the later one took Sergey Melnik's work, so we had routines that could work with a model or with a backend relational
marbut_ data store, and created an API that worked with the database, generating SQL queries to run over the database.
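A minimal sketch of the backend style described above: a single s-p-o table in a relational store, behind an API that translates triple patterns into SQL. This is illustrative only (SQLite, and the schema and function names are assumptions), not the EOR or Jena implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")  # s-p-o table

def add(s, p, o):
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (s, p, o))

def query(s=None, p=None, o=None):
    """Translate a triple pattern (None = wildcard) into a SQL query."""
    clauses, params = [], []
    for col, val in (("s", s), ("p", p), ("o", o)):
        if val is not None:
            clauses.append(f"{col} = ?")
            params.append(val)
    sql = "SELECT s, p, o FROM triples"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchall()

add("urn:doc1", "dc:title", "computers")
add("urn:doc1", "dc:creator", "em")
print(query(p="dc:title"))  # all dc:title triples
```

The same `query` API could sit over an in-memory model instead of the database, which is the "model or backend relational store" split mentioned above.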
marbut_ em: I think lots of things were slowing this down.
marbut_ ks: I'm not sure how we can avoid doing ORs
marbut_ em: I have some suggestions, but the project was focussed on getting something up
marbut_ it got a lot of interest, but it didn't move forward at OCLC
marbut_ one other lesson learnt, that gets back to genesis: there are 2 ways of viewing this. One of the areas we were exploring after that
marbut_ was adding the inference at data ingestion time, so you cache the inferences
marbut_ ks: that's the approach that haystack uses
marbut_ but it makes it harder to make on-the-fly changes to equivalence
marbut_ doing it even Adenine-style means you have to do a batch update
marbut_ em: yes, tradeoffs either way - for the applications that oclc was dealing with, not seeing realtime results for
marbut_ changing the mapping wasn't important, but of course you create a lot more data
marbut_ in this 3-month pilot, the majority of the time was spent on data massaging
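The ingestion-time alternative discussed above (forward chaining: materialize and cache the inferred triples when data arrives, so queries need no ORs) might look like this sketch. Property names are assumptions; note the tradeoff from the exchange above: changing SUBPROPERTY_OF afterwards requires a batch re-ingest.

```python
# subproperty relation: child property -> parent property (assumed names)
SUBPROPERTY_OF = {"rss:title": "dc:title"}
STORE = set()

def ingest(s, p, o):
    """Forward chaining: materialize every superproperty at insert time."""
    STORE.add((s, p, o))
    while p in SUBPROPERTY_OF:
        p = SUBPROPERTY_OF[p]
        STORE.add((s, p, o))  # cache the entailed triple alongside the original

def search(prop, value):
    """No OR expansion needed: inferred triples are already in the store."""
    return sorted(s for s, p, o in STORE if p == prop and o == value)

ingest("urn:doc2", "rss:title", "computers")
print(search("dc:title", "computers"))  # finds doc2 via the cached inference
```

Queries stay as simple single-predicate lookups, at the cost of extra stored data and batch updates when the mapping changes, which matches the tradeoff em and ks describe.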
marbut_ ks: I think the best way to do this would be to have built-in support for contains
marbut_ ks: keyword search has been done though; it's the inference that causes the problem, but I'm not sure if I can think of a good way to do inference
marbut_ em: yes, but that's why it may be important. When we see it working, we may think of optimizations. It will tease out how
marbut_ to merge controlled vocabularies and how to merge indices. So this is a useful scoped project to do this.
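One reading of ks's "built-in support for contains" above is pushing substring matching down into the store itself, e.g. a SQL LIKE over the object column, rather than filtering in application code. This is a guess at the idea, not anything the project actually built; the data and names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("urn:doc1", "dc:title", "History of Computers"),
    ("urn:doc2", "dc:title", "Gardening"),
])

def contains(p, keyword):
    """Keyword search handled inside the database via LIKE, not in app code."""
    sql = "SELECT s FROM triples WHERE p = ? AND o LIKE ?"
    return [row[0] for row in conn.execute(sql, (p, f"%{keyword}%"))]

print(contains("dc:title", "Computers"))  # matches doc1 only
```

Combining this with the subproperty expansion sketched earlier is where the performance questions from the OCLC experience would resurface.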
|<-- marbut_ has left irc.w3.org (Client exited)