[INFO]  Channel view for “#simile” opened.
===  Highest connection count: 57 (56 clients)
-->|  YOU have joined #simile
=-=  Topic for #simile is “simile pi teleconf - em to be 10 min late :(”
=-=  Topic for #simile was set by em on Fri Aug 01 2003 08:55:45 GMT-0700 (PDT)
-->|  marbut_ (marbut@192.6.19.190) has joined #simile
|<--  marbut has left irc.w3.org (Connection reset by peer)
<em>  dialing
<marbut_>  http://www.oclc.org/research/projects/rdf_interop/index.shtm
=-=  em has changed the topic to “simile pi teleconf”
<marbut_>  http://wip.dublincore.org/source.html
<marbut_>  http://wip.dublincore.org:8080/interop/searchServlet
<marbut_>  KS: working on two things. One is a memorandum of understanding for support for the project; the other is reading about DOI to talk to John Ericsson about Genesis.
<em>  http://www.w3.org/2002/04/12-amico/
<marbut_>  em: I'm still progressing on the sample data - see previous URL
<marbut_>  I've not heard back from the Edutella folks nor the CIDOC folks
<marbut_>  There's a small collection of AMICO data available though
<em>  http://sh.webhire.com/servlet/av/jd?ai=631&ji=1274969&sn=I
<marbut_>  As regards the hire, we are online and off the W3C home site with a pointer to the W3C position
<marbut_>  Mark: Next item - staged demonstrators - any feedback?
<marbut_>  KS: The result is that we'll want to follow on with the demo development.
<marbut_>  Mark: So what's the best way to do the persistent store bit?
<marbut_>  KS: You can start with Jena and add stuff on it, or start with Genesis, which has a slightly different API.
<marbut_>  There are some limits on the complexity of the graphs in Genesis; we need to do some more work on a higher-level object API. We are working on this, but you need to figure out if the higher-level objects here are satisfactory.
<marbut_>  But what I anticipate you'll want to do is to start using Jena as a back end. That's how I anticipate it going.
<marbut_>  But in between, we'll try to make things compatible; this helps with the APIs.
<marbut_>  We have an alpha-level implementation of the first level of Genesis abstraction: how distribution is done, the differences between local and remote searches. But as I understood it, distribution is not so important for the first demo,
<marbut_>  so I was planning on reserving the ability to do distribution, not implementing distribution right now, although I have an implementation.
<marbut_>  em: I think it's a good idea; it's a small, accomplishable demonstrator. We can use it to tease out the team interaction,
<marbut_>  and it gives us some way to compare Jena and Genesis. If I understand your diagrams, then some of the query / inference layers could be in the persistent store.
<marbut_>  In the OCLC project we did this by emacs; doing it with editors might be interesting, but this seems scoped so we can have an early end date,
<marbut_>  and I was hoping for before Christmas.
<marbut_>  Mark: I'm hoping to do this before the hires are in place.
<marbut_>  em: Let me offer some lessons learnt from the OCLC project.
<marbut_>  When we asked for the data, we didn't ask if we could publish it or make it available to others.
<marbut_>  We need to make it clear that we want to make the data available for other implementations.
<marbut_>  Also, there was a tremendous amount of data management that had to go on:
<marbut_>  e.g. XML was invalid. We tried to get diverse datasets, but we still had to do data cleanup, so we need to think about this also.
<marbut_>  The other thing was picking your data. The focus was on diversity of datasets, but since the datasets were so small, the specific overlaps
<marbut_>  were quite hard to tease out. So while the theory is good, trying to integrate small collections of diverse data was hard,
<marbut_>  because in practice no one is going to search that stuff. We need to get complementary collections that
<marbut_>  do have some overlap. I think the type of collections we are looking at are going to be better.
<marbut_>  The other thing we got burned on was performance. The way we did inference was more along the lines of OR-ing,
<marbut_>  but the performance was very poor. For example, imagine that rss.title is a subproperty of dc.title:
<marbut_>  say you want to search for dc.title="computers"; then you search for all the resources where dc.title="computers" or rss.title="computers".
<marbut_>  So it was done at the query level, not below - e.g. forward vs. backward chaining.
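[A minimal Python sketch of the query-time "OR-ing" approach described above. The triples, property names, and helper functions are illustrative toy values, not the actual OCLC implementation: the query on a property is expanded into a disjunction over that property and its subproperties, rather than materialising any inferred triples.]

```python
# Query-time subproperty expansion: nothing is inferred at load time;
# instead each query is widened into an OR over the subproperties.
# Vocabulary and data below are toy examples, not the OCLC datasets.

SUBPROPERTY = {
    "rss:title": "dc:title",  # rss.title rdfs:subPropertyOf dc.title
}

def subproperties_of(prop):
    """All properties whose values should satisfy a query on `prop`."""
    return {prop} | {sub for sub, sup in SUBPROPERTY.items() if sup == prop}

def query(triples, prop, value):
    """Return subjects where `prop` (or any of its subproperties) equals `value`."""
    props = subproperties_of(prop)
    return {s for (s, p, o) in triples if p in props and o == value}

triples = [
    ("doc1", "dc:title", "computers"),
    ("feed1", "rss:title", "computers"),
    ("doc2", "dc:title", "networks"),
]

# A search for dc:title = "computers" matches both the dc and rss triples.
print(query(triples, "dc:title", "computers"))  # {'doc1', 'feed1'}
```

[With several subproperty relations the disjunction grows per query, which is consistent with the slowdown reported below.]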
<marbut_>  The problem was that with 1000 records and 4 or 5 subproperty relations, the performance became very slow; it was taking 6 or 7 seconds per response.
<marbut_>  So the last thing we learned was that this was a compelling example: even with the delays, even with subproperty / equality relationships,
<marbut_>  it was compelling for groups trying to integrate data from lots of collections.
<marbut_>  Mark: does it use a specific query tool in Jena?
<marbut_>  em: No, it doesn't use RDQL. Before OCLC started to use Jena, it had a toolkit called EOR that was similar.
<marbut_>  We had some fancy backend table representations for managing large-scale triple stores,
<marbut_>  e.g. s-p-o. The later one took Sergey Melnik's work, so we had routines that could work with a model or with a backend relational
<marbut_>  data store, and created an API that worked with the database, generating SQL queries to run over the database.
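[A minimal sketch of a relational s-p-o triple store of the kind described above: one table of (subject, predicate, object) rows, with an API call that translates a triple pattern into a SQL query. The schema and function names are illustrative assumptions, not the actual EOR or Jena layout.]

```python
# Toy s-p-o triple store over SQLite: a single three-column table,
# and a match() helper that turns a triple pattern (None = wildcard)
# into a parameterized SQL query. Schema is illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("doc1", "dc:title", "computers"),
        ("feed1", "rss:title", "computers"),
    ],
)

def match(s=None, p=None, o=None):
    """Translate a triple pattern into SQL and return the matching rows."""
    clauses, args = [], []
    for col, val in (("s", s), ("p", p), ("o", o)):
        if val is not None:
            clauses.append(f"{col} = ?")
            args.append(val)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return conn.execute("SELECT s, p, o FROM triples" + where, args).fetchall()

print(match(p="dc:title"))  # [('doc1', 'dc:title', 'computers')]
```

[Real systems of this era (per Melnik's work) also interned URIs into id tables rather than storing full strings per row; the single-table form is kept here for brevity.]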
<marbut_>  em: I think lots of things were slowing this down.
<marbut_>  KS: I'm not sure how we can avoid doing ORs.
<marbut_>  em: I have some suggestions, but the project was focussed on getting something up.
<marbut_>  It got a lot of interest, but it didn't move forward at OCLC.
<marbut_>  One other lesson learnt, which gets back to Genesis - there are two ways of viewing this. One of the areas we were exploring after that
<marbut_>  was adding the inference at data ingestion time, so you cache the inferences.
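[A minimal sketch of the alternative just mentioned: do the subproperty inference once at ingestion time and cache ("materialise") the inferred triples, so the query side needs no OR expansion. The vocabulary and data are illustrative assumptions.]

```python
# Ingestion-time materialisation: for each stored triple, also store
# the triples inferred via subPropertyOf chains, so lookups are plain
# equality tests. Toy vocabulary, not the OCLC/Haystack schemas.

SUBPROPERTY = {"rss:title": "dc:title"}  # subproperty -> superproperty

def ingest(raw_triples):
    """Store each triple plus its inferred superproperty triples."""
    store = set(raw_triples)
    for s, p, o in raw_triples:
        sup = p
        while sup in SUBPROPERTY:      # follow subPropertyOf chains upward
            sup = SUBPROPERTY[sup]
            store.add((s, sup, o))     # cached inference
    return store

store = ingest([("feed1", "rss:title", "computers")])
# The query side is now a plain lookup - no disjunction needed:
print(("feed1", "dc:title", "computers") in store)  # True
```

[As discussed below, the tradeoff is that a change to the property mappings requires a batch re-ingest, and the store holds more data.]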
<marbut_>  KS: That's the approach that Haystack uses,
<marbut_>  but it makes it harder to make on-the-fly changes to equivalence.
<marbut_>  Doing it even Adenine-style means you have to do a batch update.
<marbut_>  em: Yes, there are tradeoffs either way. For the applications that OCLC was dealing with, not seeing realtime results after
<marbut_>  changing the mapping wasn't important, but of course you create a lot more data.
<marbut_>  In this 3-month pilot, the majority of the time was spent massaging data.
<marbut_>  KS: I think the best way to do this would be to have built-in support for contains.
<marbut_>  KS: Keyword search has been done, though; it's the inference that causes the problem, and I'm not sure I can think of a good way to do inference.
<marbut_>  em: Yes, but that's why it may be important. When we see it working, we may think of optimizations. It will tease out how
<marbut_>  to merge controlled vocabularies and how to merge indices. So this is a useful scoped project to do this.
|<--  marbut_ has left irc.w3.org (Client exited)