More IRIX and Searching (LONG)

Hi

Well, if you want me to produce an IRIX and Jigsaw page I'll be more
than happy to ;-)

As for the records views stuff, I'll mirror my source tree soon for
people to have a look at....with the understanding that all of it is
work-in-progress !

****
SEARCHING
****

For some time now, I have been working on the following problem :

There are good database search engines. There are good WWW search
engines. There is not a good WWW *and* Database search engine.

Here are some thoughts (disorganised and rambling !) and some
suggestions for a possible implementation.

Why do we need one ? OO databases are very good for some queries, and
Jigsaw is a very interesting OO database. Given the CVS stuff mentioned
in Anselm's last reply, it is exactly what we want for our user
submission/annotation work. However, OO databases tend to be poor at
card-file type data.

MIDRIB, my current project (well, the one that pays me !) is a database
of medical images. Principally, it consists of a number of image
collections from various donors. It is targeted at education, and hence
the ability for teachers to make their own arbitrary collections is
very important. Hence, an OO database sounds really cool. We want
complex searching over field-restricted data as well, though. We need
complex database admin tools. We do semantic (UMLS) and thesaurus
expansion. We have had Indexplus thrust upon us as the database to use.
(Fair enough, it recently got a BCS medal for the sparse leaf B-tree
variant it uses, and it is blindingly fast at free text searches.) I
would like MIDRIB to be more than just a card file and arbitrary
collection system. I want to put in hypertext, and my own side project,
Tutorial Markup Language (a meta HTML-DTD supporting question
semantics).

So, my ideal system is an OO database with card file capabilities and a
sophisticated semantic search engine....

Let us take Jigsaw, add a defined search API to its resources, and
ensure that the search API can hand searches off to underlying
databases with high-efficiency search engines.

In addition, let us define a search result mechanism by which we can
flexibly but comprehensively report search results from possibly
foreign databases.
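
A minimal sketch of the hand-off idea, written in Python for brevity
rather than against Jigsaw's real Java classes. Every name here
(SearchBackend, InMemoryBackend, SearchableResource) is hypothetical and
not part of Jigsaw; the in-memory backend merely stands in for a real
engine such as an SQL or free-text database.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical adapter around an underlying database's search engine."""

    @abstractmethod
    def search(self, query):
        """Return a list with one metadata dict per hit."""


class InMemoryBackend(SearchBackend):
    """Stand-in for a real engine: naive case-insensitive substring match."""

    def __init__(self, records):
        self.records = records

    def search(self, query):
        needle = query.lower()
        return [r for r in self.records
                if any(needle in str(v).lower() for v in r.values())]


class SearchableResource:
    """Hypothetical resource with a search API. It hands the query to its
    own backend (if any) and to its children, so searching a container
    searches the whole subtree below it."""

    def __init__(self, name, backend=None, children=()):
        self.name = name
        self.backend = backend
        self.children = list(children)

    def search(self, query):
        hits = []
        if self.backend is not None:
            # Tag each hit with the resource that produced it.
            hits += [dict(h, source=self.name)
                     for h in self.backend.search(query)]
        for child in self.children:
            hits += child.search(query)
        return hits


images = SearchableResource("images", InMemoryBackend(
    [{"title": "Equine heart"}, {"title": "Canine lung"}]))
root = SearchableResource("root", children=[images])
print(root.search("heart"))
```

The point of the indirection is that the resource tree never needs to
know how a given backend evaluates the query; it only merges what comes
back.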

How do we do this ? Well, there are some existing standards out there
which we can use. Dublin Core (is it my imagination, or does the meta
data document on w3c use this as an example ? ;-) ) is a meta data
standard which is evolving. Whois++ gives a cross-database search
potential. The document 'draft-ietf-asid-whois-schema-00.txt' defines a
set of templates for meta data over Whois++. WAIS defines some complex
result reporting. All of these will prove useful. None of them can be
used directly as it stands.

I have a four-layer model for each collection to use (remote sites are
just remote collections; a collection is a searchable set of records).
It consists of the following replies to search requests (to be
discussed below) :

1) I exist
2) I exist and may contain x many relevant objects
3) I exist and contain x many relevant objects - here is the meta data
4) I exist, here is the hit meta data, and I can respond with the
   objects in protocol x, y or z
   
Obviously, the TTL, degree of confidence etc. need to be transmitted
as well. All replies should also contain the modifications to the
search - i.e. the difference between the requested search and what was
actually performed.
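
One way the four reply levels might be modelled, as a sketch only. The
field names (ttl_seconds, confidence, hit_count and so on) and the
at_level helper are assumptions made for illustration; nothing here is
a defined wire format.

```python
from dataclasses import dataclass, field, replace
from enum import IntEnum
from typing import Optional


class ReplyLevel(IntEnum):
    EXISTS = 1        # 1) I exist
    MAY_CONTAIN = 2   # 2) ... and may contain x many relevant objects
    HIT_METADATA = 3  # 3) ... and here is the meta data for the hits
    RETRIEVABLE = 4   # 4) ... and the objects can be fetched via protocols


@dataclass
class SearchReply:
    collection: str
    level: ReplyLevel
    requested_search: str
    actual_search: str            # what the collection really ran
    ttl_seconds: int = 0          # how long the reply may be trusted
    confidence: float = 0.0       # 0.0 to 1.0
    hit_count: Optional[int] = None
    hit_metadata: list = field(default_factory=list)
    protocols: list = field(default_factory=list)

    def search_was_modified(self):
        """True if the collection ran something other than what was asked."""
        return self.requested_search != self.actual_search

    def at_level(self, level):
        """Copy of this reply cut down to a lower level, dropping the
        fields that level does not carry."""
        level = ReplyLevel(min(level, self.level))
        return replace(
            self,
            level=level,
            hit_count=self.hit_count
            if level >= ReplyLevel.MAY_CONTAIN else None,
            hit_metadata=list(self.hit_metadata)
            if level >= ReplyLevel.HIT_METADATA else [],
            protocols=list(self.protocols)
            if level >= ReplyLevel.RETRIEVABLE else [],
        )


full = SearchReply(
    collection="example-collection",
    level=ReplyLevel.RETRIEVABLE,
    requested_search="'equine' AND 'heart'",
    actual_search="'equine' AND ( 'CARDIAC' OR 'HEART' )",
    ttl_seconds=3600, confidence=0.8, hit_count=4000,
    hit_metadata=[{"title": "example hit"}], protocols=["HTTP"],
)
minimal = full.at_level(ReplyLevel.MAY_CONTAIN)
print(minimal.level.name, minimal.hit_count, minimal.search_was_modified())
```

A client that only understands level 2 can be sent the cut-down copy,
while a richer client gets the full reply from the same object.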

This reply structure allows systems to interoperate at both a very
minimal level, and at a level where one unified user interface provides
all the search results from a wide range of different resources.

Search term description, I think, needs to be defined very carefully,
again with different levels of compliance, from simple 'containing this
word' to a more complex expression evaluator (I have the skeleton of
one in development). I suggest using a text string representation and a
common parser core which turns it into a tree representation, which may
then generate e.g. an SQL query to an underlying database. So we could
see something like this :

'Title=~Mobile Code and Network Applets' AND 'JAVA' AND NOT 'JIGSAW';
CaseSensitive=YES; Expand=Synonyms,Stemming ; Hints=NO
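
A rough sketch of the string -> tree -> SQL pipeline for a query of the
shape shown above. Several things are assumed here and not taken from
any agreed syntax: that '=~' means "this field contains", that
precedence is NOT > AND > OR, that unfielded terms search a column
called 'body', and that the table is called 'records'. This is not the
skeleton evaluator mentioned above, just an illustration.

```python
import re

# One token: a quoted term, a parenthesis, or an operator keyword.
TOKEN = re.compile(r"\s*(?:'([^']*)'|([()])|(AND|OR|NOT)\b)", re.IGNORECASE)


def split_options(query):
    """Split "expr; Key=Value; ..." into (expr, options).
    Naive: assumes no ';' inside quoted terms."""
    expr, *rest = [part.strip() for part in query.split(";")]
    return expr, dict(tuple(s.strip() for s in o.split("=", 1))
                      for o in rest if o)


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if match is None:
            raise ValueError("unexpected input: %r" % expr[pos:])
        term, paren, operator = match.groups()
        if term is not None:
            tokens.append(("TERM", term))
        else:
            tokens.append(((paren or operator).upper(),) * 2)
        pos = match.end()
    return tokens


def parse(tokens):
    """Recursive descent parser returning nested tuples,
    e.g. ("AND", left, right) or ("TERM", field_or_None, text)."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of search expression")
        pos += 1
        return tokens[pos - 1]

    def factor():
        kind, value = take()
        if kind == "NOT":
            return ("NOT", factor())
        if kind == "(":
            node = either()
            if take()[0] != ")":
                raise ValueError("expected ')'")
            return node
        if kind == "TERM":
            name, sep, text = value.partition("=~")
            return ("TERM", name.strip(), text) if sep else ("TERM", None, value)
        raise ValueError("unexpected token %r" % value)

    def both():  # AND level
        node = factor()
        while peek() == "AND":
            take()
            node = ("AND", node, factor())
        return node

    def either():  # OR level
        node = both()
        while peek() == "OR":
            take()
            node = ("OR", node, both())
        return node

    tree = either()
    if pos != len(tokens):
        raise ValueError("unexpected token %r" % tokens[pos][1])
    return tree


def to_sql(node, default_column="body"):
    """Return (where_clause, params) using '?' placeholders."""
    kind = node[0]
    if kind == "TERM":
        column = node[1] or default_column
        if not re.fullmatch(r"[A-Za-z_]\w*", column):
            raise ValueError("bad field name %r" % column)
        return "%s LIKE ?" % column, ["%" + node[2] + "%"]
    if kind == "NOT":
        sql, params = to_sql(node[1], default_column)
        return "NOT (%s)" % sql, params
    left, left_params = to_sql(node[1], default_column)
    right, right_params = to_sql(node[2], default_column)
    return "(%s %s %s)" % (left, kind, right), left_params + right_params


QUERY = ("'Title=~Mobile Code and Network Applets' AND 'JAVA' AND NOT 'JIGSAW';"
         " CaseSensitive=YES; Expand=Synonyms,Stemming ; Hints=NO")
expr, options = split_options(QUERY)
where, params = to_sql(parse(tokenize(expr)))
print("SELECT * FROM records WHERE " + where, params, options)
```

Because the tree sits between the text form and the SQL, a different
backend only needs its own equivalent of to_sql; the parser core stays
shared. The options (CaseSensitive, Expand, Hints) are parsed here but
not yet acted upon.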

Search results need to be carefully thought out too - we are reporting
results like this (not real syntax!) :

Results: 4000 ( best is 80% )
Original search: 'equine' AND 'heart' ; CaseSensitive=NO; Hints=YES
Actual search: 'equine' AND ( 'CARDIAC' OR 'HEART' ) ;
               CaseSensitive=NO; Hints=YES
Operations performed: Synonym expansion 'HEART' -> CARDIAC;
                      UMLS_vocabulary.
Hints: Synonym expansion 'EQUINE' -> HORSE,FOAL,GELDING ;
       UMLS_vocabulary,
       Database OMNI HTTP://omni.ac.uk/ ,
       Semantic net : VASCULAR; UMLS_vocabulary
 
Note that Results rating needs to be formalised, as do the other
functions such as thesaurus lookup etc. Note that I return the source
used for lookups as well.
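
To make the "report what was actually done" idea concrete, here is a
small sketch of synonym expansion that records each operation together
with the source used for the lookup. The thesaurus contents, the source
name 'example_thesaurus' and the function names are all made up for the
example; the output strings only imitate the not-real syntax above.

```python
def expand_query(terms, thesaurus, source):
    """AND together the given terms, widening any term that has synonyms
    into an OR group. Returns (actual_search, operations)."""
    clauses, operations = [], []
    for term in terms:
        synonyms = sorted(s.upper() for s in thesaurus.get(term.lower(), []))
        if synonyms:
            alternatives = synonyms + [term.upper()]
            clauses.append(
                "( " + " OR ".join("'%s'" % a for a in alternatives) + " )")
            # Record what was done and which source the lookup came from.
            operations.append("Synonym expansion '%s' -> %s; %s"
                              % (term.upper(), ",".join(synonyms), source))
        else:
            clauses.append("'%s'" % term)
    return " AND ".join(clauses), operations


def report(terms, thesaurus, source):
    """Build the part of a result report that shows how the requested
    search differs from the search actually performed."""
    actual, operations = expand_query(terms, thesaurus, source)
    return {
        "Original search": " AND ".join("'%s'" % t for t in terms),
        "Actual search": actual,
        "Operations performed": operations,
    }


result = report(["equine", "heart"], {"heart": ["cardiac"]},
                "example_thesaurus")
for key, value in result.items():
    print(key, ":", value)
```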
 
Other issues are caching search lookups, mirroring indices, caching
search results, parallel searching etc.
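
Two of these issues, parallel searching and caching of search results,
can be sketched together. This assumes each collection is just a
callable taking a query and returning hits, and that a simple
time-to-live is enough for cache expiry; CachingSearcher and its
parameters are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class CachingSearcher:
    """Query several collections in parallel, caching each collection's
    result per query until its time-to-live runs out."""

    def __init__(self, collections, ttl_seconds=60.0, clock=time.monotonic):
        self.collections = collections  # name -> callable(query) -> hits
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._cache = {}                # (name, query) -> (expiry, hits)

    def _search_one(self, name, query):
        key = (name, query)
        cached = self._cache.get(key)
        if cached is not None and cached[0] > self.clock():
            return cached[1]
        hits = self.collections[name](query)
        # Each worker thread writes a distinct key within one search call.
        self._cache[key] = (self.clock() + self.ttl_seconds, hits)
        return hits

    def search(self, query):
        """Return a dict mapping collection name to its hits."""
        workers = max(1, len(self.collections))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {name: pool.submit(self._search_one, name, query)
                       for name in self.collections}
            return {name: f.result() for name, f in futures.items()}


calls = []


def fake_collection(query):
    """Stand-in for a remote collection; records that it was queried."""
    calls.append(query)
    return [query.upper()]


searcher = CachingSearcher({"a": fake_collection, "b": fake_collection})
first = searcher.search("heart")
second = searcher.search("heart")  # served from the cache
print(first, len(calls))
```

A real version would want to take the TTL from each collection's reply
rather than from one global setting.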

I'd like to experiment with adding some of these facilities to the
basic Jigsaw Resource class... particularly as it has a lot of the
service code already installed ;-)

Some of these ideas had been implemented in my 'pre-Jigsaw' clone (now
replaced by the real thing ;-) ), whose structure had the same
'container' tree look, but also had a link database. Caveat - until
recently I was the only person working on this - and part time at
that..

Hopefully out of this message comes some interesting discussion....
 
Apologies for English, spelling, daftness of ideas, duplication of
other work as appropriate
 
Joel 
-- 
Joel.Crisp@bris.ac.uk | ets-webmaster@bris.ac.uk  | "I remember Babylon" -
Software Engineer, Institute of Learning and      |        Arthur C Clarke
Research Technology, University of Bristol, UK    |
http://www.ets.bris.ac.uk/                        |

Received on Wednesday, 27 November 1996 11:32:53 UTC