More fulltext advocacy (was Re: Lee's feature proposal)

On Friday 01 May 2009 06:27:54 Lee Feigenbaum wrote:
>      * Full text. The survey indicated strong support for standardizing
> the syntax and semantics for full text search in SPARQL. While I believe
> that this is one of the top interoperability stumbling blocks for
> SPARQL, the wide-open design space (both for syntax and semantics) of
> the problem worries me.

Indeed it has a very open design space, but I think we should look into how 
fulltext search is used and let that guide the implementation now.

I think it would be very unfortunate not to have any standardised fulltext
capability in SPARQL, as it signals that "if you have a search box on your
site that is used extensively by your users, then SPARQL is not suitable for
you". Even if there are extensions that do free text, this is a message
that I, for one, would be very concerned about, as it applies to most of the
current web.

We sometimes match strings with regular expressions, but never with exact 
string match. Regular expressions are far too flexible to be useful in many 
contexts.

All we have used so far can be summarised as follows:
1) Terms shorter than three characters are ignored.
2) A single term is matched exactly against a whole word.
3) A single term ending in an asterisk is matched against words beginning
with the term.
4) Multiple terms with AND match all words in any order.
5) Multiple terms with OR match any of the words in any order.
6) Multiple terms without an operator match all words in the given order.
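
To make the six rules above concrete, here is a minimal sketch in Python.
It is not what any SPARQL engine implements, only an illustration. The
function name simple_contains, case-insensitive matching, splitting text on
non-word characters, counting term length without the trailing asterisk, not
mixing AND and OR in one query, and reading "in the given order" as "in that
order but not necessarily adjacent" are all assumptions of mine.

```python
import re

MIN_TERM_LENGTH = 3  # rule 1: shorter terms are ignored


def _words(text):
    """Split text into lowercase words (assumed tokenisation)."""
    return re.findall(r"\w+", text.lower())


def _term_matches(term, word):
    """Rule 3: a trailing asterisk means prefix match; rule 2: otherwise exact."""
    if term.endswith("*"):
        return word.startswith(term.rstrip("*"))
    return word == term


def _in_order(terms, words):
    """Rule 6: every term must match, in the given order (gaps allowed)."""
    position = 0
    for term in terms:
        while position < len(words) and not _term_matches(term, words[position]):
            position += 1
        if position == len(words):
            return False
        position += 1
    return True


def simple_contains(text, query):
    """Hypothetical 'contains' function following rules 1 to 6."""
    tokens = query.split()
    # Detect the operator before dropping short tokens, since "OR" is short.
    if "OR" in tokens:
        mode = "or"
    elif "AND" in tokens:
        mode = "and"
    else:
        mode = "ordered"
    terms = [t.lower() for t in tokens if t not in ("AND", "OR")]
    terms = [t for t in terms if len(t.rstrip("*")) >= MIN_TERM_LENGTH]
    if not terms:
        return False  # assumption: a query with no usable terms matches nothing
    words = _words(text)
    if mode == "or":   # rule 5
        return any(any(_term_matches(t, w) for w in words) for t in terms)
    if mode == "and":  # rule 4
        return all(any(_term_matches(t, w) for w in words) for t in terms)
    return _in_order(terms, words)  # rules 2, 3 and 6
```

With the text "Standardising full text search in SPARQL", the query "sear*"
matches while "sear" does not, and "full search" matches while "search full"
does not.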

At some point, we had phrase search too, which is a nice feature but I think 
we dropped it.

Here, there is no XQuery and only a small subset of what Lucene does; there
is no advanced stemming, just plain string matching with some permutations of
terms. Yet, in our experience, it covers most of what people do.

Also, forward compatibility can be kept by defining different functions for
different matching rules: we could have a simple contains function now, and
SPARQL 1.2 could adopt ftcontains in addition if the group so wishes.

In summary, the design space can be constrained to something small. While
SPARQL does not need a very elaborate freetext matching system, it needs
something, and much of it is already there; it is mostly just a matter of
naming a function or predicate normatively.

Kind regards 

Kjetil Kjernsmo
-- 
Senior Knowledge Engineer
Mobile: +47 986 48 234
Email: kjetil.kjernsmo@computas.com   
Web: http://www.computas.com/

|  SHARE YOUR KNOWLEDGE  |

Computas AS  PO Box 482, N-1327 Lysaker | Phone:+47 6783 1000 | Fax:+47 6783 1001

Received on Monday, 4 May 2009 11:05:09 UTC