Re: More fulltext advocacy (was Re: Lee's feature proposal) from Kjetil Kjernsmo on 2009-05-04 (public-rdf-dawg@w3.org from April to June 2009)

From: Kjetil Kjernsmo <Kjetil.Kjernsmo@computas.com>
Date: Mon, 4 May 2009 18:31:22 +0200
To: public-rdf-dawg@w3.org
Message-Id: <200905041831.22626.Kjetil.Kjernsmo@computas.com>

John,

Thank you very much for your support!

On Monday 04 May 2009 16:23:12 Clark, John wrote:
> I agree, and I think it's a useful exercise to try to standardize "general
> text search", perhaps even for consumption by technologies other than
> SPARQL.

Possibly, but I care first and foremost about SPARQL :-) If anybody else has 
any use for it, I'd say fine.

> > All we have used so far can be summarised as follows:
> > 1) Terms shorter than three characters are ignored.
>
> So, with this feature, query string "Amazon S3" would be equivalent to
> "Amazon" and query string "theorems about ?" would be equivalent to
> "theorems about", correct?  This makes me uneasy.

Yeah, it has some drawbacks, clearly. I think it is mostly a practical matter: 
as far as I know, this restriction exists in LARQ, Virtuoso and MySQL, to name 
a few I've worked with. It is painful at times, but I guess that it is simply 
too time-consuming to create an index that will match any two-letter 
combination?

> > 2) a single term is matched exactly against a whole word.
> > 3) a single term ending in an asterisk is matched against words beginning
> > with the term.
> > 4) multiple terms with AND match all words in any order.
> > 5) multiple terms with OR match any words in any order.
> > 6) multiple terms without an operator match all words in the given
> > order.
> >
> > At some point, we had phrase search too, which is a nice feature but I
> > think we dropped it.
>
> I think this is a reasonable set, but I'd also like to approach it slightly
> differently and try to standardize what already exists (and thus is
> reasonably "well understood" by users).

Thank you! 
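
To make the quoted rules 1) to 6) concrete, here is a rough Python sketch of 
how they could behave. It is only an illustration, not what any of our 
implementations actually do. The function names (words, term_matches, parse, 
matches) are made up for this example, and several details are my own 
assumptions: matching is case-insensitive, words are runs of letters and 
digits, the length limit is measured without the trailing asterisk, "in the 
given order" does not require the words to be adjacent, and mixing AND and OR 
in one query is not handled.

```python
import re

MIN_TERM_LENGTH = 3  # rule 1: shorter terms are ignored


def words(text):
    """Split text into lower-cased words (assumption: letters/digits only)."""
    return re.findall(r"\w+", text.lower())


def term_matches(term, word):
    """Rule 3: a trailing '*' means prefix match; rule 2: otherwise exact."""
    if term.endswith("*"):
        return word.startswith(term[:-1])
    return word == term


def parse(query):
    """Return (operator, terms) with too-short terms dropped (rule 1)."""
    tokens = query.split()
    operators = {t for t in tokens if t in ("AND", "OR")}
    if len(operators) > 1:
        raise ValueError("mixing AND and OR is not covered by this sketch")
    operator = operators.pop() if operators else None
    terms = [t.lower() for t in tokens if t not in ("AND", "OR")]
    # Assumption: the length is measured without the trailing asterisk.
    terms = [t for t in terms if len(t.rstrip("*")) >= MIN_TERM_LENGTH]
    return operator, terms


def matches(query, text):
    """Apply rules 1 to 6 to decide whether the query matches the text."""
    operator, terms = parse(query)
    if not terms:
        return False  # assumption: no usable terms means no match
    doc = words(text)
    if operator == "OR":   # rule 5: any term, any order
        return any(any(term_matches(t, w) for w in doc) for t in terms)
    if operator == "AND":  # rule 4: all terms, any order
        return all(any(term_matches(t, w) for w in doc) for t in terms)
    # Rule 6 (and a lone term): all terms, in the given order,
    # not necessarily adjacent (assumption).
    position = 0
    for t in terms:
        for i in range(position, len(doc)):
            if term_matches(t, doc[i]):
                position = i + 1
                break
        else:
            return False
    return True
```

With this sketch, parse("Amazon S3") keeps only the term "amazon", which is 
exactly the behaviour John is uneasy about, and matches("primes theorems", 
"Two theorems about primes") is false because the words occur in the other 
order, while the same query with AND between the terms matches.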

> That is, I'd suggest standardizing 
> generalized text search as "what Google does", 

Well, some of "what Google does" could be 
http://www.google.com/support/websearch/bin/answer.py?hl=en&answer=136861
and indeed, I think some of that is quite reasonable, but I don't know if it 
is right for us.

> including phrase search with 
> quotes, term negation, and query extensions with syntax like "loc:
> cleveland, ohio" (e.g. in Google maps).

Hmmm, I think we might end up standardising a bit too much of CQL (which is 
quite nice and a nice complement to SPARQL in many situations):
http://www.loc.gov/standards/sru/specs/cql.html
Also, I don't think loc: would belong in the object, since that is a predicate 
for us, and I feel that such specific things belong in an application layer 
that translates to SPARQL. Also, with property paths, we might be able to say 
stuff like "geo:location or any subproperties". 

Anyway, I hope we can discuss this a bit further on Wednesday. My agenda here 
is to constrain the feature so that it is useful, yet will not take a lot of 
WG time or a lot of time for implementers.

Kind regards 

Kjetil Kjernsmo
-- 
Senior Knowledge Engineer
Mobile: +47 986 48 234
Email: kjetil.kjernsmo@computas.com   
Web: http://www.computas.com/

Computas AS  PO Box 482, N-1327 Lysaker | Phone:+47 6783 1000 | Fax:+47 6783 1001

Received on Monday, 4 May 2009 16:31:51 UTC