RE: More fulltext advocacy (was Re: Lee's feature proposal)

Kjetil wrote:
> I think it would be very unfortunate to not have any standardised fulltext 
> capability in SPARQL, as it signals that "if you have a search box on your 
> site that is used extensively by your users, then SPARQL is not suitable for 
> you". Even if there are extensions that do free text, this is a message 
> that I for one, would be very concerned about as it is most of the current 
> web.

I agree, and I think it's a useful exercise to try to standardize "general text search", perhaps even for consumption by technologies other than SPARQL.

> We sometimes match strings with regular expressions, but never with exact 
> string match. Regular expressions are far too flexible to be useful in many 
> contexts.

Indeed, regular expressions are too powerful for this type of feature.  They require users to learn a new skill, which prevents them from being useful immediately, and in many cases from being used at all.  In addition, regular expressions have a strict syntax; that is, they accept only a subset of all possible strings.  Instead, I think it must be a design feature of general text search that any search string is acceptable.  Finally, I think a more restricted set of features would allow for a more efficient indexing implementation.
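
To illustrate the "any search string is acceptable" point, here is a minimal Python sketch. It assumes Python's `re` module as a stand-in for a regular expression engine, and `search_terms` is a hypothetical helper, not part of any proposal. Some plausible user input is rejected as a regular expression, while a plain term splitter accepts everything.

```python
import re


def accepts_as_regex(s):
    """Return True if s is a syntactically valid Python regular expression."""
    try:
        re.compile(s)
        return True
    except re.error:
        return False


def search_terms(s):
    """Hypothetical general-text-search tokenizer: every string is accepted."""
    # Splitting on whitespace never fails; the worst case is an empty list.
    return s.lower().split()


for user_input in ["theorems (draft", "50% off [today"]:
    print(user_input, accepts_as_regex(user_input), search_terms(user_input))
```

Both inputs fail to compile as regular expressions (an unclosed group and an unclosed character class), yet each yields a usable list of terms.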

> All we have used so far can be summarised as follows:
> 1) Terms shorter than three characters are ignored.

So, with this feature, query string "Amazon S3" would be equivalent to "Amazon" and query string "theorems about ?" would be equivalent to "theorems about", correct?  This makes me uneasy.
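
A small Python sketch of that concern, assuming terms are split on whitespace and that "shorter than three characters" is measured on the raw term (both are my assumptions; `significant_terms` is a hypothetical name):

```python
MIN_TERM_LENGTH = 3  # rule 1: terms shorter than this are ignored


def significant_terms(query):
    """Split on whitespace and drop terms shorter than MIN_TERM_LENGTH."""
    return [term for term in query.split() if len(term) >= MIN_TERM_LENGTH]


# Under this reading, the short terms simply vanish from the query.
print(significant_terms("Amazon S3") == significant_terms("Amazon"))
print(significant_terms("theorems about ?") == significant_terms("theorems about"))
```

Under these assumptions both comparisons hold, so a user searching for "Amazon S3" would get the same results as one searching for "Amazon".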

> 2) a single term is matched exactly against a whole word.
> 3) a single term ending in asterisk is matched against words beginning with 
> the term.
> 4) multiple terms with AND match all words in any order.
> 5) multiple terms with OR match any words in any order.
> 6) multiple terms without an operator match all words in the given order.
> 
> At some point, we had phrase search too, which is a nice feature but I think 
> we dropped it.
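
One possible reading of rules 1 to 6, sketched in Python. Several details are assumptions on my part rather than anything stated in the list: matching is case-insensitive, words are runs of `\w` characters, the asterisk is not counted toward the three-character minimum, a query mixing AND and OR is treated as AND, and "in the given order" means an ordered subsequence rather than adjacent words. The function names are hypothetical.

```python
import re

MIN_TERM_LENGTH = 3  # rule 1


def words(text):
    """Lower-cased words of the text being searched."""
    return re.findall(r"\w+", text.lower())


def term_matches(term, word):
    # Rule 3: a trailing asterisk matches words beginning with the term.
    if term.endswith("*"):
        return word.startswith(term[:-1])
    # Rule 2: otherwise the term must equal the whole word.
    return word == term


def matches(query, text):
    doc = words(text)
    tokens = query.split()
    # Assumption: AND wins if both operators appear in one query.
    mode = "and" if "AND" in tokens else "or" if "OR" in tokens else "ordered"
    terms = [t.lower() for t in tokens if t not in ("AND", "OR")]
    # Rule 1: drop short terms (asterisk not counted, by assumption).
    terms = [t for t in terms if len(t.rstrip("*")) >= MIN_TERM_LENGTH]
    if not terms:
        return False
    if mode == "and":  # rule 4
        return all(any(term_matches(t, w) for w in doc) for t in terms)
    if mode == "or":  # rule 5
        return any(any(term_matches(t, w) for w in doc) for t in terms)
    # Rule 6 (also covers a single term): terms must appear in order.
    pos = 0
    for t in terms:
        while pos < len(doc) and not term_matches(t, doc[pos]):
            pos += 1
        if pos == len(doc):
            return False
        pos += 1
    return True
```

For example, against the text "A fulltext search box", the query `full* search` matches under this reading, while `search fulltext` does not unless it is written as `search AND fulltext`.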

I think this is a reasonable set, but I'd also like to approach it slightly differently and try to standardize what already exists (and thus is reasonably "well understood" by users).  That is, I'd suggest standardizing generalized text search as "what Google does", including phrase search with quotes, term negation, and query extensions with syntax like "loc: cleveland, ohio" (e.g. in Google Maps).
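
As a rough sketch of what parsing such a query could look like, here is a small Python example. It is not a description of Google's actual behavior: the leading `-` for negation, the rule that an extension key such as `loc:` claims the rest of the query, and the name `parse_query` are all assumptions made for illustration. Note that it never rejects a search string; an unbalanced quote just becomes part of an ordinary term.

```python
import re

# A double-quoted phrase, or else any run of non-whitespace characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse_query(query):
    """Split a free-form query into terms, phrases, exclusions, extensions."""
    parsed = {"terms": [], "phrases": [], "excluded": [], "extensions": {}}
    for m in TOKEN.finditer(query):
        phrase, word = m.group(1), m.group(2)
        if phrase is not None:
            if phrase.strip():  # ignore empty "" phrases
                parsed["phrases"].append(phrase)
        elif re.fullmatch(r"\w+:", word):
            # Assumption: an extension key takes the rest of the query.
            parsed["extensions"][word[:-1]] = query[m.end():].strip()
            break
        elif word.startswith("-") and len(word) > 1:
            parsed["excluded"].append(word[1:])
        else:
            parsed["terms"].append(word)
    return parsed


print(parse_query('"text search" sparql -regex'))
print(parse_query("pizza loc: cleveland, ohio"))
```

The first query yields one phrase, one plain term, and one excluded term; the second yields the term "pizza" plus a `loc` extension with the value "cleveland, ohio".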

Take care,

    John L. Clark

Received on Monday, 4 May 2009 14:24:04 UTC