Re: 'standardizing' on one or more predicates for text search in SPARQL? from Richard Newman on 2008-08-16 (public-sparql-dev@w3.org from July to September 2008)

From: Richard Newman <rnewman@franz.com>
Date: Sat, 16 Aug 2008 14:26:32 -0700
To: Lee Feigenbaum <lee@thefigtrees.net>
Cc: public-sparql-dev@w3.org
Message-Id: <748E276E-8B03-442F-AC54-F830061E3C03@franz.com>

Hi Lee,

Answering from AllegroGraph's perspective:

> 1) What is the search syntax of these predicates? For example, the  
> object of Glitter's textmatch is a Lucene search string. I think  
> (but am not sure) that ARQ is the same, and I'm not sure about the  
> others.

AllegroGraph actually has two predicates: match and matchExpression.

The former uses a simple grammar[1]; basically strings with escaping,  
phrases, and wildcards. As far as I can tell this is very similar to  
Lucene's syntax, but I don't think we support all of the same features  
(such as proximity or boosting).

The latter is first parsed as a Lisp expression (containing simple  
matches) which is interpreted by the free-text system, allowing you to  
do fun things like

   ?x fti:matchExpression '(or "RDFS" "OWL")' .

This allows SPARQL queries to do everything you can do from Lisp.
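
To make the idea of an expression tree over simple matches concrete, here is a minimal Python sketch. It is not AllegroGraph's implementation: the function names (`tokenize`, `parse`, `matches`, `match_expression`) are hypothetical, only `and` and `or` are handled, and terms are assumed to be single words compared case-insensitively.

```python
import re

def tokenize(expr):
    # Split into parentheses, double-quoted strings, and bare symbols.
    return re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', expr)

def parse(tokens):
    # Recursive-descent parse of one expression; consumes tokens from the front.
    tok = tokens.pop(0)
    if tok == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return items
    if tok.startswith('"'):
        return ("term", tok[1:-1])
    return tok.lower()  # operator symbol such as "and" / "or"

def matches(tree, text_words):
    # A term (assumed to be a single word) matches when it occurs in the text.
    if isinstance(tree, tuple):
        return tree[1].lower() in text_words
    op, *args = tree
    if op == "or":
        return any(matches(a, text_words) for a in args)
    if op == "and":
        return all(matches(a, text_words) for a in args)
    raise ValueError(f"unsupported operator: {op}")

def match_expression(expr, text):
    # Toy tokenization: lowercase the text and split it into word characters.
    words = set(re.findall(r"\w+", text.lower()))
    return matches(parse(tokenize(expr)), words)
```

With this sketch, `match_expression('(or "RDFS" "OWL")', "An OWL ontology")` holds because one of the two terms occurs in the text, while the `and` form of the same expression does not.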


> 2) Do we have any hope of reconciling these to promote more  
> interoperable queries of this sort? At the least, are implementors  
> willing to support all 4 of these predicates (and perhaps others)  
> interchangeably?

I would be happy to support others... but just like with RDF  
vocabulary terms, there has to be sufficient overlap of meaning.  
(E.g., if Glitter's match predicate supported fuzzy searches,  
AllegroGraph would be lying if it simply accepted oa:textmatch as if  
it were fti:match.)


> 3) Is there any value in coining an "implementation-independent" URI  
> for textsearch and adding that to existing implementations?

There would be value if it were defined as the lowest common
denominator: probably sacrificing some extensions for widespread  
adoption. I haven't done the work to find out, but if there is a large  
amount of overlap between all of our implementations, I'd definitely  
support the idea.

(In reality, successfully using this kind of extension requires some  
knowledge of the implementation, so I don't really expect users to be  
running extended queries on multiple implementations without changes.  
That's not to say that standardizing the name is a bad thing.)


> 4) Do existing implementations compile simple invocations of the  
> SPARQL regex filter function into uses of text-search indexes?

AllegroGraph does not. We do compile regexes, and do static analysis  
on them. The free-text index, though, is separate from the triples  
themselves, and not applicable to every literal in the store -- users  
choose which predicates to include in the index (when you've got  
several gigs of strings in your store, you probably don't want them  
all indexed!). It would be possible to analyze a regex, a  
query body, and the contents of the text index to decide whether to  
use a filter or the free-text index, but we haven't yet been motivated  
to do so.
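
As a rough illustration of the kind of analysis described above, the following Python sketch decides whether a regex is even a candidate for an index lookup. Everything here is an assumption for illustration: the function name `index_candidate` is hypothetical, only the whole-word shape `\bword\b` is recognized, and the index is assumed to tokenize on the same word boundaries. A plain pattern such as `foo` is rejected because it would also match inside `foobar`, which a word-based index would not find.

```python
import re

def index_candidate(pattern, predicate, indexed_predicates):
    """Return the word to look up in the text index, or None to use a FILTER only."""
    if predicate not in indexed_predicates:
        return None  # literals for this predicate were never indexed
    # Recognize only the simplest shape: a whole-word match such as \bfoo\b.
    m = re.fullmatch(r"\\b(\w+)\\b", pattern)
    return m.group(1) if m else None
```

Even when a word is returned, the regex would still need to run over the candidates, since the index lookup and the regex may differ on details such as case sensitivity.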


> Is regex(...) the best way to interoperably _and_ efficiently  
> perform SPARQL text match queries? (This has come to light in the  
> recent Berlin benchmark SPARQL queries.)

REGEX is probably not the most efficient way to do text match queries,  
though it's the only interoperable way. Our customers tend to use  
fti:match with an additional REGEX if necessary -- they have control  
over which predicates they index, and can get dramatic speedups, so  
they take the interoperability hit.


> From my point of view as an implementor, I'd be happy to support  
> other predicates and/or an agreed upon implementation-neutral  
> predicate in Glitter, though I'd want to be clear on the syntax of  
> the search string itself.

I concur.


> Glitter doesn't currently compile regex(...) into anzo:textmatch,  
> but I've been intending to add that support in the light of the  
> Berlin query benchmark suite.

Heh, benchmark-driven development :D

I agree that there is a need for this kind of thing.

One area I've been pondering is more complicated use of text-indexing;  
for example

   ?x fti:match "foo"

doesn't tell you which predicate actually included the value (or,  
indeed, the original object value). That's fine for a lot of use  
cases, but not all -- I've seen customers doing something like this:

   ?x fti:match "foo";          # Quickly find relevant ?xs
      my:predicate ?o .         # Then try to find the right triple.
   FILTER (regex(?o, "foo"))

to extract more information.
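
The two-step pattern in that query can be mimicked with a toy in-memory index. This is only a Python sketch under stated assumptions: `build_index` and `search` are hypothetical names, the index maps lowercase words to subjects (so, like `fti:match`, it does not record which predicate matched), and triples are plain tuples of strings.

```python
import re
from collections import defaultdict

def build_index(triples, indexed_predicates):
    # Map each lowercase word to the subjects whose indexed literals contain it.
    index = defaultdict(set)
    for s, p, o in triples:
        if p in indexed_predicates:
            for word in re.findall(r"\w+", o.lower()):
                index[word].add(s)
    return index

def search(triples, index, word, predicate, pattern):
    # Step 1: cheap index lookup to find candidate subjects.
    candidates = index.get(word.lower(), set())
    rx = re.compile(pattern)
    # Step 2: join on the predicate and re-check the literal with the regex.
    return [(s, o) for s, p, o in triples
            if s in candidates and p == predicate and rx.search(o)]

triples = [
    ("ex:a", "my:predicate", "foo bar"),
    ("ex:a", "my:other", "nothing"),
    ("ex:b", "my:other", "foo here"),
    ("ex:c", "my:predicate", "baz"),
]
index = build_index(triples, {"my:predicate", "my:other"})
hits = search(triples, index, "foo", "my:predicate", "foo")
```

Here `ex:b` is a candidate from the index (its `my:other` literal contains the word) but drops out in the second step, which is exactly the extra work the query above has to do.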

I have some syntax ideas (and possibly even working code; this was  
some time ago) to solve this -- I don't think this is a natural fit  
for computed property syntax. Something along the lines of

   TEXTINDEX "foo" {
     ?x my:predicate ?o
   }

allows specification of every part of the triple, and also permits  
more syntax (e.g., expression trees in place of the literal without  
all the escaping).

-R

[1] <http://agraph.franz.com/support/documentation/current/reference-guide.html#ref-freetext-query-grammar>
Received on Saturday, 16 August 2008 21:27:17 UTC