- From: Richard Newman <rnewman@franz.com>
- Date: Sat, 16 Aug 2008 14:26:32 -0700
- To: Lee Feigenbaum <lee@thefigtrees.net>
- Cc: public-sparql-dev@w3.org
Hi Lee,
Answering from AllegroGraph's perspective:
> 1) What is the search syntax of these predicates? For example, the
> object of Glitter's textmatch is a Lucene search string. I think
> (but am not sure) that ARQ is the same, and I'm not sure about the
> others.
AllegroGraph actually has two predicates: match and matchExpression.
The former uses a simple grammar[1]; basically strings with escaping,
phrases, and wildcards. As far as I can tell this is very similar to
Lucene's syntax, but I don't think we support all of the same features
(such as proximity or boosting).
The latter is first parsed as a Lisp expression (containing simple
matches) which is interpreted by the free-text system, allowing you to
do fun things like
    ?x fti:matchExpression '(or "RDFS" "OWL")' .
This allows SPARQL queries to do everything you can do from Lisp.
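To give a rough feel for what evaluating such an expression involves, here is a toy sketch in Python. It is not AllegroGraph's evaluator: the nested-tuple representation, the INDEX contents, the evaluate function, and the "and" operator are all assumptions made for illustration (only "or" appears in the example above).

```python
# Toy inverted index: lowercased word -> subjects whose indexed strings contain it.
# The contents are made up for this example.
INDEX = {
    "rdfs": {"ex:doc1", "ex:doc2"},
    "owl": {"ex:doc2", "ex:doc3"},
}


def evaluate(expr, index=INDEX):
    """Evaluate a bare word or a nested (operator, arg, ...) tuple to a set of subjects."""
    if isinstance(expr, str):
        # A simple match: look the word up directly.
        return set(index.get(expr.lower(), set()))
    op, *args = expr
    results = [evaluate(arg, index) for arg in args]
    if op == "or":
        return set().union(*results)
    if op == "and":  # hypothetical operator, assumed for illustration
        return set.intersection(*results) if results else set()
    raise ValueError(f"unsupported operator: {op!r}")


print(sorted(evaluate(("or", "RDFS", "OWL"))))
# ['ex:doc1', 'ex:doc2', 'ex:doc3']
```

The point is only that the expression is a tree whose leaves are simple matches, so it composes naturally with set operations over index results.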
> 2) Do we have any hope of reconciling these to promote more
> interoperable queries of this sort? At the least, are implementors
> willing to support all 4 of these predicates (and perhaps others)
> interchangeably?
I would be happy to support others... but just like with RDF
vocabulary terms, there has to be sufficient overlap of meaning.
(E.g., if Glitter's match predicate supported fuzzy searches,
AllegroGraph would be lying if it simply accepted oa:textmatch as if
it were fti:match.)
> 3) Is there any value in coining an "implementation-independent" URI
> for textsearch and adding that to existing implementations?
There would be value if it were defined as the lowest-common-
denominator: probably sacrificing some extensions for widespread
adoption. I haven't done the work to find out, but if there is a large
amount of overlap between all of our implementations, I'd definitely
support the idea.
(In reality, successfully using this kind of extension requires some
knowledge of the implementation, so I don't really expect users to be
running extended queries on multiple implementations without changes.
That's not to say that standardizing the name is a bad thing.)
> 4) Do existing implementations compile simple invocations of the
> SPARQL regex filter function into uses of text-search indexes?
AllegroGraph does not. We do compile regexes, and do static analysis
on them. The free-text index, though, is separate from the triples
themselves, and not applicable to every literal in the store -- users
choose which predicates to include in the index (when you've got
several gigs of strings in your store, you probably don't want them
all indexed!). It would be possible to analyze a regex, a
query body, and the contents of the text index to decide whether to
use a filter or the free-text index, but we haven't yet been motivated
to do so.
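A minimal sketch of what that decision might look like, in Python. This is not something AllegroGraph does; INDEXED_PREDICATES, PLAIN_TERM, and choose_strategy are hypothetical names, and the check is deliberately conservative. Even when it answers "index", whether the index can really stand in for the regex depends on how the index tokenizes strings, which this sketch does not model.

```python
import re

# Hypothetical: the predicates a user has chosen to put in the free-text index.
INDEXED_PREDICATES = {"my:predicate"}

# Only patterns that are a single run of letters and digits are treated as
# index-friendly; anything with regex metacharacters falls back to a filter.
PLAIN_TERM = re.compile(r"[A-Za-z0-9]+")


def choose_strategy(pattern, predicate, flags="", indexed=INDEXED_PREDICATES):
    """Return "index" or "filter" for regex(?o, pattern) over ?s predicate ?o."""
    if predicate not in indexed:
        return "filter"  # strings for this predicate were never indexed
    if flags:
        return "filter"  # flags such as "i" change the matching semantics
    if PLAIN_TERM.fullmatch(pattern) is None:
        return "filter"  # metacharacters, whitespace, or an empty pattern
    return "index"


print(choose_strategy("foo", "my:predicate"))  # index
print(choose_strategy("fo+", "my:predicate"))  # filter
print(choose_strategy("foo", "rdfs:label"))    # filter
```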
> Is regex(...) the best way to interoperably _and_ efficiently
> perform SPARQL text match queries? (This has come to light in the
> recent Berlin benchmark SPARQL queries.)
REGEX is probably not the most efficient way to do text match queries,
though it's the only interoperable way. Our customers tend to use
fti:match with an additional REGEX if necessary -- they have control
over which predicates they index, and can get dramatic speedups, so
they take the interoperability hit.
> From my point of view as an implementor, I'd be happy to support
> other predicates and/or an agreed upon implementation-neutral
> predicate in Glitter, though I'd want to be clear on the syntax of
> the search string itself.
I concur.
> Glitter doesn't currently compile regex(...) into anzo:textmatch,
> but I've been intending to add that support in the light of the
> Berlin query benchmark suite.
Heh, benchmark-driven development :D
I agree that there is a need for this kind of thing.
One area I've been pondering is more complicated use of text-indexing;
for example
    ?x fti:match "foo"
doesn't tell you which predicate actually included the value (or,
indeed, the original object value). That's fine for a lot of use
cases, but not all -- I've seen customers doing something like this:
    ?x fti:match "foo" ;        # Quickly find relevant ?xs
       my:predicate ?o .        # Then try to find the right triple.
    FILTER (regex(?o, "foo"))
to extract more information.
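The same two-step pattern, sketched in Python over a toy in-memory store to show why the second step is needed. The triples, the word-to-subject index, and the function names are all made up for illustration; the real index is of course not structured like this.

```python
import re
from collections import defaultdict

# Toy triples: (subject, predicate, object literal). All names are made up.
TRIPLES = [
    ("ex:a", "my:predicate", "foo bar"),
    ("ex:a", "my:other", "unrelated"),
    ("ex:b", "my:other", "more foo here"),
    ("ex:c", "my:predicate", "nothing relevant"),
]


def build_index(triples):
    """Map each lowercased word to the set of subjects that mention it."""
    index = defaultdict(set)
    for s, _p, o in triples:
        for word in re.findall(r"\w+", o.lower()):
            index[word].add(s)
    return index


def find(triples, index, word, predicate):
    """Step 1: index lookup for candidate subjects. Step 2: regex on one predicate."""
    candidates = index.get(word, set())
    pattern = re.compile(re.escape(word))
    return sorted(
        (s, o)
        for s, p, o in triples
        if s in candidates and p == predicate and pattern.search(o)
    )


index = build_index(TRIPLES)
print(find(TRIPLES, index, "foo", "my:predicate"))
# [('ex:a', 'foo bar')]
```

The index lookup returns both ex:a and ex:b, since it only records that some indexed string on the subject matched. The second step is what recovers the predicate and the original object value.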
I have some syntax ideas (and possibly even working code; this was
some time ago) to solve this -- I don't think this is a natural fit
for computed property syntax. Something along the lines of
    TEXTINDEX "foo" {
      ?x my:predicate ?o
    }
allows specification of every part of the triple, and also permits
more syntax (e.g., expression trees in place of the literal without
all the escaping).
-R
[1] <http://agraph.franz.com/support/documentation/current/reference-guide.html#ref-freetext-query-grammar>
Received on Saturday, 16 August 2008 21:27:17 UTC