- From: Richard Newman <rnewman@franz.com>
- Date: Sat, 16 Aug 2008 14:26:32 -0700
- To: Lee Feigenbaum <lee@thefigtrees.net>
- Cc: public-sparql-dev@w3.org
Hi Lee,

Answering from AllegroGraph's perspective:

> 1) What is the search syntax of these predicates? For example, the
> object of Glitter's textmatch is a Lucene search string. I think
> (but am not sure) that ARQ is the same, and I'm not sure about the
> others.

AllegroGraph actually has two predicates: match and matchExpression.

The former uses a simple grammar[1]; basically strings with escaping,
phrases, and wildcards. As far as I can tell this is very similar to
Lucene's syntax, but I don't think we support all of the same features
(such as proximity or boosting).

The latter is first parsed as a Lisp expression (containing simple
matches) which is interpreted by the free-text system, allowing you to
do fun things like

    ?x fti:matchExpression '(or "RDFS" "OWL")' .

This allows SPARQL queries to do everything you can do from Lisp.

> 2) Do we have any hope of reconciling these to promote more
> interoperable queries of this sort? At the least, are implementors
> willing to support all 4 of these predicates (and perhaps others)
> interchangeably?

I would be happy to support others... but just like with RDF
vocabulary terms, there has to be sufficient overlap of meaning.
(E.g., if Glitter's match predicate supported fuzzy searches,
AllegroGraph would be lying if it simply accepted oa:textmatch as if
it were fti:match.)

> 3) Is there any value in coining an "implementation-independent" URI
> for textsearch and adding that to existing implementations?

There would be value if it were defined as the lowest-common-denominator:
probably sacrificing some extensions for widespread adoption. I haven't
done the work to find out, but if there is a large amount of overlap
between all of our implementations, I'd definitely support the idea.

(In reality, successfully using this kind of extension requires some
knowledge of the implementation, so I don't really expect users to be
running extended queries on multiple implementations without changes.
That's not to say that standardizing the name is a bad thing.)

> 4) Do existing implementations compile simple invocations of the
> SPARQL regex filter function into uses of text-search indexes?

AllegroGraph does not. We do compile regexes, and do static analysis
on them. The free-text index, though, is separate from the triples
themselves, and not applicable to every literal in the store -- users
choose which predicates to include in the index (when you've got
several gigs of strings in your store, you probably don't want them
all indexed!).

It would be possible to analyze a regular expression, a query body, and
the contents of the text index to decide whether to use a filter or
the free-text index, but we haven't yet been motivated to do so.

> Is regex(...) the best way to interoperably _and_ efficiently
> perform SPARQL text match queries? (This has come to light in the
> recent Berlin benchmark SPARQL queries.)

REGEX is probably not the most efficient way to do text match queries,
though it's the only interoperable way. Our customers tend to use
fti:match with an additional REGEX if necessary -- they have control
over which predicates they index, and can get dramatic speedups, so
they take the interoperability hit.

> From my point of view as an implementor, I'd be happy to support
> other predicates and/or an agreed upon implementation-neutral
> predicate in Glitter, though I'd want to be clear on the syntax of
> the search string itself.

I concur.

> Glitter doesn't currently compile regex(...) into anzo:textmatch,
> but I've been intending to add that support in the light of the
> Berlin query benchmark suite.

Heh, benchmark-driven development :D

I agree that there is a need for this kind of thing. One area I've
been pondering is more complicated use of text-indexing; for example

    ?x fti:match "foo"

doesn't tell you which predicate actually included the value (or,
indeed, the original object value).
That's fine for a lot of use cases, but not all -- I've seen customers
doing something like this:

    ?x fti:match "foo";       # Quickly find relevant ?xs
       my:predicate ?o .      # Then try to find the right triple.
    FILTER (regex(?o, "foo"))

to extract more information.

I have some syntax ideas (and possibly even working code; this was
some time ago) to solve this -- I don't think this is a natural fit
for computed property syntax. Something along the lines of

    TEXTINDEX "foo" { ?x my:predicate ?o }

allows specification of every part of the triple, and also permits
more syntax (e.g., expression trees in place of the literal without
all the escaping).

-R

[1] <http://agraph.franz.com/support/documentation/current/reference-guide.html#ref-freetext-query-grammar>
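
Illustrative note: the index-then-filter pattern described in the message (a free-text match to find candidate subjects quickly, then a regex over one predicate to find the right triple) can be sketched with a toy in-memory model. This is only a sketch under stated assumptions, not AllegroGraph's implementation: the triple data, the word-level index, and the helper names `build_text_index` and `match_then_filter` are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, literal object) triples.
TRIPLES = [
    ("ex:a", "my:predicate", "foo bar"),
    ("ex:a", "my:other", "unrelated text"),
    ("ex:b", "my:predicate", "nothing here"),
    ("ex:b", "my:other", "some foo text"),
    ("ex:c", "my:predicate", "food for thought"),
]


def build_text_index(triples, indexed_predicates):
    """Map each lowercase word to the subjects whose indexed literals contain it.

    Like the index described in the message, this records only the subject,
    not which predicate or literal produced the hit.
    """
    index = defaultdict(set)
    for s, p, o in triples:
        if p in indexed_predicates:
            for word in re.findall(r"\w+", o.lower()):
                index[word].add(s)
    return index


def match_then_filter(triples, index, word, predicate, pattern):
    """Stage 1: index lookup for candidate subjects.
    Stage 2: regex over the literals of one predicate for those subjects."""
    candidates = index.get(word.lower(), set())
    rx = re.compile(pattern)
    return sorted(
        (s, o)
        for s, p, o in triples
        if s in candidates and p == predicate and rx.search(o)
    )


INDEX = build_text_index(TRIPLES, {"my:predicate", "my:other"})
RESULT = match_then_filter(TRIPLES, INDEX, "foo", "my:predicate", "foo")
print(RESULT)
```

In this toy data, ex:b is an index hit only through my:other, so the regex stage on my:predicate drops it. That is the situation the message describes, where the text match alone does not say which predicate held the value.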
Received on Saturday, 16 August 2008 21:27:17 UTC