RE: 'standardizing' on one or more predicates for text search in SPARQL? from Seaborne, Andy on 2008-08-17 (public-sparql-dev@w3.org from July to September 2008)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Sun, 17 Aug 2008 16:16:53 +0000
To: Lee Feigenbaum <lee@thefigtrees.net>, "public-sparql-dev@w3.org" <public-sparql-dev@w3.org>
Message-ID: <B6CF1054FDC8B845BF93A6645D19BEA34B056C4948@GVW1118EXC.americas.hpqcorp.net>


> -----Original Message-----
> From: public-sparql-dev-request@w3.org [mailto:public-sparql-dev-
> request@w3.org] On Behalf Of Lee Feigenbaum
> Sent: 16 August 2008 21:50
> To: public-sparql-dev@w3.org
> Subject: 'standardizing' on one or more predicates for text search in
> SPARQL?
>
>
> Many SPARQL engines contain support for a magic/computed/functional
> predicate that can be used to relate a literal subject (?o if you will)
> to a text search string.
>
> See http://esw.w3.org/topic/SPARQL/Extensions/Computed_Properties for
> links to some examples.
>
> Right now, different implementations use different predicates. As far as
> I can tell:
>
> ARQ (Jena): http://jena.hpl.hp.com/ARQ/property#
> Virtuoso: bif:contains  (though I can't tell what prefix bif:
> corresponds to)
> Glitter (Open Anzo): http://openanzo.org/predicates/textmatch
> AllegroGraph: http://franz.com/ns/allegrograph/2.2/textindex/match

ARQ didn't invent property functions.  cwm came first: it has a predicate mechanism which it uses (amongst other things) for regexes, with "string:matches".

> A couple of questions:
>
> 1) What is the search syntax of these predicates? For example, the
> object of Glitter's textmatch is a Lucene search string. I think (but am
> not sure) that ARQ is the same, and I'm not sure about the others.

ARQ documentation: http://jena.sourceforge.net/ARQ/lucene-arq.html

ARQ uses Lucene for all the real work: Lucene provides the free text index, and the search syntax is the Lucene query language (AND, OR, proximity, fuzzy match).

http://lucene.apache.org/java/2_3_2/queryparsersyntax.html


The simple form is:
    ?lit pf:textMatch '+text' .

The search string is in Lucene syntax (and is passed unchanged to Lucene).
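For context, a complete query around the simple form might look like the sketch below. It assumes that pf: is bound to the ARQ property function namespace quoted earlier in this thread and that a Lucene index has already been built and made available to ARQ. The variables ?doc and ?p are illustrative only.

```sparql
# Assumption: pf: is the ARQ property function namespace quoted above.
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>

SELECT ?doc ?lit
WHERE {
    # Generate literals from the Lucene index; '+text' goes to Lucene unchanged.
    ?lit pf:textMatch '+text' .
    # Join back to the RDF data to find the resources that carry that literal.
    ?doc ?p ?lit .
}
```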

The simple form can also be used to find documents whose content matches the search:
    ?uri pf:textMatch '+text' .

because it is not fixed that the Lucene index contains the associated literal.  It could instead map the text to the URI of the document causing the match.

A constant (or already bound) subject simply requires an exact match to the index value.  So the index can be used restrictively (to test a known value) or generatively (to produce bindings).

The most complex form uses RDF lists for arguments:
  # Limit to scores of at least 0.5 and to at most 100 hits (object slot)
  # Return the literals matched and the score (subject slot)
  (?lit ?score ) pf:textMatch ( '+text' 0.5 100 ) .

For free text, the subject slot holds the outputs and the object slot holds the inputs.  Inputs can be variables but must already be bound by the time the call is made.  Fixed outputs are matched for equality.  Because not all functions, in practical terms, work both ways, there have to be additional rules for evaluating property functions within BGPs.
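As a sketch of the input/output split, the list form can be placed in a full query like this. It again assumes the pf: namespace quoted earlier in the thread; ordering by score is an illustrative addition, not something the property function requires.

```sparql
# Assumption: pf: is the ARQ property function namespace quoted above.
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>

SELECT ?lit ?score
WHERE {
    # Object slot (inputs): search string, score limit, hit limit.
    # Subject slot (outputs): the matched literal and its score.
    (?lit ?score) pf:textMatch ( '+text' 0.5 100 ) .
}
ORDER BY DESC(?score)
```

Here all inputs are constants, so the "already bound" rule is trivially satisfied. If an input were a variable, the pattern that binds it would need to be evaluated before the pf:textMatch call.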

> 2) Do we have any hope of reconciling these to promote more
> interoperable queries of this sort? At the least, are implementors
> willing to support all 4 of these predicates (and perhaps others)
> interchangeably?

Yes, although I'd rather implement one commonly agreed form than 4 (slightly different) ones.

>
> 3) Is there any value in coining an "implementation-independent" URI for
> textsearch and adding that to existing implementations?

Yes - it would be valuable to have a common form that covers the basic cases and is independent of implementation technology.

It's probably more valuable for it to cover the simple(r) case, to increase the number of implementations.  So more complex argument forms, and more complex text searches, shouldn't be mandatory.

Ditto for the text matching language.  A core, more widely available form is more useful than a complex language that is less widely provided.

> 4) Do existing implementations compile simple invocations of the SPARQL
> regex filter function into uses of text-search indexes? Is regex(...)
> the best way to interoperably _and_ efficiently perform SPARQL text
> match queries? (This has come to light in the recent Berlin benchmark
> SPARQL queries.)

ARQ does not equate the two.  A regex is a yes/no exact match; free text is a best match (hence scores and needing to limit the number of hits).  That makes defining the right answers for an agreed predicate rather hard - there are tradeoffs in the free text engine to be made.

There are lots of things that can be done to speed up regex, and we have a standard regex language from XSD, but it's not free text searching.
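For contrast with the pf:textMatch forms shown earlier in this message, a regex-based query might be sketched as follows. The pattern "text" and the "i" (case-insensitive) flag are illustrative choices. Each candidate literal either passes the filter or does not; there is no score to rank by and no hit limit to apply.

```sparql
SELECT ?s ?lit
WHERE {
    # Every candidate literal is produced by ordinary pattern matching...
    ?s ?p ?lit .
    # ...and then tested: a yes/no answer per literal, with no ranking.
    FILTER regex(str(?lit), "text", "i")
}
```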

>  From my point of view as an implementor, I'd be happy to support other
> predicates and/or an agreed upon implementation-neutral predicate in
> Glitter, though I'd want to be clear on the syntax of the search string
> itself. Glitter doesn't currently compile regex(...) into
> anzo:textmatch, but I've been intending to add that support in the light
> of the Berlin query benchmark suite.
>
> Lee

I'd like to see something that can be widely supported.  Let's also recognize that there is a cost - a split between "large" and "small" implementations would be a bad thing.

Hope that helps,

        Andy
Received on Sunday, 17 August 2008 16:19:53 UTC