RE: More fulltext advocacy (was Re: Lee's feature proposal)

As concerns full text, I would just add the following:

- Order - For ranking on text hit score, we would have an explicit order by
in the query.  It is true that a text index can be built with the order in
the index itself, but if one has a composite ranking combining the text hit
score and the document page rank or equivalent plus word proximities, one
does end up having to sort the results.  So preserving order is not a big
deal as I see it, since an application will in practice have an order by
clause with limit and offset in the query.

- Filtering - The SQL MM full text feature, for example, is expressed in
terms of filtering; the eventual existence of an index is hardly mentioned
in the spec.  The index is crucial for using the feature, but not for
defining it.

- Syntax - I do not care whether the full text match looks like a triple
pattern or something else.  The important part is that it ought to be able
to bind a score variable and possibly other "offband" variables, for
example for purposes of locating the text hit in the document or for
purposes of fetching other information colocated with the text index.  I
would not expect the standard to mention anything but a score, but it could
have placeholders for other things.

- Symmetry - It seems that joins involving full text matches will in
practice not be commutative: when making an execution plan, a text index
lookup can only go to a place where the text expression is bound.  If the
text match binds other variables, anything depending on these variables can
be evaluated only after the text match.

- Due to the above, full text is not quite surface syntax, though in many
places it will look like it.

- In practice, we do not allow contains in SQL or SPARQL inside an OR.  If
one wishes an OR, one can write it in the text pattern.  The text pattern
language has the connectives and/or/not, plus phrase and proximity.  A
negated contains is also not allowed, although one could do this with a
negated exists subquery.  We have never suffered any inconvenience from
these limitations, but we see that a purist might call these restrictions
arbitrary and ad hoc.
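
As a rough illustration of the Order point above, the following Python
sketch shows why a composite ranking ends up needing an explicit sort even
if the text index delivers hits in text score order.  The Hit record, the
page_rank field and the 0.7/0.3 weights are all assumptions made up for the
example, not anything taken from an actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc: str
    text_score: float  # score as delivered by the text index lookup
    page_rank: float   # hypothetical per-document rank


def composite(hit, w_text=0.7, w_rank=0.3):
    # Hypothetical weighting; any combination of this kind breaks index order.
    return w_text * hit.text_score + w_rank * hit.page_rank


def top_hits(hits, limit, offset=0):
    # Rough equivalent of: ORDER BY composite DESC LIMIT limit OFFSET offset
    ranked = sorted(hits, key=composite, reverse=True)
    return ranked[offset:offset + limit]


# Hits listed in the order a score-ordered text index might deliver them.
hits = [
    Hit("a", 0.9, 0.1),
    Hit("c", 0.8, 0.5),
    Hit("b", 0.5, 0.9),
]

first_page = [h.doc for h in top_hits(hits, limit=2)]
```

Here the index order by text score alone is a, c, b, while the composite
order puts c first, so the order supplied by the index does not survive.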

Thus, if full text is treated as a filter, it must be specified as such, as
XPath and SQL have done.  Then implementations will have to deal with this
inside ORs, NOTs etc., sometimes using a text index and sometimes not.  The
score is the only thing that can be returned.
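
A minimal Python sketch of the filter reading, under the assumption of a
toy contains function that scans the text and returns an occurrence count
as its score.  The function, the sample documents and the scoring are
hypothetical; the point is only that a filter is an ordinary expression, so
it can appear inside OR and NOT, and that without an index it is evaluated
row by row.

```python
def contains(text, word):
    # Toy scan-based match: the score is the occurrence count, 0 means no match.
    return text.lower().split().count(word.lower())


docs = {
    "d1": "full text search in SPARQL",
    "d2": "SQL table valued functions",
    "d3": "text index and text score",
}

# As a filter, the match can sit inside an OR ...
either = [d for d, t in docs.items()
          if contains(t, "sparql") or contains(t, "sql")]

# ... or under a NOT, where no text index lookup helps.
neither = [d for d, t in docs.items() if not contains(t, "text")]

# The score is the only extra value the expression hands back.
scores = {d: contains(t, "text") for d, t in docs.items()
          if contains(t, "text")}
```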

I would prefer text search as a pattern, analogous to a SQL table valued
function or derived table.  This can bind many variables and by its nature
does not occur in expressions.  If one wishes to OR or negate these, one
uses a union or not exists.
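
The pattern reading can be sketched the same way.  In this Python example,
which assumes a toy inverted index and a hypothetical text_match generator,
the match behaves like a table valued function: it produces rows that bind
several variables (doc, score, and a first_pos offband value), OR becomes a
union of two patterns, and negation becomes a not exists test.

```python
from collections import defaultdict


def build_index(docs):
    # Toy inverted index: word -> list of (doc, position) postings.
    index = defaultdict(list)
    for doc, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc, pos))
    return index


def text_match(index, word):
    # Pattern-style match: yields one row of bindings per matching document.
    # The word must already be bound before the lookup can run.
    by_doc = defaultdict(list)
    for doc, pos in index.get(word.lower(), []):
        by_doc[doc].append(pos)
    for doc, positions in by_doc.items():
        yield {"doc": doc, "score": len(positions), "first_pos": positions[0]}


docs = {
    "d1": "full text search in SPARQL",
    "d2": "SQL table valued functions",
    "d3": "text index and text score",
}
index = build_index(docs)

# OR written as a union of two patterns.
union = list(text_match(index, "sparql")) + list(text_match(index, "sql"))

# Negation written as NOT EXISTS over the pattern.
matched = {row["doc"] for row in text_match(index, "text")}
not_exists = [d for d in docs if d not in matched]
```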



Orri

Received on Wednesday, 6 May 2009 06:46:14 UTC