RE: More fulltext advocacy (was Re: Lee's feature proposal) from Orri Erling on 2009-05-04 (public-rdf-dawg@w3.org from April to June 2009)

From: Orri Erling <erling@xs4all.nl>
Date: Tue, 5 May 2009 00:30:09 +0200
To: "'Seaborne, Andy'" <andy.seaborne@hp.com>, "'Axel Polleres'" <axel.polleres@deri.org>
Cc: <public-rdf-dawg@w3.org>
Message-Id: <200905042230.n44MUUBd015213@smtp-vbr1.xs4all.nl>

Hi


Standardizing full text is, I fear, next to impossible in the context of
this WG.  Of course, I have said and blogged elsewhere that having no text
search box is a non-starter, and I stand by this.  We of course implement
text search.  But looking at the efforts to standardize text search for SQL
and XQuery, the thing is non-trivial.  See how much time the XQuery/XPATH
folks or the SQL MM committee put into it.  I do not have the figure, but if
somebody does, it gives a baseline.  I'd say the time used was not
insignificant.

The proposals so far seem rather ad hoc.  For example, text search is
generally expected to produce a hit score.  The ability to produce this
often depends on word proximities.  So in practice, for modern text indices,
one may take a phrase search feature almost for granted, since word
positions will de facto have to be in the index.  Also, string length limits
etc. will not work well with ideograms, and there will be other language
issues.  Best not to get started; just look at the XPATH full text work,
since it even comes from a neighbor group.


The SQL MM and XPATH full text specs are quite complex, but they have the
merit of being well considered.  However, the RDBMSs out there each do their
own text search, not the SQL MM spec.  I have the feeling that the XPATH and
SQL MM specs are conceptually neater than what is in the DBMSs but also
harder to implement, and for most use cases the standards do not offer a
decisive advantage.  Maybe this is why adoption is lacking.  I could be out
of date, though.

So, I would consider it possible to standardize, in pragmas and endpoint
descriptions, a way for an implementation to declare that its full text
support is 1. compatible with some other implementation or 2. compatible
with XPATH full text.

Specifying a full text match language all our own seems both difficult and
needless.  But since full text in itself is rather vital, we might specify a
means for an implementation to declare that it has full text support.

We might go as far as to say that a pattern like:

?xx sparql:fulltext ?text_exp ?score [option, ....].

would mean that ?xx is bound to a literal that ?text_exp matches, binding
the implementation dependent match score to ?score.  [option] would be
implementation dependent.  The syntax of ?text_exp would be implementation
dependent.  There would be an asymmetry in requiring ?text_exp to be either
a literal or an outside binding, i.e. a parameter.  Or one might say that if
?text_exp is not a parameter and not a literal, then it must be bound by
something other than another full text match.  One cannot join based on two
things sharing an unspecified text pattern, since all things would join: a
single wildcard matches anything.
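
To make the binding rule concrete, here is a minimal sketch in Python of how
an implementation might check it before evaluating a query.  Everything in it
is an assumption made for illustration: the FullTextPattern class, the
text_exp_is_safe function and the two sets of variable names are hypothetical
and do not come from any proposal or product.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class FullTextPattern:
    """Hypothetical model of: ?xx sparql:fulltext ?text_exp ?score"""
    subject_var: str  # e.g. "?xx"
    text_exp: str     # a quoted literal, or a variable such as "?text_exp"
    score_var: str    # e.g. "?score"


def is_variable(term: str) -> bool:
    # Assumption: variables are written with a leading "?", as in SPARQL.
    return term.startswith("?")


def text_exp_is_safe(pattern: FullTextPattern,
                     parameters: Set[str],
                     bound_by_other_patterns: Set[str]) -> bool:
    """Accept the pattern only if its text expression is a literal, an
    outside parameter, or a variable bound by some pattern that is not
    itself a full text match."""
    if not is_variable(pattern.text_exp):
        return True  # a literal text expression
    if pattern.text_exp in parameters:
        return True  # supplied from outside the query
    return pattern.text_exp in bound_by_other_patterns
```

A pattern whose text expression is a variable with no other binding is
rejected, which is the case where every literal would otherwise join with
every other.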

Then an implementation could declare that it has this triple pattern syntax
and that its text expression is compatible with whatever widely known SPARQL
or SQL syntax, or with some standard proposal like XPATH or SQL MM.  I guess
the latter would be a rarity, though.
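
As a rough illustration of such a declaration, a client could compare the
dialects an endpoint advertises against the ones it can generate.  This is
only a sketch in Python: the ENDPOINT dictionary, its keys, the dialect
labels and the pick_dialect function are all made up for the example, and no
such vocabulary has been defined anywhere.

```python
from typing import Optional, Sequence

# Hypothetical endpoint description; the keys and labels are invented.
ENDPOINT = {
    "fulltext_pattern": True,              # supports the triple pattern syntax
    "fulltext_dialects": ["sql-style"],    # text expression syntaxes accepted
}


def pick_dialect(declared: Sequence[str],
                 understood: Sequence[str]) -> Optional[str]:
    """Return the first dialect, in the client's order of preference,
    that the endpoint also declares, or None if there is no overlap."""
    for dialect in understood:
        if dialect in declared:
            return dialect
    return None
```

A client that can only write XPATH style expressions would get None for the
endpoint above and would know not to send a full text pattern at all.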


So, technically, full text can be specified by referring to existing specs,
while making a new spec is a world of trouble.  If XPATH's full text
language did not look so different from the SQL ones, we would have
implemented it.  Now we offer what we offer for SQL, which is about the same
as what MS SQL Server does.


If the WG committed to standardizing full text in SPARQL according to
XPATH, we would implement it.  But without this WG decision, we would stay
with what we have, which is familiar to SQL people and easy to process from
a text search box.


So, yes to being able to advertise capabilities and yes to a syntax for
expressing a full text match and score, but the text pattern and the
semantics of matching should be defined by referring to existing work.


Orri

Received on Monday, 4 May 2009 22:31:09 UTC