- From: Orri Erling <erling@xs4all.nl>
- Date: Tue, 5 May 2009 00:30:09 +0200
- To: "'Seaborne, Andy'" <andy.seaborne@hp.com>, "'Axel Polleres'" <axel.polleres@deri.org>
- Cc: <public-rdf-dawg@w3.org>
Hi,

Standardizing full text is, I fear, next to impossible in the context of this WG. Of course, I have said and blogged elsewhere that having no text search box is a non-starter, and I stand by this. We of course implement text search. But looking at the efforts to standardize text search for SQL and XQuery, the thing is non-trivial. See how much time the XQuery/XPATH folks or the SQL MM committee put into it. I do not have the figure, but if somebody does, it gives a baseline. I'd say the time used was not insignificant.

The proposals so far seem rather ad hoc. For example, text search is generally expected to produce a hit score. The ability to produce this often depends on word proximities. So in practice, for modern text indices, one may take a phrase search feature almost for granted, since word positions will de facto have to be in the index. Also, string length limits etc. will not work well with ideograms, and there will be other language issues. Best not get started; just look at the XPATH full text work, since it even comes from a neighbor group.

The SQL MM and XPATH full text specs are quite complex, but they have the merit of being well considered. However, the RDBMSs out there each do their own text search, not the SQL MM spec. I have the feeling that the XPATH and SQL MM specs are conceptually neater than what is in the DBMSs but also harder to implement, and for most use cases the standards do not offer a decisive advantage. Maybe this is why adoption is lacking. I could be out of date, though.

So, I would consider it possible to standardize, in pragmas and end point descriptions, that an implementation could declare that for full text it is

1. compatible with some other implementation
2. compatible with XPATH full text.

Specifying a full text match language all our own seems both difficult and needless. But since full text in itself is rather vital, we might specify a means for an implementation to declare that it has full text support.

We might go as far as to say that a pattern like:

    ?xx sparql:fulltext ?text_exp ?score [option, ....].

would mean that ?xx is bound to a literal which ?text_exp matches, binding the implementation dependent match score to ?score. [option] would be implementation dependent. The syntax of ?text_exp would be implementation dependent.

There would be an asymmetry in requiring ?text_exp to be either a literal or an outside binding, i.e. a parameter. Or one might say that if ?text_exp is not a parameter and not a literal, then it must have a binding from something other than another full text match: one cannot join based on two things sharing an unspecified text pattern, since all things would join because a single wildcard matches anything.

Then an implementation could declare that it has this triple pattern syntax and is compatible with whatever widely known SPARQL or SQL syntax for the text expression, or that it is compatible with some standard proposal like XPATH or SQL MM. I guess the latter would be a rarity, though.

So, technically, full text can be specified by referring to existing specs, while making a new spec is a world of trouble.

If XPATH's full text language did not look so different from the SQL ones, we would have implemented it. Now we offer what we offer for SQL, which is about the same as what MS SQL Server does. If the WG committed to standardizing full text in SPARQL according to XPATH, we would implement it. But without this WG decision, we would stay with what we have, which is familiar to SQL people and easy to process from a text search box.

So, yes to being able to advertise capabilities, and yes to a syntax for expressing a full text match and score, but the text pattern and the semantics of matching should be defined by referring to existing work.

Orri

Received on Monday, 4 May 2009 22:31:09 UTC