Re: make regex optional in SPARQL from Jan Wielemaker on 2006-10-23 (public-rdf-dawg-comments@w3.org from October 2006)

From: Jan Wielemaker <wielemak@science.uva.nl>
Date: Mon, 23 Oct 2006 14:46:13 +0200
To: public-rdf-dawg-comments@w3.org
Message-Id: <200610231446.13707.wielemak@science.uva.nl>
On Monday 23 October 2006 13:25, Henry Story wrote:
> I am a great SPARQL enthusiast [1], but have become concerned about
> the regex functionality in the specification. Having worked at
> AltaVista I have developed a feeling for volume and complexity. When
> one is dealing with 100 million searches a day or more every cpu
> instruction counts.
>
> My feeling is that regex is much too powerful for any large database
> to allow it to be a mandatory part of the spec. (I may very well be
> wrong, if so please point me to a study that shows this not to be the
> case). Any such database will just not be able to implement such a
> query language. Regex therfore must be optional.
>
> If possible it would be good to allow in addition for a simpler form
> of string query language that has been proven to work with large
> databases. This would treat words as atom units, allow their
> inclusion or exclusion from a string. Things like this
>
>     sparql -cat "Danny Ayers"
>
> which would find Danny's posts that don't involve cats. AltaVista did
> allow for simple suffix wildcards such as searches on cat* .
>
> I am just posting this because regex raised an alarm flag in my head.
> Perhaps this has been dealt with before.

I agree that supporting literal search allowing plain search and regex
search is a bit strange choice. Regex doesn't scale (unless I missed
something). In many real-life situations regex can also not easily find
the things you are looking for. In the E-culture project
(http://e-culture.multimedian.nl) we currently deal with approx. 8
million triples and 3.5 million literals stored in the SWI-Prolog
semantic web library.  We currently allow searching literals

	* plain
	* case insensitive
	* by prefix (case insensitive)
	* representing numbers >=I, =<I or between(I, J)

This is achieved by keeping all literals in an AVL tree. All literals
that represent a number preceed all other literals and are kept sorted
by numerical value. All non-numerical literals are sorted
alphabetically, disregarding case and diacritics (using Unicode
diacritic removal).

As we only keep one copy of a literal in the AVL tree, we provide hooks
for new and deleted literals that allow maintaining additional indices
for derived values. We use this to provide the same functionality on
tokens (words, numbers), inside literals and repeat the process on
tokens for indexing on stem and/or Metaphone (sounds-like).
Unfortunately, tokenization is generally task dependent, notably in the
interpretation of punctuation characters. Stemming and `sounds-like' are
language dependent.

We combine this using logical expressions, so we can do

	?- rdf_find_literals(and(stem(wood), gt(2000)), Literals).

to find the set of literals containing a word word with the same stem as
`wood' and an integer > 2000.

The provided fast search on literals is critical to our application and
I believe supports a wide range of applications.  Troublesome is the
domain and language dependencies introduced by tokenizing, stemming and
metaphone.  This doesn't have to stop SPARQL from specifying a query
language that allows for such searches.

	--- Jan
Received on Monday, 23 October 2006 12:46:38 UTC