- From: Jan Wielemaker <wielemak@science.uva.nl>
- Date: Mon, 23 Oct 2006 14:46:13 +0200
- To: public-rdf-dawg-comments@w3.org
On Monday 23 October 2006 13:25, Henry Story wrote: > I am a great SPARQL enthusiast [1], but have become concerned about > the regex functionality in the specification. Having worked at > AltaVista I have developed a feeling for volume and complexity. When > one is dealing with 100 million searches a day or more every cpu > instruction counts. > > My feeling is that regex is much too powerful for any large database > to allow it to be a mandatory part of the spec. (I may very well be > wrong, if so please point me to a study that shows this not to be the > case). Any such database will just not be able to implement such a > query language. Regex therfore must be optional. > > If possible it would be good to allow in addition for a simpler form > of string query language that has been proven to work with large > databases. This would treat words as atom units, allow their > inclusion or exclusion from a string. Things like this > > sparql -cat "Danny Ayers" > > which would find Danny's posts that don't involve cats. AltaVista did > allow for simple suffix wildcards such as searches on cat* . > > I am just posting this because regex raised an alarm flag in my head. > Perhaps this has been dealt with before. I agree that supporting literal search allowing plain search and regex search is a bit strange choice. Regex doesn't scale (unless I missed something). In many real-life situations regex can also not easily find the things you are looking for. In the E-culture project (http://e-culture.multimedian.nl) we currently deal with approx. 8 million triples and 3.5 million literals stored in the SWI-Prolog semantic web library. We currently allow searching literals * plain * case insensitive * by prefix (case insensitive) * representing numbers >=I, =<I or between(I, J) This is achieved by keeping all literals in an AVL tree. All literals that represent a number preceed all other literals and are kept sorted by numerical value. All non-numerical literals are sorted alphabetically, disregarding case and diacritics (using Unicode diacritic removal). As we only keep one copy of a literal in the AVL tree, we provide hooks for new and deleted literals that allow maintaining additional indices for derived values. We use this to provide the same functionality on tokens (words, numbers), inside literals and repeat the process on tokens for indexing on stem and/or Metaphone (sounds-like). Unfortunately, tokenization is generally task dependent, notably in the interpretation of punctuation characters. Stemming and `sounds-like' are language dependent. We combine this using logical expressions, so we can do ?- rdf_find_literals(and(stem(wood), gt(2000)), Literals). to find the set of literals containing a word word with the same stem as `wood' and an integer > 2000. The provided fast search on literals is critical to our application and I believe supports a wide range of applications. Troublesome is the domain and language dependencies introduced by tokenizing, stemming and metaphone. This doesn't have to stop SPARQL from specifying a query language that allows for such searches. --- Jan
Received on Monday, 23 October 2006 12:46:38 UTC