RE: RDQL Regular expressions and pattern searches

-------- Original Message --------
> From: Phil Dawes <mailto:pdawes@users.sourceforge.net>
> Date: 11 May 2004 17:50
> 
> Hi Andy, Hi All,
> 
> An area where I've found RDQL a little underspecified is in the syntax
> for regex searches. RAP encloses the regex in quotes (e.g. "/regex/"),
> I'm not sure what sesame does, and Jena regexes must be unquoted
> (e.g. /regex/).
> 
> Unfortunately the grammar in the spec doesn't specify this so I'm not
> sure which is 'correct'.

There isn't a conformance spec for RDQL; 'correct' relies on the
implementers agreeing.  In Jena it is the latter: "" makes it a string, not
a regular expression, and regular expressions are not strings.  The syntax
follows Perl, with the small addition that the leading "m" is optional for
more delimiter characters (any non-alphanumeric except " or ', which would
be confused with strings).  This is because tests might be on URIs, which
have / in them, so a test such as

   ?p =~ !^http://host/namespace#!

can be written without escaping every slash.
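
As a rough illustration of the delimiter rule described above (this is a
Python sketch, not Jena's parser; the function name parse_pattern_literal
is made up for the example), a literal can be split into pattern and
flags like this:

```python
def parse_pattern_literal(text):
    """Split a Perl-style pattern literal into (pattern, flags).

    Hypothetical helper: the leading "m" is optional, and the delimiter
    may be any non-alphanumeric, non-space character except " or '.
    """
    s = text.strip()
    if not s:
        raise ValueError("empty pattern literal")
    # Drop an optional leading "m" when a delimiter follows it.
    if s[0] == "m" and len(s) > 1 and not s[1].isalnum():
        s = s[1:]
    delim = s[0]
    if delim.isalnum() or delim.isspace() or delim in "\"'":
        raise ValueError("not a pattern delimiter: %r" % delim)
    end = s.rfind(delim)
    if end == 0:
        raise ValueError("unterminated pattern literal")
    pattern, flags = s[1:end], s[end + 1:]
    if flags and not flags.isalpha():
        raise ValueError("bad flags: %r" % flags)
    return pattern, flags
```

With this sketch, "!^http://host/namespace#!" yields the pattern
"^http://host/namespace#" with no flags, "/phi/i" yields "phi" with the
flag "i", and a quoted form such as '"/regex/"' is rejected because it
reads as a string.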

> 
> 
> While I'm on the subject, the other unfortunate thing about RDQL
> pattern searches is that most relational databases don't support
> regexes and so for rdb backed stores the filtering has to be done
> in-memory.

PostgreSQL (operator ~) and MySQL (operator REGEXP) provide regular
expressions.

> Unfortunately the most common usage of this feature for me
> is to do a global label search, which involves e.g.
> 
> SELECT ?subj, ?label
> WHERE (?subj,<rdfs:label>,?label)
> AND ?label =~ '/phi/i'

We plan to compile that to SQL in Jena.  By looking at the regular
expression, the common cases of case-insensitive substring searching and
string prefixes can quite simply be turned into appropriate SQL.  This
works through Jena's query handler abstraction, which includes letting the
store take over all or part of the query evaluation.  A standard utility
to do the regexp analysis will probably be provided, so that a query will
normally have certain string operations marked as being the simpler cases.
Those can then turn into an SQL LIKE or SQL regexps; other storage systems
might be able to do a good job of string prefix testing.
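
To make the idea concrete, here is a minimal Python sketch of that kind
of analysis.  It is not Jena code: the function regex_to_like, the column
argument and the choice of '!' as the LIKE escape character are all
assumptions made for the example.  It only recognises a plain literal
(substring search) or a literal anchored with ^ (prefix search) and gives
up on anything else, so the caller can fall back to in-memory filtering.

```python
# Characters that make a pattern more than a plain literal.
REGEX_META = set(".^$*+?()[]{}|\\")

def regex_to_like(column, pattern, flags=""):
    """Return (sql_fragment, parameter), or None if the pattern is not
    one of the simple cases (literal substring or ^literal prefix)."""
    if flags not in ("", "i"):
        return None
    anchored = pattern.startswith("^")
    literal = pattern[1:] if anchored else pattern
    if not literal or any(ch in REGEX_META for ch in literal):
        return None
    # Escape LIKE wildcards in the literal, using '!' as escape character.
    escaped = (literal.replace("!", "!!")
                      .replace("%", "!%")
                      .replace("_", "!_"))
    like = escaped + "%" if anchored else "%" + escaped + "%"
    if flags == "i":
        # Lower-case both sides; LIKE case rules vary between databases.
        sql = "LOWER(%s) LIKE LOWER(?) ESCAPE '!'" % column
    else:
        sql = "%s LIKE ? ESCAPE '!'" % column
    return sql, like
```

For the label search in the quoted query, regex_to_like("label", "phi",
"i") gives a LOWER(label) LIKE LOWER(?) test with the parameter "%phi%",
while a pattern such as "ph.*i" returns None and would have to go to the
database's regexp operator or be evaluated in memory.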

	Andy

> 
> This is obviously a bit of a problem to do in-memory for large
> stores. Is there potential for a more restricted form of pattern in
> RDQL that could be done in-database?
> e.g. something like SeRQL's 'LIKE' clause
> 
> (for those unfamiliar with seRQL/sesame, this does a case-insensitive
> match with a single wildcard character '*' which matches zero or more
> characters - this nicely maps to LIKE and % in SQL).
> 
> Cheers,
> 
> Phil

Received on Wednesday, 12 May 2004 07:09:55 UTC