
Re: String Matching -> Reg Ex is not always easy

From: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>
Date: Wed, 26 Mar 2008 12:11:57 +0200
To: Public POWDER <public-powderwg@w3.org>
Message-ID: <20080326101157.GA3254@iit.demokritos.gr>

On Tue Mar 25 11:55:42 2008 Phil Archer said:

> But let's make this progressively more complex and see whether we can  
> convert _all_ possible POWDER IRI sets into POWDER-S versions with a  
> single reg ex.

Why a single regex? This is an unnecessary complication.
Each string pattern need only map to a single reg ex pattern, and a URI
has to pass all reg ex tests to match the iriset. Conjunction can be
very naturally represented in OWL/RDF, so that's not an issue.
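
A minimal sketch of this conjunction reading, in Python. The helper name `matches_all` and the two patterns are made up for illustration and are not taken from the POWDER documents:

```python
import re

def matches_all(iri, patterns):
    """Return True only if every pattern is found somewhere in the IRI."""
    return all(re.search(p, iri) is not None for p in patterns)

# Illustrative patterns (assumptions): one per string pattern in the iriset.
patterns = [
    r"^https?:",                    # scheme test
    r"^[^/]+//example\.org(/|$)",   # host test
]

print(matches_all("http://example.org/foo", patterns))
print(matches_all("ftp://example.org/foo", patterns))
```

The first IRI passes both tests; the second fails the scheme test and is therefore outside the set, whatever the other patterns say.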

> [easy disjunction snipped]
> 
> OK, let's cut to the chase. POWDER allows very sophisticated IRI set  
> definitions like this:
>
> <iriset>
>   <includeschemes>http https</includeschemes>
>   <includehosts>example.org example.com</includehosts>
>   <includepathcontains>foo bar</includepathcontains>
>   <includepathcontains>red blue</includepathcontains>
> </iriset>
>
> Here we have either http or https. OK, in reg ex that's https? add in  
> the host and we get
>
> ^https?://(.*\.)?(example.com|example.org)

I would strongly discourage the XSLT author from trying to get smart and
suggest they keep it simple instead:

<includeregex>(^http)|(^https)</includeregex>
<includeregex>(^[^/]+//example.org)|(^[^/]+//example.com)</includeregex>
<includeregex>(^[^/]+//[^/]+/.*foo)|(^[^/]+//[^/]+/.*bar)</includeregex>
<includeregex>(^[^/]+//[^/]+/.*red)|(^[^/]+//[^/]+/.*blue)</includeregex>

Easy, and straight to the point. There are plenty of automatic finite-state
combination and optimization tools out there, if the implementor needs
to worry about efficient application of the patterns. That said,
implementors who care about efficiency are better off implementing
the extensions directly in the first place.
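
As a rough sketch of how mechanical this mapping is, here is a Python version that builds one regex per element, with one alternative per whitespace-separated value. The function names `alternation` and `iriset_to_regexes` are hypothetical, the prefixes follow the keep-it-simple style used above, and values are escaped with `re.escape`, which is an addition of this sketch:

```python
import re

def alternation(prefix, values):
    """Build one regex with one parenthesised alternative per value."""
    return "|".join("(" + prefix + re.escape(v) + ")" for v in values.split())

def iriset_to_regexes(schemes, hosts, path_contains):
    """Map each iriset element to exactly one regex (hypothetical helper)."""
    regexes = [
        alternation("^", schemes),        # includeschemes
        alternation("^[^/]+//", hosts),   # includehosts
    ]
    # Each includepathcontains element becomes its own regex.
    for group in path_contains:
        regexes.append(alternation("^[^/]+//[^/]+/.*", group))
    return regexes

regexes = iriset_to_regexes(
    "http https",
    "example.org example.com",
    ["foo bar", "red blue"],
)
for r in regexes:
    print(r)

# An IRI is in the set only if every regex matches it.
iri = "https://example.com/a/foo/red"
print(all(re.search(r, iri) for r in regexes))
```

No attempt is made to merge the patterns into a single expression; the conjunction is left to whatever applies them.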

> And would anyone like to hazard a bit of code that rendered this as a  
> reg ex:
>
> <iriset>
>   <includeschemes>http https</includeschemes>
>   <includehosts>example.org example.com</includehosts>
>   <includepathcontains>foo bar</includepathcontains>
>   <includepathcontains>red blue</includepathcontains>
>   <excludeexactqueries>name1=value1&amp;name2=value2
>     </excludeexactqueries>
> </iriset>

Gladly. It is:

<includeregex>(^http)|(^https)</includeregex>
<includeregex>(^[^/]+//example.org)|(^[^/]+//example.com)</includeregex>
<includeregex>(^[^/]+//[^/]+/.*foo)|(^[^/]+//[^/]+/.*bar)</includeregex>
<includeregex>(^[^/]+//[^/]+/.*red)|(^[^/]+//[^/]+/.*blue)</includeregex>
<includeregex>(^[^?]+\?.*name1=value1)</includeregex>
<includeregex>(^[^?]+\?.*name2=value2)</includeregex>

It doesn't look all that bad to me.

s
Received on Wednesday, 26 March 2008 10:12:44 GMT
