Re: String Matching -> Reg Ex is not always easy

Stasinos Konstantopoulos wrote:
> On Tue Mar 25 11:55:42 2008 Phil Archer said:
> 
>> But let's make this progressively more complex and see whether we can  
>> convert _all_ possible POWDER IRI sets into POWDER-S versions with a  
>> single reg ex.
> 
> Why a single regex? This is an unnecessary complication.

Ah, right. We have said that except for in/exclude query contains and 
in/exclude path contains, each element can only appear once. This helps 
to minimise mistakes and makes the validation easier.

> Each string pattern need only map to a single reg ex pattern, and a URI
> has to pass all reg ex tests to match the iriset. 

Yes, in that case, it makes sense. We'd have to change things a little 
to say that the 'once per IRI set' rule applies to POWDER but not to 
includeregex in POWDER-S.
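
To make the "pass all reg ex tests" idea concrete, here is a minimal sketch in Python. It is only an illustration: the function name matches_iriset and the two patterns are my own assumptions, not anything taken from the POWDER drafts, and the host pattern assumes that a listed host also covers its subdomains.

```python
import re

# Hypothetical patterns, one per POWDER element. An IRI is in the set
# only if every pattern matches (conjunction).
INCLUDE_PATTERNS = [
    r"^https?:",                                            # scheme is http or https
    r"^[^/]+//([^/]*\.)?(example\.org|example\.com)(/|$)",  # host or a subdomain of it
]

def matches_iriset(iri, patterns=INCLUDE_PATTERNS):
    """Return True only if the IRI passes every regex test."""
    return all(re.search(p, iri) for p in patterns)
```

Because each pattern is tested separately, the same element could be repeated any number of times in POWDER-S without changing this logic.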


> Conjunction can be
> very naturally represented in OWL/RDF, so that's not an issue.
> 
>> [easy disjunction snipped]
>>
>> OK, let's cut to the chase. POWDER allows very sophisticated IRI set  
>> definitions like this:
>>
>> <iriset>
>>   <includeschemes>http https</includeschemes>
>>   <includehosts>example.org example.com</includehosts>
>>   <includepathcontains>foo bar</includepathcontains>
>>   <includepathcontains>red blue</includepathcontains>
>> </iriset>
>>
>> Here we have either http or https. OK, in reg ex that's https? add in  
>> the host and we get
>>
>> ^https?://(.*\.)?(example.com|example.org)
> 
> I would strongly discourage the XSLT author from trying to get smart and
> suggest they keep it simple instead:
> 
> <includeregex>(^http) | (^https)</includeregex>
> <includeregex>(^[^/]+//example.org)| (^[^/]+//example.com)</includeregex>
> <includeregex>(^[^/]+//[^/]/.*foo) | (^[^/]+//[^/]/.*bar) </includeregex>
> <includeregex>(^[^/]+//[^/]/.*red) | (^[^/]+//[^/]/.*blue) </includeregex>
> 
> Easy, and straight to the point. 

Yes, that's OK.
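
For what it's worth, the one-element-to-one-pattern mapping could be sketched like this in Python rather than XSLT. Everything here is an assumption for illustration: the BUILDERS table, the function names and the exact pattern shapes are mine. In particular I have written the host segment as [^/]+ and escaped the values so that a "." is matched literally, which differs slightly from the patterns quoted above.

```python
import re

def alternatives(values):
    # Escape each whitespace-separated value so '.' and friends match literally.
    return "|".join(re.escape(v) for v in values.split())

# Hypothetical builders, one per POWDER element used in the example above.
BUILDERS = {
    "includeschemes": lambda v: r"^(" + alternatives(v) + r"):",
    "includehosts": lambda v: r"^[^/]+//([^/]*\.)?(" + alternatives(v) + r")(/|$)",
    "includepathcontains": lambda v: r"^[^/]+//[^/]+/[^?#]*(" + alternatives(v) + r")",
}

def iriset_to_regexes(elements):
    """Map (element name, value list) pairs to one regex each, in order."""
    return [BUILDERS[name](value) for name, value in elements]

IRISET = [
    ("includeschemes", "http https"),
    ("includehosts", "example.org example.com"),
    ("includepathcontains", "foo bar"),
    ("includepathcontains", "red blue"),
]
PATTERNS = iriset_to_regexes(IRISET)
```

Each element produces exactly one pattern, so the two includepathcontains elements simply yield two entries in PATTERNS, and an IRI would have to match all four.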

> There's lots of automatic finite-state
> combination and optimization tools out there, if the implementor needs
> to worry about efficient application of the patterns. Although,
> implementors who care about efficiency are better off directly
> implementing the extensions in the first place.

OK, that answers my next point which was that people often complain 
about the processing overhead of using a single reg ex, never mind a 
load of them.

> 
>> And would anyone like to hazard a bit of code that rendered this as a  
>> reg ex:
>>
>> <iriset>
>>   <includeschemes>http https</includeschemes>
>>   <includehosts>example.org example.com</includehosts>
>>   <includepathcontains>foo bar</includepathcontains>
>>   <includepathcontains>red blue</includepathcontains>
>>   <excludeexactqueries>name1=value1&amp;name2=value2
>>     </excludeexactqueries>
>> </iriset>
> 
> Gladly. It is:
> 
> <includeregex>(^http) | (^https)</includeregex>
> <includeregex>(^[^/]+//example.org)| (^[^/]+//example.com)</includeregex>
> <includeregex>(^[^/]+//[^/]/.*foo) | (^[^/]+//[^/]/.*bar) </includeregex>
> <includeregex>(^[^/]+//[^/]/.*red) | (^[^/]+//[^/]/.*blue) </includeregex>
> <includeregex>(^[^?]+?.*name1=value1)</includeregex>
> <includeregex>(^[^?]+?.*name2=value2)</includeregex>
> 
> It doesn't look all that bad to me.

True, but you've not negated the query strings... would you keep 
excluderegex?
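
To show what I mean by negation, here is a rough Python sketch that assumes a separate list of exclude patterns is kept, which is exactly the open question. The function name in_iriset and the patterns are hypothetical, and the &amp; in the XML is taken as a plain & once decoded.

```python
import re

def in_iriset(iri, include_patterns, exclude_patterns=()):
    """True if every include pattern matches and no exclude pattern does."""
    passes_includes = all(re.search(p, iri) for p in include_patterns)
    hits_exclude = any(re.search(p, iri) for p in exclude_patterns)
    return passes_includes and not hits_exclude

# Hypothetical patterns for the scheme and host parts of the example.
INCLUDES = [
    r"^https?:",
    r"^[^/]+//([^/]*\.)?(example\.org|example\.com)(/|$)",
]

# excludeexactqueries: the whole query string must equal the given value,
# so the pattern is anchored at the first '?' and at the end or a fragment.
EXCLUDES = [
    r"^[^?#]*\?name1=value1&name2=value2(#|$)",
]
```

With these lists, in_iriset("http://example.org/foo?name1=value1&name2=value2", INCLUDES, EXCLUDES) is False because the query matches exactly, while the same IRI with only name1=value1 as its query stays in the set.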

Could I ask you please to create a POWDER-S OWL class that captured this?

Phil.

Received on Wednesday, 26 March 2008 11:10:59 UTC