Re: String Matching -> Reg Ex is not always easy

More on this...

I've been playing with the regular expressions that one would need to 
write to capture the meaning of the string elements. To do this I've set 
up a little tool at [1] that allows you to put in a Reg ex and a string 
and see if the two match.

Let's start with includehosts. The Reg Ex needs to be pretty specific so 
that it only matches a domain name. I ended up with this:

^\w+://(\w+\.)?example\.org(:\d+)?\/

So it starts with one or more word characters (letters, digits or 
underscore, so none of ./?#), and these must be followed by ://.

That, incidentally, knocks out ISAN numbers and mailto URLs, which have 
no // after the colon. Of the common schemes, it's http, https and ftp 
that have // after the : .

Then we need to allow sub domains but not just any string ending with 
example.org, hence (\w+\.)?

Then there's the supplied string, followed by an optional port number 
and then a /.

So this reg ex matches:

http://example.org/
http://www.example.org:45/

and does not match

http://www.example.org.bad.com/
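
As a quick check of the above, here is a minimal sketch in Python using 
the standard re module. The helper name host_regex is made up for 
illustration, and the dots in the supplied host are escaped via 
re.escape so that they match only a literal dot.

```python
import re

def host_regex(host: str) -> re.Pattern:
    """Build an includehosts-style pattern for one supplied host.

    Hypothetical helper, not part of POWDER itself.
    """
    # scheme, ://, at most one sub domain label, the literal host,
    # an optional port, then a mandatory /
    return re.compile(r"^\w+://(\w+\.)?" + re.escape(host) + r"(:\d+)?/")

pattern = host_regex("example.org")

for url in (
    "http://example.org/",
    "http://www.example.org:45/",
    "http://www.example.org.bad.com/",
):
    print(url, bool(pattern.search(url)))
```

Note that (\w+\.)? allows at most one sub domain label, and \w does not 
include the hyphen, so http://a.b.example.org/ would not match this 
pattern as written.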

Let's look at scheme...

<includeschemes>http</includeschemes>

That's pretty easy:

^http://

So you just put an anchor at the start, then the supplied string, then ://.
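
That rule can be sketched in Python along these lines. The function name 
scheme_regex is hypothetical, and the sketch assumes the element content 
is a white space separated list of schemes, each of which becomes one 
alternative.

```python
import re

def scheme_regex(schemes: str) -> re.Pattern:
    """Turn the content of includeschemes into an anchored pattern.

    Hypothetical helper; assumes a white space separated list of schemes.
    """
    # split() also discards stray leading or trailing white space
    alternatives = "|".join(re.escape(s) for s in schemes.split())
    # anchor at the start, then the supplied scheme(s), then ://
    return re.compile(r"^(?:" + alternatives + r")://")

http_only = scheme_regex("http")
http_or_https = scheme_regex("http https")

print(bool(http_only.search("http://example.org/")))
print(bool(http_only.search("https://example.org/")))
print(bool(http_or_https.search("https://example.org/")))
```

Because the :// comes straight after the scheme, a pattern built for 
http alone does not also match https.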

For:

<includepathstartswith>/foo</includepathstartswith>

I ended up with

^\w+://[^\:\/\?\#\@ ]+\/foo
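
A minimal sketch of the same mapping in Python, assuming the supplied 
value always begins with a /. The function name path_starts_with_regex 
is made up for illustration.

```python
import re

def path_starts_with_regex(prefix: str) -> re.Pattern:
    """Build an includepathstartswith-style pattern (hypothetical helper).

    Assumes the supplied prefix begins with a /.
    """
    # scheme, ://, a run of characters that cannot end the host part,
    # then the supplied path prefix taken literally
    return re.compile(r"^\w+://[^:/?#@ ]+" + re.escape(prefix))

starts_with_foo = path_starts_with_regex("/foo")

for url in (
    "http://example.org/foo/bar",
    "http://example.org/bar/foo",
    "http://example.org:8080/foo",
):
    print(url, bool(starts_with_foo.search(url)))
```

One thing this shows up: because : is excluded from the host part, a 
URL with an explicit port such as http://example.org:8080/foo does not 
match, so the pattern may need an optional (:\d+)? before the path.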

And so on.

The question is... is mapping each IRI constraint to a regular 
expression like this actually better than just using the element names? 
What's the benefit, Stasinos?

P

[1] http://www.icra.org/regularexpression/


Stasinos Konstantopoulos wrote:
> On Tue Mar 25 11:55:42 2008 Phil Archer said:
> 
>> But let's make this progressively more complex and see whether we can  
>> convert _all_ possible POWDER IRI sets into POWDER-S versions with a  
>> single reg ex.
> 
> Why a single regex? This is an unnecessary complication.
> Each string pattern need only map to a single reg ex pattern, and a URI
> has to pass all reg ex tests to match the iriset. Conjunction can be
> very naturally represented in OWL/RDF, so that's not an issue.
> 
>> [easy disjunction snipped]
>>
>> OK, let's cut to the chase. POWDER allows very sophisticated IRI set  
>> definitions like this:
>>
>> <iriset>
>>   <includeschemes>http https</includeschemes>
>>   <includehosts>example.org example.com</includehosts>
>>   <includepathcontains>foo bar</includepathcontains>
>>   <includepathcontains>red blue</includepathcontains>
>> </iriset>
>>
>> Here we have either http or https. OK, in reg ex that's https? add in  
>> the host and we get
>>
>> ^https?://(.*\.)?(example.com|example.org)
> 
> I would strongly discourage the XSLT author from trying to get smart and
> suggest they keep it simple instead:
> 
> <includeregex>(^http) | (^https)</includeregex>
> <includeregex>(^[^/]+//example.org)| (^[^/]+//example.com)</includeregex>
> <includeregex>(^[^/]+//[^/]/.*foo) | (^[^/]+//[^/]/.*bar) </includeregex>
> <includeregex>(^[^/]+//[^/]/.*red) | (^[^/]+//[^/]/.*blue) </includeregex>
> 
> Easy, and straight to the point. There's lots of automatic finite-state
> combination and optimization tools out there, if the implementor needs
> to worry about efficient application of the patterns. Although,
> implementors who care about efficiency are better off directly
> implementing the extensions in the first place.
> 
>> And would anyone like to hazard a bit of code that rendered this as a  
>> reg ex:
>>
>> <iriset>
>>   <includeschemes>http https</includeschemes>
>>   <includehosts>example.org example.com</includehosts>
>>   <includepathcontains>foo bar</includepathcontains>
>>   <includepathcontains>red blue</includepathcontains>
>>   <excludeexactqueries>name1=value1&amp;name2=value2
>>     </excludeexactqueries>
>> </iriset>
> 
> Gladly. It is:
> 
> <includeregex>(^http) | (^https)</includeregex>
> <includeregex>(^[^/]+//example.org)| (^[^/]+//example.com)</includeregex>
> <includeregex>(^[^/]+//[^/]/.*foo) | (^[^/]+//[^/]/.*bar) </includeregex>
> <includeregex>(^[^/]+//[^/]/.*red) | (^[^/]+//[^/]/.*blue) </includeregex>
> <includeregex>(^[^?]+?.*name1=value1)</includeregex>
> <includeregex>(^[^?]+?.*name2=value2)</includeregex>
> 
> It doesn't look all that bad to me.
> 
> s

Received on Wednesday, 26 March 2008 12:47:06 UTC