
Re: String Matching -> Reg Ex is not always easy

From: Phil Archer <parcher@icra.org>
Date: Wed, 26 Mar 2008 12:08:03 +0000
Message-ID: <47EA3CA3.6060606@icra.org>
To: Public POWDER <public-powderwg@w3.org>

More on this...

I've been playing with the regular expressions that one would need to 
write to capture the meaning of the string elements. To do this I've set 
up a little tool at [1] that allows you to put in a Reg ex and a string 
and see if the two match.

Let's start with includehosts. The Reg Ex needs to be pretty specific so 
that it only matches a domain name. I ended up with this:

^\w+://(\w+\.)?example.org(:\d+)?\/

So it starts with one or more word characters (letters, digits and 
underscore, so none of ./?# can appear) and these must be followed 
by ://.

That, incidentally, knocks out ISAN numbers and mailto: URIs. It's only 
http, https and ftp that have // after the colon.

Then we need to allow sub domains but not just any string ending with 
example.org, hence (\w+\.)?

Then there's the supplied string, followed by an optional port number 
and then a /.

So this reg ex matches:

http://example.org/
http://www.example.org:45/

and does not match

http://www.example.org.bad.com/
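
As a rough check of the three examples above, here is a minimal Python 
sketch. It assumes Python's re module behaves like the tool at [1] for 
this pattern. The names HOST_RE, HOST_RE_ESCAPED and host_matches are 
made up for illustration, and the escaped variant is a suggestion, not 
part of the pattern as written.

```python
import re

# Pattern as written in this message (the dots in the host are unescaped,
# so each "." matches any single character).
HOST_RE = re.compile(r"^\w+://(\w+\.)?example.org(:\d+)?\/")

# Hypothetical variant, not from the message: dots escaped so that they
# only match a literal full stop.
HOST_RE_ESCAPED = re.compile(r"^\w+://(\w+\.)?example\.org(:\d+)?/")


def host_matches(pattern, iri):
    """Return True if the compiled pattern matches at the start of the IRI."""
    return pattern.search(iri) is not None


for iri in ("http://example.org/",
            "http://www.example.org:45/",
            "http://www.example.org.bad.com/"):
    print(iri, host_matches(HOST_RE, iri))
```

Two things fall out of trying this. With the dots unescaped, a string 
such as http://exampleXorg/ also matches, which the escaped variant 
rejects. And because (\w+\.)? is optional rather than repeated, only a 
single sub domain label is accepted, so http://a.b.example.org/ does 
not match either pattern.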

Let's look at scheme...

<includeschemes>http</includeschemes>

That's pretty easy:

^http://

So you just put an anchor at the start, then the supplied string, then ://.
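
The recipe in the previous sentence (anchor, supplied string, ://) can 
be sketched in Python as follows. The function name scheme_regex is 
hypothetical, and stripping whitespace and escaping the supplied string 
are assumptions added here, not something POWDER specifies.

```python
import re


def scheme_regex(scheme):
    """Build '^<scheme>://' from the text of an includeschemes element.

    Assumption: the element holds a single scheme; surrounding whitespace
    is stripped and the value is escaped before being used in a pattern.
    """
    return "^" + re.escape(scheme.strip()) + "://"


pattern = scheme_regex("http")
print(pattern)
print(bool(re.search(pattern, "http://example.org/")))
print(bool(re.search(pattern, "https://example.org/")))
```

Note that the trailing :// is what keeps the http pattern from also 
matching https IRIs.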

For:

<includepathstartswith>/foo</includepathstartswith>

I ended up with

^\w+://[^\:\/\?\#\@ ]+\/foo

And so on.
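
For what it's worth, the includepathstartswith pattern above can be 
generated and tried out with a few lines of Python. This is only a 
sketch: path_prefix_regex and path_starts_with are hypothetical names, 
and escaping the supplied string is an assumption.

```python
import re

# Scheme and host part exactly as in the pattern above: a scheme, "://",
# then a run of characters that excludes : / ? # @ and space.
SCHEME_AND_HOST = r"^\w+://[^\:\/\?\#\@ ]+"


def path_prefix_regex(prefix):
    """Build a pattern for <includepathstartswith>prefix</...>."""
    return SCHEME_AND_HOST + re.escape(prefix)


def path_starts_with(prefix, iri):
    """Return True if the IRI's path begins with the given prefix."""
    return re.search(path_prefix_regex(prefix), iri) is not None


for iri in ("http://example.org/foo/bar",
            "http://example.org/bar/foo"):
    print(iri, path_starts_with("/foo", iri))
```

One thing this shows up: because the host part excludes the colon, an 
IRI with an explicit port such as http://example.org:8080/foo is not 
matched by the pattern as written.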

The question is... is mapping each IRI constraint to a regular 
expression like this actually better than just using the element names? 
What's the benefit, Stasinos?

P

[1] http://www.icra.org/regularexpression/


Stasinos Konstantopoulos wrote:
> On Tue Mar 25 11:55:42 2008 Phil Archer said:
> 
>> But let's make this progressively more complex and see whether we can  
>> convert _all_ possible POWDER IRI sets into POWDER-S versions with a  
>> single reg ex.
> 
> Why a single regex? This is an unnecessary complication.
> Each string pattern need only map to a single reg ex pattern, and a URI
> has to pass all reg ex tests to match the iriset. Conjunction can be
> very naturally represented in OWL/RDF, so that's not an issue.
> 
>> [easy disjunction snipped]
>>
>> OK, let's cut to the chase. POWDER allows very sophisticated IRI set  
>> definitions like this:
>>
>> <iriset>
>>   <includeschemes>http https</includeschemes>
>>   <includehosts>example.org example.com</includehosts>
>>   <includepathcontains>foo bar</includepathcontains>
>>   <includepathcontains>red blue</includepathcontains>
>> </iriset>
>>
>> Here we have either http or https. OK, in reg ex that's https? add in  
>> the host and we get
>>
>> ^https?://(.*\.)?(example.com|example.org)
> 
> I would strongly discourage the XSLT author from trying to get smart and
> suggest they keep it simple instead:
> 
> <includeregex>(^http) | (^https)</includeregex>
> <includeregex>(^[^/]+//example.org)| (^[^/]+//example.com)</includeregex>
> <includeregex>(^[^/]+//[^/]/.*foo) | (^[^/]+//[^/]/.*bar) </includeregex>
> <includeregex>(^[^/]+//[^/]/.*red) | (^[^/]+//[^/]/.*blue) </includeregex>
> 
> Easy, and straight to the point. There's lots of automatic finite-state
> combination and optimization tools out there, if the implementor needs
> to worry about efficient application of the patterns. Although,
> implementors who care about efficiency are better off directly
> implementing the extensions in the first place.
> 
>> And would anyone like to hazard a bit of code that rendered this as a  
>> reg ex:
>>
>> <iriset>
>>   <includeschemes>http https</includeschemes>
>>   <includehosts>example.org example.com</includehosts>
>>   <includepathcontains>foo bar</includepathcontains>
>>   <includepathcontains>red blue</includepathcontains>
>>   <excludeexactqueries>name1=value1&amp;name2=value2
>>     </excludeexactqueries>
>> </iriset>
> 
> Gladly. It is:
> 
> <includeregex>(^http) | (^https)</includeregex>
> <includeregex>(^[^/]+//example.org)| (^[^/]+//example.com)</includeregex>
> <includeregex>(^[^/]+//[^/]/.*foo) | (^[^/]+//[^/]/.*bar) </includeregex>
> <includeregex>(^[^/]+//[^/]/.*red) | (^[^/]+//[^/]/.*blue) </includeregex>
> <includeregex>(^[^?]+?.*name1=value1)</includeregex>
> <includeregex>(^[^?]+?.*name2=value2)</includeregex>
> 
> It doesn't look all that bad to me.
> 
> s
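
The idea in the quoted text, that each constraint maps to its own 
pattern and an IRI has to pass every one of them, can be sketched in 
Python like this. The patterns below are simplified stand-ins written 
for this illustration, not the includeregex values quoted above, and 
PATTERNS and in_iriset are hypothetical names.

```python
import re

# One hypothetical pattern per constraint in the example iriset.
PATTERNS = [
    r"^https?://",                                           # includeschemes
    r"^\w+://(\w+\.)?(example\.org|example\.com)(:\d+)?/",   # includehosts
    r"^\w+://[^/?#]+/[^?#]*(foo|bar)",                       # includepathcontains
    r"^\w+://[^/?#]+/[^?#]*(red|blue)",                      # includepathcontains
]


def in_iriset(iri, patterns=PATTERNS):
    """Conjunction: the IRI is in the set only if every pattern matches."""
    return all(re.search(p, iri) is not None for p in patterns)


for iri in ("http://www.example.org/foo/red.html",
            "http://example.org/foo/index.html",
            "ftp://example.com/bar/blue"):
    print(iri, in_iriset(iri))
```

Keeping the patterns separate means each one stays short, and the 
conjunction lives in the surrounding logic rather than inside one 
large expression.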
Received on Wednesday, 26 March 2008 12:47:06 GMT
