String Matching -> Reg Ex is not always easy

From: Phil Archer <parcher@icra.org>
Date: Tue, 25 Mar 2008 11:55:42 +0000
Message-ID: <47E8E83E.80208@icra.org>
To: Public POWDER <public-powderwg@w3.org>

N.B. This discussion refers to the Grouping Doc dated 20 March and 
available at [1], currently only with member access. This is expected to 
be published at the same URI within the next 24 hours or so.


Over on the member list it has been suggested that POWDER-S should 
_only_ support IRI constraint by regular expression [2], although POWDER 
would retain things like includehosts for ease of use.

The argument is initially attractive since we expect to see IRI sets 
like this most commonly:

<iriset>
   <includehosts>example.org</includehosts>
</iriset>

i.e. a single domain name given as the IRI set, so we're describing 
'everything on example.org'. This can be transformed into POWDER-S thus:

<wdr:iriset>
   <owl:intersectionOf rdf:parseType="Collection">
     <owl:Restriction>
       <owl:onProperty rdf:resource="&wdr;includeregex" />
       <owl:hasValue>example.org</owl:hasValue>
     </owl:Restriction>
   </owl:intersectionOf>
</wdr:iriset>

i.e. the reg ex is the same in both cases. Easy. Since we expect POWDER 
to be the main transport mechanism and for POWDER-S to (almost) always 
be derived programmatically, it doesn't matter how complex a POWDER-S 
doc is.

But let's make this progressively more complex and see whether we can 
convert _all_ possible POWDER IRI sets into POWDER-S versions with a 
single reg ex.

Let's try multiple hosts.

   <includehosts>example.org example.com</includehosts>

becomes

example.org|example.com
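
The multiple-host conversion above can be sketched in a few lines of Python. This is only an illustration: the function name hosts_to_regex is hypothetical, and escaping the dots (so that "." matches only a literal dot) is an addition that the hand-written pattern above does not make.

```python
import re

def hosts_to_regex(includehosts: str) -> str:
    """Turn a whitespace-separated includehosts value into a regex alternation."""
    hosts = includehosts.split()
    # re.escape keeps the dots in the host names literal (assumption: the
    # hand-written pattern in the text leaves them unescaped).
    return "|".join(re.escape(host) for host in hosts)

pattern = hosts_to_regex("example.org example.com")
print(pattern)  # example\.org|example\.com
```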

OK, let's cut to the chase. POWDER allows very sophisticated IRI set 
definitions like this:

<iriset>
   <includeschemes>http https</includeschemes>
   <includehosts>example.org example.com</includehosts>
   <includepathcontains>foo bar</includepathcontains>
   <includepathcontains>red blue</includepathcontains>
</iriset>

Here we have either http or https. OK, in reg ex that's "https?". Add in 
the host and we get

^https?://(.*\.)?(example\.com|example\.org)

But those multiple path constraints are going to kill us. They say that 
the path must contain either foo or bar AND either red or blue _in any 
order_.

So the following all match:
http://example.com/red/bar
http://example.com/foo/blue
https://example.org/bluefoo/bar.html

And this doesn't:

http://example.org/foo/bar/
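
For this particular IRI set, one way to get "in any order" into a single reg ex is to use one lookahead per includepathcontains element. The Python sketch below is only that, a sketch: the function name iriset_to_regex is hypothetical, it covers just these three element types, and the handling of subdomains and an optional port is an assumption rather than anything taken from the Grouping Doc.

```python
import re

def iriset_to_regex(schemes, hosts, path_contains):
    """Build one regex from includeschemes, includehosts and a list of
    includepathcontains values (each a whitespace-separated string)."""
    def alt(value):
        # Whitespace-separated tokens become an escaped alternation.
        return "|".join(re.escape(token) for token in value.split())

    # One lookahead per includepathcontains element, so each element must be
    # satisfied somewhere in the path, in any order.
    lookaheads = "".join(f"(?=[^?#]*(?:{alt(p)}))" for p in path_contains)
    return (
        f"^(?:{alt(schemes)})://"
        f"(?:[^/?#]*\\.)?(?:{alt(hosts)})(?::\\d+)?/"  # assumed host handling
        f"{lookaheads}"
    )

pattern = re.compile(
    iriset_to_regex("http https", "example.org example.com",
                    ["foo bar", "red blue"])
)

for iri in ("http://example.com/red/bar",
            "http://example.com/foo/blue",
            "https://example.org/bluefoo/bar.html",
            "http://example.org/foo/bar/"):
    print(iri, bool(pattern.match(iri)))
```

The first three IRIs match and the last does not, in line with the lists above. This says nothing about whether every valid POWDER IRI set can be handled the same way.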

Now, I _could_ work out a Reg Ex that did all this, but I'm not sure I 
could write some code that turned _any valid_ POWDER IRI set definition 
into a Reg Ex.

And would anyone like to hazard a bit of code that rendered this as a 
reg ex:

<iriset>
   <includeschemes>http https</includeschemes>
   <includehosts>example.org example.com</includehosts>
   <includepathcontains>foo bar</includepathcontains>
   <includepathcontains>red blue</includepathcontains>
   <excludeexactqueries>name1=value1&amp;name2=value2
     </excludeexactqueries>
</iriset>

Bearing in mind that this means that if the query string contains the 
pairs name1=value1 and name2=value2, in any order, then those IRIs are 
to be excluded?

Yikes!

So, I think it would be a lot easier to retain string-based matching in 
POWDER-S.
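
By way of comparison, a string-based check for the last IRI set is short when each constraint is tested separately. The Python sketch below is illustrative only: the function name in_iriset is hypothetical, the constraints are hard-coded from the example, and it assumes that "exact" means the query consists of exactly the listed name=value pairs with their order ignored.

```python
from urllib.parse import urlsplit, parse_qsl

SCHEMES = {"http", "https"}
HOSTS = ("example.org", "example.com")
PATH_CONTAINS = (("foo", "bar"), ("red", "blue"))
# Assumption: "exact" means the same set of pairs, in any order.
EXCLUDED_QUERY = sorted(parse_qsl("name1=value1&name2=value2"))

def in_iriset(iri: str) -> bool:
    parts = urlsplit(iri)
    host = parts.hostname or ""
    if parts.scheme not in SCHEMES:
        return False
    # A listed host matches itself and any of its subdomains.
    if not any(host == h or host.endswith("." + h) for h in HOSTS):
        return False
    # Every includepathcontains element needs at least one token in the path.
    for tokens in PATH_CONTAINS:
        if not any(token in parts.path for token in tokens):
            return False
    # excludeexactqueries: compare the parsed pairs, ignoring their order.
    if sorted(parse_qsl(parts.query)) == EXCLUDED_QUERY:
        return False
    return True

print(in_iriset("http://example.com/red/bar"))
print(in_iriset("http://example.com/red/bar?name2=value2&name1=value1"))
```

The first call returns True and the second False, because the second query holds the excluded pairs in reversed order.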

Phil.


[1] http://www.w3.org/2007/powder/Group/powder-grouping/20080320.html
[2] http://lists.w3.org/Archives/Member/member-powderwg/2008Mar/0119.html

-- 
Phil Archer
Chief Technical Officer,
Family Online Safety Institute
w. http://www.fosi.org/people/philarcher/
Received on Tuesday, 25 March 2008 11:56:25 GMT
