- From: Phil Archer <parcher@icra.org>
- Date: Wed, 26 Mar 2008 12:08:03 +0000
- To: Public POWDER <public-powderwg@w3.org>
More on this...

I've been playing with the regular expressions that one would need to
write to capture the meaning of the string elements. To do this I've set
up a little tool at [1] that allows you to put in a reg ex and a string
and see if the two match.

Let's start with includehosts.

The reg ex needs to be pretty specific so that it only matches a domain
name. I ended up with this:

^\w+://(\w+\.)?example\.org(:\d+)?\/

So it starts by allowing one or more word characters (so none of . / ? #),
and these must be followed by ://. That, incidentally, knocks out ISAN
numbers and mailto: URLs. Among the common schemes it's essentially http,
https and ftp that have // after the :.

Then we need to allow sub domains but not just any string ending with
example.org, hence (\w+\.)?

Then there's the supplied string, followed by an optional port number
and then a /.

So this reg ex matches:

http://example.org/
http://www.example.org:45/

and does not match

http://www.example.org.bad.com/

Let's look at scheme...

<includeschemes>http </includeschemes>

That's pretty easy:

^http://

So you just put an anchor at the start, then the supplied string, then
://.

For:

<includepathstartswith>/foo</includepathstartswith>

I ended up with

^\w+://[^\:\/\?\#\@ ]+\/foo

And so on.

The question is... is mapping each IRI constraint to a regular expression
like this actually better than just using the element names? What's the
benefit, Stasinos?

P

[1] http://www.icra.org/regularexpression/

Stasinos Konstantopoulos wrote:
> On Tue Mar 25 11:55:42 2008 Phil Archer said:
>
>> But let's make this progressively more complex and see whether we can
>> convert _all_ possible POWDER IRI sets into POWDER-S versions with a
>> single reg ex.
>
> Why a single regex? This is an unnecessary complication.
> Each string pattern need only map to a single reg ex pattern, and a URI
> has to pass all reg ex tests to match the iriset. Conjunction can be
> very naturally represented in OWL/RDF, so that's not an issue.
>
>> [easy disjunction snipped]
>>
>> OK, let's cut to the chase.
>> POWDER allows very sophisticated IRI set definitions like this:
>>
>> <iriset>
>>   <includeschemes>http https</includeschemes>
>>   <includehosts>example.org example.com</includehosts>
>>   <includepathcontains>foo bar</includepathcontains>
>>   <includepathcontains>red blue</includepathcontains>
>> </iriset>
>>
>> Here we have either http or https. OK, in reg ex that's https? add in
>> the host and we get
>>
>> ^https?://(.*\.)?(example.com|example.org)
>
> I would strongly discourage the XSLT author from trying to get smart and
> suggest they keep it simple instead:
>
> <includeregex>(^http) | (^https)</includeregex>
> <includeregex>(^[^/]+//example.org)| (^[^/]+//example.com)</includeregex>
> <includeregex>(^[^/]+//[^/]/.*foo) | (^[^/]+//[^/]/.*bar) </includeregex>
> <includeregex>(^[^/]+//[^/]/.*red) | (^[^/]+//[^/]/.*blue) </includeregex>
>
> Easy, and straight to the point. There's lots of automatic finite-state
> combination and optimization tools out there, if the implementor needs
> to worry about efficient application of the patterns. Although,
> implementors who care about efficiency are better off directly
> implementing the extensions in the first place.
>
>> And would anyone like to hazard a bit of code that rendered this as a
>> reg ex:
>>
>> <iriset>
>>   <includeschemes>http https</includeschemes>
>>   <includehosts>example.org example.com</includehosts>
>>   <includepathcontains>foo bar</includepathcontains>
>>   <includepathcontains>red blue</includepathcontains>
>>   <excludeexactqueries>name1=value1&name2=value2
>>   </excludeexactqueries>
>> </iriset>
>
> Gladly.
> It is:
>
> <includeregex>(^http) | (^https)</includeregex>
> <includeregex>(^[^/]+//example.org)| (^[^/]+//example.com)</includeregex>
> <includeregex>(^[^/]+//[^/]/.*foo) | (^[^/]+//[^/]/.*bar) </includeregex>
> <includeregex>(^[^/]+//[^/]/.*red) | (^[^/]+//[^/]/.*blue) </includeregex>
> <includeregex>(^[^?]+?.*name1=value1)</includeregex>
> <includeregex>(^[^?]+?.*name2=value2)</includeregex>
>
> It doesn't look all that bad to me.
>
> s
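
For anyone who wants to try the patterns discussed in this thread without the web tool, here is a minimal Python sketch. It is not part of the original exchange. HOST_RE, SCHEME_RE and PATH_RE are the includehosts, includeschemes and includepathstartswith patterns from the message, with the dot in the host name escaped and the redundant backslashes in the character class dropped. IRISET and matches_all are hypothetical names illustrating the "one pattern per element, all must match" idea. The IRISET patterns are an assumed, stricter rendering of the first iriset, not the includeregex lines quoted above, and they assume "path" means everything between the host and any ? or #.

```python
import re

# Patterns from the message (host-name dot escaped; character class simplified).
HOST_RE = re.compile(r"^\w+://(\w+\.)?example\.org(:\d+)?/")
SCHEME_RE = re.compile(r"^http://")
PATH_RE = re.compile(r"^\w+://[^:/?#@ ]+/foo")

# Hypothetical rendering of the first iriset in the thread: one pattern per
# element, alternatives inside an element joined with "|".
IRISET = [
    re.compile(r"^https?://"),                                           # includeschemes
    re.compile(r"^\w+://(\w+\.)?(example\.org|example\.com)(:\d+)?/"),   # includehosts
    re.compile(r"^\w+://[^/?#]+/[^?#]*(foo|bar)"),                       # includepathcontains
    re.compile(r"^\w+://[^/?#]+/[^?#]*(red|blue)"),                      # includepathcontains
]


def matches_all(iri, patterns):
    """Conjunction: the IRI is in the set only if every pattern matches."""
    return all(p.search(iri) is not None for p in patterns)


if __name__ == "__main__":
    samples = [
        "http://example.org/",
        "http://www.example.org:45/",
        "http://www.example.org.bad.com/",
        "http://www.example.com/foo/red.html",
        "https://example.org/foo/green",
    ]
    for iri in samples:
        print(iri,
              "host:", HOST_RE.search(iri) is not None,
              "iriset:", matches_all(iri, IRISET))
```

Keeping the patterns in a list and testing them one by one mirrors the conjunction described in the quoted reply; nothing here depends on merging them into a single expression.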
Received on Wednesday, 26 March 2008 12:47:06 UTC