Re: [WAC] regexps in WebAccessControl from Phil Archer on 2012-11-22 (public-rww@w3.org from November 2012)

From: Phil Archer <phila@w3.org>
Date: Thu, 22 Nov 2012 16:07:07 +0000
To: Henry Story <henry.story@bblfish.net>
CC: Read-Write-Web <public-rww@w3.org>, nathan <nathan@webr3.org>, Ruben Verborgh <ruben.verborgh@ugent.be>, Alexandre Bertails <bertails@w3.org>, Stasinos Konstantopoulos <konstant@iit.demokritos.gr>
Message-ID: <50AE4DAB.5030509@w3.org>
Henry, everyone, some additional comments inline below.

On 21/11/2012 11:09, Henry Story wrote:
> Hi Phil,
>
>     Thanks for the very helpful overview on POWDER. From the comments earlier on this thread
> I heard people worry about full regex being
>
>    1. too complicated to parse/write
>    2. memory intensive ( a server would need to keep a cache of regexps )
>    3. dangerous if fetched off the web, as is currently possible with WebACLs
>
> So for all of the above your answer is that you have an XML syntax that is easy to write.
>
> <iriset>
>    <includehosts>example.org</includehosts>
>    <includepathstartswith>/foo</includepathstartswith>
> </iriset>
>
> whose semantics are defined as being equivalent to some RDF. The above example,
> for instance, is equivalent to
>
> @prefix powder: <http://www.w3.org/2007/05/powder-s#> .
> @prefix owl: <http://www.w3.org/2002/07/owl#> .
>
> :joesNS owl:equivalentClass [ owl:intersectionOf (
>          [ a owl:Restriction;
>            owl:onProperty powder:matchesregex;
>            owl:hasValue "(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)?(example\.org)(:([0-9]+))?\/"],
>          [ a owl:Restriction;
>            owl:onProperty powder:matchesregex;
>            owl:hasValue "(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?(\/foo)" ]
>         )
>       ] .  // :-)
>

Yes. Correct. Then you have another OWL class of things that have the
properties you want (i.e. level of authorisation or whatever) and then
you assert that :joesNS is a subclass of :bunchOfProperties.

> It is clear that the xml iri set notation could be coded very efficiently using normal programming
> tools. All programming languages have a URL class already defined, so that one could use those
> parsers directly. I imagine that if one holds oneself to the xml notation one cannot get a
> denial of service regexp ( are there such things?)

Yes. The use of regexes is obviously heavy on the processor and so 
string-based methods are significantly more efficient.
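For illustration only, here is a minimal Python sketch of what evaluating the regex form involves: a candidate IRI belongs to the set only if it matches both expressions from the OWL intersection quoted above. The function name is hypothetical, and the use of an unanchored search is an assumption, since the expressions as quoted in this thread carry no scheme prefix or anchors; a real processor would need to anchor them to avoid matching inside a path.

```python
import re

# The two expressions quoted above for host example.org and path prefix /foo.
HOST_RE = re.compile(
    r"(([^\/\?\#]*)\@)?([^\:\/\?\#\@]+\.)?(example\.org)(:([0-9]+))?\/"
)
PATH_RE = re.compile(
    r"(([^\/\?\#]*)\@)?([^\:\/\?\#\@]*)(\:([0-9]+))?(\/foo)"
)


def matches_joes_ns(iri):
    """Hypothetical helper: True only if the IRI matches BOTH expressions
    (the owl:intersectionOf above). Unanchored search is an assumption."""
    return all(p.search(iri) is not None for p in (HOST_RE, PATH_RE))
```

Every check runs two regex searches over the whole IRI, which is the cost that the string-based approach avoids.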

One thing to note - those XML elements take white space separated lists 
so you can have

<includehosts>example.com example.org</includehosts>

or, perhaps more likely in this use case

<includehosts>department1.example.com department2.example.com</includehosts>

i.e. put whatever subdomains you want in there.

This translates into logical OR which, added to the logical AND of any 
candidate URI having to match all the defined rules, is what makes it 
flexible.
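As a rough sketch of the string-based alternative, the Python below treats each element's content as a white space separated list (OR within an element) and requires every element to be satisfied (AND across elements). The function and parameter names are hypothetical, and the rule that a listed host also covers its subdomains is an assumption read off the optional subdomain group in the regex quoted earlier; this is not the normative POWDER processing model and it skips IRI canonicalization.

```python
from urllib.parse import urlsplit


def in_iri_set(iri, include_hosts, include_path_starts_with):
    """Hypothetical helper: test an IRI against two iriset constraints.

    include_hosts and include_path_starts_with are white space separated
    lists, as in the XML elements. Values within a list are ORed; the
    two lists are ANDed.
    """
    parts = urlsplit(iri)
    host = (parts.hostname or "").lower()

    hosts = [h.lower() for h in include_hosts.split()]
    prefixes = include_path_starts_with.split()

    # OR: the host is one of the listed hosts, or a subdomain of one
    # (assumption based on the optional subdomain group in the regex).
    host_ok = any(host == h or host.endswith("." + h) for h in hosts)

    # OR: the path starts with any of the listed prefixes.
    path_ok = any(parts.path.startswith(p) for p in prefixes)

    # AND: every constraint must hold.
    return host_ok and path_ok
```

For example, calling in_iri_set with "department1.example.com department2.example.com" as the host list and "/foo" as the path prefix accepts http://department2.example.com/foo/x but rejects the same path on any other host.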

>
> So as we want to be able to work with the results of the LDP group [8], we need to have
> a syntax to express your xml in Turtle. Something like this:
>
> :joesNS a p:IriSet;
>     p:includeHost "example.org";
>     p:includePathStartsWith "/foo" .
>
> I was wondering if this simple semantics is something the POWDER WG could feasibly publish.

That tells you that there is a class with those properties, yes, but 
you'd still need to make the transformation into OWL for the POWDER 
semantics. You can't treat that as being semantically equivalent to the 
OWL class you correctly gave above - it isn't.

The XML dialect *is* semantically equivalent because of the GRDDL 
transformation that is linked from the namespace document (that 
generates the OWL).

Now, of course, you can say that you do this and actually not bother, 
just take those strings and use them without all the transformation 
stuff, that's an internal matter, but it would be custom software that 
would not be conformant with the Semantic Web at large.

You could perhaps try and squeeze some square pegs into some round holes 
here. The Turtle you give is semantically equivalent to the same triples 
in RDF/XML of course, so you could presumably define an XSLT and 
associate it with the p namespace, or do the transformation directly in 
N3 or whatever. But that's taking you outside the issues you care about 
and sounds a bit iffy.

POWDER's problem from the start was it was trying to solve the aboutEach 
prefix issue in a semantically rigorous way. We got it down to just two 
properties in the end and had to define a semantic extension for them.

I'm afraid the POWDER WG no longer exists and many relevant folk have 
moved on. However, an institution that *is* actively working with 
POWDER, albeit in a different circumstance, is the National Centre for 
Scientific Research, NCSR Demokritos, in Athens. I've added Stasinos to 
the cc list here. He was the primary author of the POWDER semantics doc.

If the RWW group wants an interested W3C member to work with the CG, 
that's your best bet. Basically, *if* this group decides that POWDER
would be useful if only A, B & C were a little different, then NCSR
Demokritos and other members may want to look at the Member submission
process.

POWDER was designed for a task that no one uses it for. If this or 
another group finds elements that can be adapted for something useful - 
please do it.


> We have a couple of use cases:
>
>    A. determining groups of resources ( that can be accessed )
>    B. determining groups of users ( that can access a resource )
>
> A. groups of resources
> ---------------------
>
> I think :joesNS is a class so that one should be able to just write
>
> @prefix wac: <http://www.w3.org/ns/auth/acl#> .
>
>   [ wac:accessToClass :joesNS;
>     wac:mode wac:Read, wac:Write;
>     wac:agent <card#i>].
>
>
> B. groups of people
> -------------------
>
>   Again I think here we could have
>
> # I could not get anything simpler than this
> :coEmployees a p:IriSet;
>     p:includeRegex "^https://company.com/ppl/[^/]+#me$" .
>
>
> :coRes a p:IriSet;
>     p:includeRegex "^https://company.com/ppl/.*" .
>
>
> which should allow us to build the following Rule
>
>   [] wac:accessToClass :coRes;
>     wac:mode wac:Read;
>     acl:agentClass :coEmployees .
>

I know that's tempting but it's semantically incorrect. Stasinos is
better placed than me to answer this, but you need to close the world so
that someone else can't publish:

:coEmployees p:matchesregex "^http://badboys.evil/.*"

(aside - it's matches regex, not includes)

As above, you can create N3 rules to do whatever you want, and that might
be the right way to go. POWDER was designed partly to create as simple a
method of encoding as possible: one that could be written by a non-expert
and processed in a purely XML environment (like a browser) but could also
be processed as RDF. It was also designed to handle the awkward
semantics. GRDDL was hot at the time, so we used that as the basis for
our semantic equivalence.


Looking forward to the call tomorrow!

Cheers

Phil.




>
> So if one could have those terms defined then we would be able to use those
> and put them up on the WebAccessControl wiki page, in preparation for writing out
> a spec.
>
> 	Henry
>
>
> [8] http://www.w3.org/2012/ldp/hg/ldp.html
>
>
> On 19 Nov 2012, at 13:06, Phil Archer <phila@w3.org> wrote:
>
>> Henry, everyone, let me see what I can offer here (for the many for whom my name means nothing, I lead the work on POWDER and am indelibly associated with it).
>>
>> The problem we faced is, I think, much the same as you have here. You want something that is easy to understand, such as "everyone with a URI that begins with http://example.org/people/trusted/" but at the same time have a processable means of handling this.
>>
>> So, we created a set of XML elements that were meant to be easy to use, such as:
>>
>> includehosts
>> includepathstartswith
>> includequerycontains
>>
>> For every 'include' there's a matching 'exclude' - and we covered scheme, host, path contains, path starts with, path ends with, ports, query strings and regexes and a full URI.
>>
>> That's what we called POWDER Grouping and it has its own separate Recommendation [1]. But this is a simplification layer. Within that doc we also defined how to turn any of those 'user-friendly elements' into regular expressions, for which we provided templates that you can bet we tested and re-tested. They're not simple but they are meant to be robust (the one that lets you include query string name/value pairs in any order was a lot of fun - not). The doc also covers IRI canonicalization which is important in this space.
>>
>> You can programmatically replace any of the user-friendly terms with matchesregex (which we called POWDER-BASE) and it is *that* property (and notmatchesregex) that is the subject of POWDER's Semantic Extension [2]. The semantics of POWDER are fully defined.
>>
>> Any POWDER document (XML) can be transformed into POWDER-BASE (also XML, identical except that the only IRI set defining properties allowed are (not)matchesregex) and that can then be transformed into OWL *with the semantic extension* that allows you to run a regex against a URI - think of it as SPARQL's regex(str(URI)).
>>
>> Semantically, all 3 flavours of a POWDER document are defined as identical. Only the syntax changes.
>>
>> POWDER can define any set of URIs, no matter how complex [3]
>>
>> The domain of wdrs:matchesregex is rdfs:Resource, its range xsd:string [4] - i.e. there's no weird inferencing there.
>>
>> Although I seem to recall looking it up, I see that we didn't actually define the regex syntax we used. I can only leave it to others to answer the Java Regex issue.
>>
>> There are some POWDER tools at [5] including a grouping tester [6]. That lets you put in values for the user-friendly URI components and then test a given URI to see if it is or is not covered.
>>
>> Hope this helps?
>>
>> Shout if you need more
>>
>> Phil.
>>
>>
>> [1] http://www.w3.org/TR/powder-grouping/
>> [2] http://www.w3.org/TR/powder-formal/#regexSemantics
>> [3] http://www.w3.org/TR/powder-grouping/#conj-disj
>> [4] http://www.w3.org/2007/05/powder-s#matchesregex
>> [5] http://philarcher.org/powder/
>> [6] http://philarcher.org/cgi-bin/powder-group.cgi
>>
>>
>>
>> On 19/11/2012 11:01, Henry Story wrote:
>>> CCing Phil Archer.
>>> ( Phil the thread for this starts here:
>>>     http://lists.w3.org/Archives/Public/public-rww/2012Nov/0119.html )
>>>
>>> On 19 Nov 2012, at 02:31, Alexandre Bertails <bertails@w3.org> wrote:
>>>
>>>> On 11/18/2012 04:06 PM, Nathan wrote:
>>>>> Henry Story wrote:
>>>>>>   []  wac:accessToClass [ wac:regex "http://joe.example/blog/.*" ];
>>>>
>>>> For file matching patterns, I'd suggest not to reinvent the wheel and
>>>> use something that has existed for a long time: ant patterns [1]. It's
>>>> already defined, and the regex can be easily parsed and then compiled
>>>> down to any language specific regex.
>>>
>>> I just came across the following discussion on IRC, which seems relevant to this.
>>>
>>> <blockquote>
>>> 21:49 presbrey: bblfish, if you want to have regex we should support simple globbing too
>>> 21:50 presbrey: most users do not write /admin/.*, they write /admin/*
>>> 21:51 presbrey: also do we really want to incorporate blank nodes? this is the first proposal to do so
>>> 21:54 presbrey: such a pattern also seems to duplicate eg.
>>> 21:54 presbrey: acl:defaultForNew </admin/>
>>> 21:57 presbrey: also in this particular scenario, it costs more to compile the regex pattern than to evaluate it
>>> 21:58 presbrey: in more complex examples, the server now needs a resident regex cache
>>> 21:59 melvster: perhaps arbitrary regex could be an attack surface too depending on who has accesss
>>> 22:17 betehess would prefer to have ant style
>>> 22:23 presbrey: betehess, do you know how I can parse ant style in python or php?
>>> 22:24 presbrey: and javascript? :)
>>> 22:24 betehess: shouldn't be difficult
>>> 22:24 betehess: we'll need to define the regex grammar anyway
>>> 22:25 betehess: at the end, any language should be able to compile them down to their own native regex style
>>> 22:26 presbrey: at the end?
>>> 22:26 betehess: http://trac.mach-ii.com/machii/wiki/ANTPatternMatcher
>>> 22:26 betehess: just three wildcards
>>> 22:26 betehess: having both ** and * is pretty cool
>>> </blockquote>
>>>
>>> Yes, I can see that something less powerful than full regexes could help reduce
>>> regex-based denial of service attacks for remotely published regex rules. It is
>>> also easier for people to specify correctly.
>>>
>>> That is why POWDER already has worked on simplified groupings, by proposing an
>>> XML format for simple definitions. See for example here:
>>>
>>>    http://www.w3.org/TR/powder-grouping/#wild
>>>
>>> I think it would be nice to semanticise those higher level relations so that
>>> one can also use them directly in Turtle. Perhaps this is something we can ask
>>> the POWDER group to do, if they are still around?
>>>
>>> Henry
>>>
>>>
>>>>
>>>> Alexandre.
>>>>
>>>> [1] http://ant.apache.org/manual/dirtasks.html#patterns
>>>>
>>>>>
>>>>> What would [ wac:regex "http://joe.example/blog/.*" ] mean?
>>>>>
>>>>> Using OWL 2 we can create a datatype definition, using a datatype
>>>>> restriction, on strings and the like - but that doesn't (anywhere near)
>>>>> cover what's required here.
>>>>>
>>>>> I'm unsure how we'd actually create a Class of things based on the
>>>>> lexical form of a URI though, or even, whether it's a good idea to do so
>>>>> - we are basically saying that if a URI has a lexical form which matches
>>>>> the regular expression x, then that URI denotes something which is of
>>>>> the class y. This feels wrong.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Nathan
>>>>>
>>>>>
>>>>
>>>
>>> Social Web Architect
>>> http://bblfish.net/
>>>
>>
>> --
>>
>>
>> Phil Archer
>> W3C eGovernment
>> http://www.w3.org/egov/
>>
>> http://philarcher.org
>> +44 (0)7887 767755
>> @philarcher1
>
> Social Web Architect
> http://bblfish.net/
>

-- 


Phil Archer
W3C eGovernment
http://www.w3.org/egov/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
Received on Thursday, 22 November 2012 16:07:46 UTC