Re: Musings on resource grouping

Use of regex is one of those questions of taste, I suppose. And a discussion
about its merits or otherwise has the risk of becoming akin to a discussion
as to whether it is "better" to drive on the right hand side or the left
hand side of the road (left of course).

That said ... I think that the syntax that is used should be measured
against some requirements (which would preferably be stated requirements)
like minimising the possibility of error, simplicity etc.

In his original post (copied below), Phil points out that the expression
example.org$ prevents a match by example.org.phishing.com. However, it seems
to me that this is quite an error-prone mechanism: it is likely that many
scoping statements would omit the $ and hence be open to abuse.

I think most people understand that a domain name is naturally matched from
the right, and the potential for error is reduced if the match pattern
"example.com" means an exact or sub-domain match, and does not match
xxxexample.com, example.com.phishing.com etc.
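
To make the suggestion concrete, here is a minimal sketch of a rightmost, label-by-label match. It is only an illustration of the idea, not something proposed by the WG; the function name host_in_domain is hypothetical.

```python
def host_in_domain(host: str, domain: str) -> bool:
    """Return True if host equals domain or is a sub-domain of it.

    Labels are compared from the right, so the pattern "example.com"
    never matches "xxxexample.com" or "example.com.phishing.com".
    """
    host_labels = host.lower().rstrip(".").split(".")
    domain_labels = domain.lower().rstrip(".").split(".")
    if len(host_labels) < len(domain_labels):
        return False
    return host_labels[-len(domain_labels):] == domain_labels


for candidate in ("example.com", "www.example.com",
                  "xxxexample.com", "example.com.phishing.com"):
    print(candidate, host_in_domain(candidate, "example.com"))
```

With this approach the plain pattern covers the common case and needs no special signifier.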

As a design rule, it seems to me that the simplest expression should be used
for the commonest use case, rather than demanding that the commonest use
case employs special signifiers.

The use case for matching a sub-domain on its own, or for matching arbitrary
domains ending in a particular string, seems to me quite unlikely. From this
perspective, regular expressions do not appear to meet the requirements of
simplicity, safety and fitness for purpose ...

Jo 

===


The method by which we can group resources is a key part of what the 
POWDER WG is trying to define.

The following few lines of RDF/XML indicate the beginnings of one 
possible approach but also throw up a lot of questions so I wanted to 
put this in the public domain. Comment is very welcome on this - 
absolutely none of it is set in stone!


1  <wdr:Scope>
2    <wdr:hasScheme>^http$</wdr:hasScheme>
3    <wdr:hasHost>example.org$</wdr:hasHost>
4    <wdr:hasIP>213.249.189.194</wdr:hasIP>

5    <wdr:hasPath>foo</wdr:hasPath>
6    <wdr:hasPath>bar</wdr:hasPath>

7    <wdr:hasProperty>
8      <wdr:Property>
9        <ex:colour>red</ex:colour>
10     </wdr:Property>
11   </wdr:hasProperty>

12   <wdr:propLookUp rdf:resource="http://sparql.example.com" />

13   <wdr:hasNotURI>http://www.example.org/foo/bar.png</wdr:hasNotURI>
14 </wdr:Scope>


The basic idea of an RDF Class containing the definition of the Scope 
seems straightforward enough?*

For a given URI, we wish to find out whether the resource to which it 
resolves is in scope or not. So first split it up into its component 
parts and then do some pattern matching using (Perl 5) regular expressions.
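
As a rough sketch of the splitting step, Python's standard urllib.parse can stand in for whatever a POWDER processor would actually use. The helper name uri_components and the choice of returned keys are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def uri_components(uri: str) -> dict:
    """Split a URI into the parts that the scope elements are matched against."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "path": parts.path,
    }


print(uri_components("http://www.example.org/foo/bar.png"))
```

Each component can then be tested separately against the corresponding pattern.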

Line 2 uses a regular expression to indicate that the scope applies to 
resources fetched using HTTP. The caret and dollar sign require an exact 
match so that, for example, HTTPS is not in scope (^https?$ would cover 
exactly either HTTP or HTTPS).
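
The effect of the anchors can be checked with a few lines of Python, whose re module behaves like Perl 5 for patterns this simple. The helper scheme_matches is a hypothetical name used only for this sketch.

```python
import re


def scheme_matches(pattern: str, scheme: str) -> bool:
    """Search (not full-match) the scheme, so anchoring is up to the pattern."""
    return re.search(pattern, scheme) is not None


print(scheme_matches(r"^http$", "http"))     # True
print(scheme_matches(r"^http$", "https"))    # False
print(scheme_matches(r"^https?$", "https"))  # True
print(scheme_matches(r"http", "https"))      # True: unanchored pattern
```

The last call shows what happens if the anchors are left off: the pattern http then also admits HTTPS.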

Line 3 uses a similar approach to define the scope as being resources on 
the example.org domain or any subdomain thereof (if you want to restrict 
it specifically to example.org, put a caret in front of it).

Importantly, the dollar sign at the end avoids example.org.phishing.com 
being in scope.
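
The following sketch compares a few host patterns against sample host names, again using Python's re module as a stand-in for Perl 5. The host names other than example.org.phishing.com, the helper host_matches, and the stricter final pattern are assumptions added for illustration; they are not part of the proposal.

```python
import re

hosts = ["example.org", "www.example.org",
         "example.org.phishing.com", "xxxexample.org"]

patterns = [
    r"example.org",           # no anchors at all
    r"example.org$",          # as in line 3
    r"^example.org$",         # restricted to the bare domain
    r"(^|\.)example\.org$",   # hypothetical stricter form: domain or sub-domain only
]


def host_matches(pattern: str, host: str) -> bool:
    """True if the pattern is found anywhere in the host name."""
    return re.search(pattern, host) is not None


for pattern in patterns:
    matched = [h for h in hosts if host_matches(pattern, h)]
    print(pattern, "->", matched)
```

Run as written, the unanchored pattern accepts all four hosts, and example.org$ rejects example.org.phishing.com but still accepts xxxexample.org, because nothing requires a label boundary before "example".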

Line 4 restricts the scope to resources delivered from the given IP 
address. This could be given as an IP range. Useful for a large-scale CMS 
that generates numeric URIs with no easy pattern matching ability??
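
One way to test an address against a single IP or a range is sketched below with Python's standard ipaddress module. Writing the range in CIDR notation is an assumption, since the syntax for ranges is not settled here, and ip_in_scope is a hypothetical helper name.

```python
import ipaddress


def ip_in_scope(address: str, allowed: str) -> bool:
    """True if address equals the allowed IP or falls inside the allowed CIDR range."""
    network = ipaddress.ip_network(allowed, strict=False)
    return ipaddress.ip_address(address) in network


print(ip_in_scope("213.249.189.194", "213.249.189.194"))     # True
print(ip_in_scope("213.249.189.200", "213.249.189.192/28"))  # True
print(ip_in_scope("213.249.190.1", "213.249.189.192/28"))    # False
```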

Lines 5 and 6 define two elements that must be in the path if a resource 
is to be in scope. The intention is that, as with all elements here, 
these should be combined using logical AND. If logical OR is required, 
they can be presented readily in a single RegExp (foo|bar).
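
A small sketch of the AND/OR distinction, with Python's re module standing in for Perl 5. The function path_in_scope and the sample paths are hypothetical.

```python
import re


def path_in_scope(path: str, patterns: list) -> bool:
    """Every hasPath pattern must be found in the path (logical AND)."""
    return all(re.search(p, path) for p in patterns)


print(path_in_scope("/foo/bar.png", ["foo", "bar"]))  # True: both present
print(path_in_scope("/foo/baz.png", ["foo", "bar"]))  # False: "bar" missing
print(path_in_scope("/foo/baz.png", ["foo|bar"]))     # True: OR in one pattern
```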

Lines 7 - 11 are an attempt to handle scoping by property. POWDER would 
provide a framework for properties to be used in this way but mustn't 
step over the line to define what kind of properties should be used.

In line 8 a Property Class is defined.

Line 9 provides an example to say that a resource must have the property 
of having the colour red.

Line 12 is intended to indicate that you can find out whether the 
resource is red by sending a SPARQL query to http://sparql.example.com. 
Such provision would be optional since it must cover several use cases:

  - where the content provider is making DRs available and is able to 
provide a look up data table for its resources to facilitate grouping.

  - where the content provider is unable to provide such data and wishes 
to state that the Description only applies to resources that are red - 
and you have to fetch the resources to find this out.

  - where a third party is providing DRs and is making an assertion that 
is only true of red resources.

For example, they may wish to say that "all documents written in red ink 
are really hard to read on Mars." In such cases, the assertion remains 
consistent with or without the look up table/service.

Is this approach workable? Should we demand SPARQL or make it more 
generic? In which case we may need something more complex like:

<wdr:hasPropLookUp>
   <wdr:PropLookUp>
     <wdr:propLookUpURI rdf:resource="http://sparql.example.com" />
     <wdr:propLookUpType
       rdf:resource="http://www.w3.org/TR/rdf-sparql-query/" />
   </wdr:PropLookUp>
</wdr:hasPropLookUp>

This is more flexible and extensible but it means that a "generic POWDER 
processor" couldn't be built since it would have to deal with an 
unbounded number of mechanisms for retrieving property data.

Specifying SPARQL may limit usefulness for some? Where SPARQL is used, 
should we actually embed the SPARQL query?

Back to the original example, line 13 simply states that 
http://www.example.org/foo/bar.png is not in scope, despite it meeting 
the other criteria. This serves to exemplify the idea of simply listing 
URIs as being in/out of scope, and of including negation for all elements.
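
Putting the pieces together, the sketch below evaluates a URI against the scheme, host, path and hasNotURI elements of the example. It is only an illustration under stated assumptions: the Scope dataclass and in_scope function are hypothetical, Python's re module stands in for Perl 5, and the IP and property elements are left out.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Scope:
    scheme: str                                   # pattern for hasScheme
    host: str                                     # pattern for hasHost
    paths: list = field(default_factory=list)     # hasPath patterns, ANDed
    not_uris: set = field(default_factory=set)    # hasNotURI exclusions


def in_scope(uri: str, scope: Scope) -> bool:
    """An explicitly excluded URI is out of scope even if everything else matches."""
    if uri in scope.not_uris:
        return False
    parts = urlsplit(uri)
    return (
        re.search(scope.scheme, parts.scheme) is not None
        and re.search(scope.host, parts.hostname or "") is not None
        and all(re.search(p, parts.path) for p in scope.paths)
    )


scope = Scope(
    scheme=r"^http$",
    host=r"example.org$",
    paths=["foo", "bar"],
    not_uris={"http://www.example.org/foo/bar.png"},
)

print(in_scope("http://www.example.org/foo/bar.html", scope))   # True
print(in_scope("http://www.example.org/foo/bar.png", scope))    # False: excluded
print(in_scope("https://www.example.org/foo/bar.html", scope))  # False: scheme
```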

As I said at the top - comments welcome.

Phil.


* As discussed in the WCL-XG, it would, of course, be perfectly possible 
to encode Scope using another format, such as XML. This is still being 
considered by the WG - we could point to an XML literal from the RDF 
graph, for example.


-- 
Phil Archer
Chief Technical Officer,
Family Online Safety Institute
w. http://www.fosi.org/people/philarcher/

Received on Tuesday, 27 March 2007 09:24:45 UTC