Re: Musings on resource grouping from Phil Archer on 2007-03-27 (public-powderwg@w3.org from March 2007)

From: Phil Archer <parcher@icra.org>
Date: Tue, 27 Mar 2007 17:09:57 +0100
To: public-powderwg@w3.org
Message-ID: <460941D5.3050202@icra.org>
You're right Jo, personal taste comes into this. Personally, I like 
regular expressions because you can do whizzo things with them and 
simple ones, like example.com$, are not hard to master.

I should also say that I don't foresee many people writing DRs by hand - 
we need tools for this.

But there's another point here. You say "I think most people understand 
that the natural order for a domain name is a rightmost match" - and of 
course you're right. But it's that clause "most people understand" that 
is critical. Computers are, of course, stupid and need to be told the 
simplest things. So, I could write this bit of XML:

<Scope xmlns="http://blah">
   <host>
     <match name="example.mobi"/>
   </host>
</Scope>

And publish supplementary information that says "unless told otherwise 
you should match the host right-wise." Or you can express in your DTD 
that the range of host is a Perl 5 Regular Expression and leave the data as

<Scope>
   <host>
     <match name="example.mobi$" />
   </host>
</Scope>

That said (you know I'd have to say that) I do, of course, take the 
point about ease of use and reducing the opportunities for errors as far 
as possible.
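
A minimal sketch of the two options above, in Python. Its re module is used here as a stand-in for Perl 5 regular expressions, which is an assumption that holds for patterns this simple. The function names are hypothetical, and "right-wise" is read as "the host itself or any sub-domain of it". The last two hosts show where the options diverge.

```python
import re

def regex_host_match(pattern: str, host: str) -> bool:
    """Option 2: the pattern is a regular expression searched for within the host."""
    return re.search(pattern, host) is not None

def rightwise_host_match(name: str, host: str) -> bool:
    """Option 1: plain name, matched right-wise on whole labels
    (the host is the name itself or a sub-domain of it)."""
    host, name = host.lower(), name.lower()
    return host == name or host.endswith("." + name)

hosts = [
    "example.mobi",
    "www.example.mobi",
    "example.mobi.phishing.com",
    "notexample.mobi",
]
for host in hosts:
    print(host,
          regex_host_match("example.mobi$", host),
          rightwise_host_match("example.mobi", host))
```

In this sketch the regular expression example.mobi$ also accepts notexample.mobi, and dropping the dollar sign lets example.mobi.phishing.com through. A stricter pattern such as (^|\.)example\.mobi$ closes both gaps, at the cost of being harder to write by hand.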

Phil.

Jo Rabin wrote:
> Use of regex is one of those questions of taste, I suppose. And a discussion
> about its merits or otherwise has the risk of becoming akin to a discussion
> as to whether it is "better" to drive on the right hand side or the left
> hand side of the road (left of course).
> 
> That said ... I think that the syntax that is used should be measured
> against some requirements (which would preferably be stated requirements)
> like minimising the possibility of error, simplicity etc.
> 
> In his original post (copied below), Phil points out that the expression
> example.org$ prevents a match by example.org.phishing.com. However, it seems
> to me that this is quite an error prone mechanism and that it is likely that
> many scoping statements would omit the $ and hence potentially be open to
> abuse.
> 
> I think most people understand that the natural order for a domain name is a
> rightmost match, and the potential for error is reduced if the match pattern
> "example.com" means an exact and sub-domain match, and does not mean
> xxxexample.com, example.com.phishing.com etc. 
> 
> As a design rule, it seems to me that the simplest expression should be used
> for the commonest use case, rather than demanding that the commonest use
> case employs special signifiers.
> 
> The use case for matching a sub-domain on its own, or matching random
> domains ending in a particular string is actually quite unlikely it seems to
> me. From this perspective use of regular expressions would not appear to fit
> requirements of simplicity, safety and fitness for purpose ...
> 
> Jo 
> 
> ===
> 
> 
> The method by which we can group resources is a key part of what the 
> POWDER WG is trying to define.
> 
> The following few lines of RDF/XML indicate the beginnings of one 
> possible approach but also throw up a lot of questions so I wanted to 
> put this in the public domain. Comment is very welcome on this - 
> absolutely none of it is set in stone!
> 
> 
> 1  <wdr:Scope>
> 2    <wdr:hasScheme>^http$</wdr:hasScheme>
> 3    <wdr:hasHost>example.org$</wdr:hasHost>
> 4    <wdr:hasIP>213.249.189.194</wdr:hasIP>
> 
> 5    <wdr:hasPath>foo</wdr:hasPath>
> 6    <wdr:hasPath>bar</wdr:hasPath>
> 
> 7    <wdr:hasProperty>
> 8      <wdr:Property>
> 9        <ex:colour>red</ex:colour>
> 10     </wdr:Property>
> 11   </wdr:hasProperty>
> 
> 12   <wdr:propLookUp rdf:resource="http://sparql.example.com" />
> 
> 13   <wdr:hasNotURI>http://www.example.org/foo/bar.png</wdr:hasNotURI>
> 14 </wdr:Scope>
> 
> 
> The basic idea of an RDF Class containing the definition of the Scope 
> seems straightforward enough?*
> 
> For a given URI, we wish to find out whether the resource to which it 
> resolves is in scope or not. So first split it up into its component 
> parts and then do some pattern matching using (Perl 5) regular expressions.
> 
> Line 2 uses a regular expression to indicate that the scope applies to 
> resources fetched using HTTP. The caret and dollar sign require an exact 
> match so that, for example, HTTPS is not in scope (^https?$ would cover 
> exactly either HTTP or HTTPS).
> 
> Line 3 uses a similar approach to define the scope as being resources on 
> the example.org domain or any subdomain thereof (if you want to restrict 
> it specifically to example.org, put a caret in front of it).
> 
> Importantly, the dollar sign at the end avoids example.org.phishing.com 
> being in scope.
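
The scheme and host tests described for lines 2 and 3 can be sketched as follows. This is an illustration only: it assumes Python's re module behaves like Perl 5 for these patterns, uses urllib.parse.urlsplit to split the URI into its component parts, and the function name is hypothetical.

```python
import re
from urllib.parse import urlsplit

def scheme_and_host_in_scope(uri: str, scheme_re: str, host_re: str) -> bool:
    """Split the URI, then require BOTH the scheme and the host to match."""
    parts = urlsplit(uri)
    host = parts.hostname or ""  # hostname is None when the URI has no authority
    return (re.search(scheme_re, parts.scheme) is not None
            and re.search(host_re, host) is not None)

# Patterns taken from lines 2 and 3 of the example.
print(scheme_and_host_in_scope("http://www.example.org/foo", "^http$", "example.org$"))
print(scheme_and_host_in_scope("https://www.example.org/foo", "^http$", "example.org$"))
print(scheme_and_host_in_scope("http://example.org.phishing.com/", "^http$", "example.org$"))
```

Swapping the scheme pattern for ^https?$ brings the HTTPS URI into scope, and prefixing the host pattern with a caret (^example.org$) excludes www.example.org.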
> 
> Line 4 restricts the scope to resources delivered from the given IP 
> address. This could be given as an IP range. Useful for large scale CMS 
> that generates numeric URIs with no easy pattern matching ability??
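
One way an IP range test could look, as a rough sketch. The proposal does not say how a range would be written, so CIDR notation is an assumption here, as is the function name. The sketch uses Python's standard ipaddress module and takes the already-resolved address as input rather than doing a DNS lookup.

```python
import ipaddress

def ip_in_scope(resolved_ip: str, scope_ip: str) -> bool:
    """True if resolved_ip falls within scope_ip, which may be a single
    address or (by assumption) a CIDR range such as 213.249.189.192/29."""
    network = ipaddress.ip_network(scope_ip)  # a bare address becomes a /32 network
    return ipaddress.ip_address(resolved_ip) in network

print(ip_in_scope("213.249.189.194", "213.249.189.194"))
print(ip_in_scope("213.249.189.194", "213.249.189.192/29"))
print(ip_in_scope("213.249.189.200", "213.249.189.192/29"))
```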
> 
> Lines 5 and 6 define two elements that must be in the path if a resource 
> is to be in scope. The intention is that, as with all elements here, 
> these should be combined using logical AND. If logical OR is required, 
> they can be presented readily in a single RegExp (foo|bar).
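
The AND/OR distinction for path elements can be illustrated with a short sketch. It assumes each hasPath value is a regular expression searched for anywhere in the URI's path; the function name is hypothetical and Python's re module stands in for Perl 5.

```python
import re
from urllib.parse import urlsplit

def path_in_scope(uri: str, path_patterns: list[str]) -> bool:
    """Every pattern must be found in the path (logical AND across elements)."""
    path = urlsplit(uri).path
    return all(re.search(p, path) is not None for p in path_patterns)

# Two separate elements: both foo AND bar must appear in the path.
print(path_in_scope("http://www.example.org/foo/bar.png", ["foo", "bar"]))
print(path_in_scope("http://www.example.org/foo/baz.png", ["foo", "bar"]))

# One element holding an alternation: foo OR bar is enough.
print(path_in_scope("http://www.example.org/foo/baz.png", ["foo|bar"]))
```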
> 
> Lines 7 - 11 are an attempt to handle scoping by property. POWDER would 
> provide a framework for properties to be used in this way but mustn't 
> step over the line to define what kind of properties should be used.
> 
> In line 8 a Property Class is defined.
> 
> Line 9 provides an example to say that a resource must have the property 
> of having the colour red.
> 
> Line 12 is intended to indicate that you can find out whether the 
> resource is red by sending a SPARQL query to http://sparql.example.com. 
> Such provision would be optional since it must cover several use cases:
> 
>   - where the content provider is making DRs available and is able to 
> provide a look up data table for its resources to facilitate grouping.
> 
>   - where the content provider is unable to provide such data and wishes 
> to state that the Description only applies to resources that are red - 
> and you have to fetch the resources to find this out.
> 
> - where a third party is providing DRs and is making an assertion that 
> is only true of red resources.
> 
> For example, they may wish to say that "all documents written in red ink 
> are really hard to read on Mars." In such cases, the assertion remains 
> consistent with or without the look up table/service.
> 
> Is this approach workable? Should we demand SPARQL or make it more 
> generic? In which case we may need something more complex like:
> 
> <wdr:hasPropLookUp>
>    <wdr:PropLookUp>
>      <wdr:propLookUpURI rdf:resource="http://sparql.example.com" />
>      <wdr:propLookUpType 
> rdf:resource="http://www.w3.org/TR/rdf-sparql-query/" />
>    </wdr:PropLookUp>
> </wdr:hasPropLookUp>
> 
> This is more flexible and extensible but it means that a "generic POWDER 
> processor" couldn't be built since it would have to deal with an 
> unbounded number of mechanisms for retrieving property data.
> 
> Specifying SPARQL may limit usefulness for some? Where SPARQL is used, 
> should we actually embed the SPARQL query?
> 
> Back to the original example, line 13 simply states that 
> http://www.example.org/foo/bar.png is not in scope, despite it meeting 
> the other criteria. This serves to exemplify the idea of simply listing 
> URIs as being in/out of scope, and of including negation for all elements.
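
Putting the pieces together, a scope check with an explicit exclusion list might be sketched like this. The Scope class and its field names are hypothetical and only mirror the scheme, host, path and hasNotURI elements of the example. The IP and property tests are left out, and excluded URIs are compared as exact strings, which is an assumption.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlsplit

@dataclass
class Scope:
    scheme: str = ""                               # empty pattern matches anything
    host: str = ""
    paths: list = field(default_factory=list)      # ANDed together
    not_uris: set = field(default_factory=set)     # explicitly out of scope

    def contains(self, uri: str) -> bool:
        # Negation wins: a listed URI is out of scope whatever else matches.
        if uri in self.not_uris:
            return False
        parts = urlsplit(uri)
        checks = [(self.scheme, parts.scheme), (self.host, parts.hostname or "")]
        checks += [(p, parts.path) for p in self.paths]
        return all(re.search(pattern, value) is not None for pattern, value in checks)

scope = Scope(
    scheme="^http$",
    host="example.org$",
    paths=["foo", "bar"],
    not_uris={"http://www.example.org/foo/bar.png"},
)
print(scope.contains("http://www.example.org/foo/bar.html"))
print(scope.contains("http://www.example.org/foo/bar.png"))
```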
> 
> As I said at the top - comments welcome.
> 
> Phil.
> 
> 
> * As discussed in the WCL-XG, it would, of course, be perfectly possible 
> to encode Scope using another format, such as XML. This is still being 
> considered by the WG - we could point to an XML literal from the RDF 
> graph, for example.
> 
>
Received on Tuesday, 27 March 2007 16:10:30 UTC