Re: Rules.CSV format as alternative to htaccess from Norman Gray on 2016-03-03 (public-perma-id@w3.org from March 2016)

From: Norman Gray <norman@astro.gla.ac.uk>
Date: Thu, 03 Mar 2016 13:00:41 +0000
To: "Stian Soiland-Reyes" <soiland-reyes@cs.manchester.ac.uk>
Cc: "Daniel Garijo" <dgarijo@fi.upm.es>, "Ian Dunlop" <ianwdunlop@gmail.com>, "Pemanent Identifier CG" <public-perma-id@w3.org>
Message-ID: <7DC44765-2E95-46D4-A960-B1FC9B8ABE6C@astro.gla.ac.uk>

Stian and all, hello.

On 3 Mar 2016, at 11:09, Stian Soiland-Reyes wrote:

> One thing that comes up when I'm looking at the purl.org redirections
> is that there's often the with and without slash variants.. e.g.
> http://purl.org/pav and http://purl.org/pav/hasVersion -- it would be
> good if these could be represented within /pav/rules.csv  rather than
> also have a line in /rules.csv  -- perhaps the special values should
> be empty string for "folder" and "." for "folder/".

That could be represented by a magic value in the "src" column -- say 
"<" or "<>" (which can't appear in a URI path segment).  Presuming that 
the implementation would not parse each rules.csv file dynamically, but 
would ingest them in a preprocessing step, the fact that this applies to 
the 'parent' path need not be a problem.
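
As a rough illustration of such a preprocessing step, here is a minimal 
Python sketch.  It assumes the rules live in a directory tree with one 
rules.csv per folder; the function name `ingest` and the shape of the 
resulting table are hypothetical, not part of any existing implementation.

```python
import csv
from pathlib import Path

def ingest(root):
    """Map each context path (e.g. "/pav") to the rows of its rules.csv.

    Hypothetical preprocessing step: every rules.csv under root is read
    once, so a "<>" row stored under "/pav" can later be applied to the
    URI "/pav" itself without consulting the parent folder's file.
    """
    root = Path(root)
    table = {}
    for rules_file in sorted(root.rglob("rules.csv")):
        rel = rules_file.parent.relative_to(root).as_posix()
        context = "" if rel == "." else "/" + rel
        with rules_file.open(newline="") as fh:
            table[context] = [row for row in csv.reader(fh) if row]
    return table
```

Because the whole tree is read up front, a lookup for the 'parent' path 
only needs the entry for that context in the table.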

> As for the existing w3id.org htaccess rules, I think any non-slash
> folder usage now is indirect through Apache's own directory matching,
> e.g. https://w3id.org/bundle works as it should by a 301 Moved
> Permanently to https://w3id.org/bundle/ which then does 302 Found to
> its final destination.

These and others could potentially be handled by a couple of lines which 
are implicitly appended to each rules.csv file.  If the logic is that 
each input URI is handled by the first row in rules.csv which has a 
match in column 1, then these can be overridden easily, but still 
provide consistent behaviour.

The (usual apache) adding-slash behaviour would be

"^(.*)$","$1/",301,

and error behaviour would be the catch-all

"^.*$","http://purl.org/admin/error.html",404,

It might also be worth specifying that the beginning and end of column 1 
are implicitly anchored with "^...$", where "^" matches at the start of 
the portion of the URI path which begins at the current path component.  
Anchoring both ends would be useful since (a) it would probably be good 
practice to anchor patterns explicitly in any case, and (b) it fits in 
with the most natural/naive reading of the list, which would have the 
first column match on path elements (i.e., principle of least surprise).
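
A small Python sketch of what implicit anchoring would mean in practice.  
It assumes column 1 holds ordinary regular expressions and uses 
`re.fullmatch` to stand in for the implicit "^...$"; the helper name 
`matches` is purely illustrative.

```python
import re

def matches(pattern: str, rest_of_path: str) -> bool:
    """True if pattern matches the whole of rest_of_path (implicit ^...$)."""
    return re.fullmatch(pattern, rest_of_path) is not None

# Anchored at both ends, "blog/(.*)/" cannot match inside "myblog/stuff/".
print(matches("blog/(.*)/", "blog/wibble/"))    # True
print(matches("blog/(.*)/", "myblog/stuff/"))   # False

# An unanchored search would have matched the second path.
print(re.search("blog/(.*)/", "myblog/stuff/") is not None)  # True
```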

Thus adapting your example, we might have

"<>","http://example.com/home",302,
"","http://example.com/home-dir","302",""
"sub/folders/allowed","http://example.com/flat.html","302",""
"sub/folders/allowed.*","http://example.com/flatter.html","302",""
"blog/(.*)/","http://example.com/blog/post/$1/","302",""

Since 302 Found would be the most typical status code for this service, 
perhaps that could be the default if the "statuscode" column is empty.
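
A sketch of how a reader of rules.csv might apply that default, assuming 
the column order src, target, statuscode.  The names `parse_rules` and 
`DEFAULT_STATUS` are made up for this example.

```python
import csv
import io

DEFAULT_STATUS = 302  # assumed default when the statuscode column is empty

def parse_rules(text):
    """Yield (src, target, status) tuples from the text of a rules.csv."""
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        src, target = row[0], row[1]
        status = row[2].strip() if len(row) > 2 else ""
        yield (src, target, int(status) if status else DEFAULT_STATUS)

example = '"<>","http://example.com/home",,\n"x","http://example.com/x","301",""\n'
for rule in parse_rules(example):
    print(rule)
```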

If this were in a directory "foo" (or, more abstractly, if this were 
being interpreted in a 'context' 'foo'), then we'd have mappings

.../foo -> http://example.com/home
.../foo/ -> http://example.com/home-dir
.../foo/sub/folders/allowed -> http://example.com/flat.html
.../foo/sub/folders/allowed/bar -> http://example.com/flatter.html
.../foo/blog/wibble/ -> http://example.com/blog/post/wibble/
.../foo/blog/wibble/woot/ -> http://example.com/blog/post/wibble/woot/
.../foo/blog/wibble/woot -> http://purl.org/admin/error.html
.../foo/myblog/stuff -> http://purl.org/admin/error.html (and not 
http://example.com/blog/post/stuff)
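
The mappings above can be reproduced with a short Python sketch.  It is 
only one possible reading of the proposal, under these assumptions: 
first matching row wins, column 1 is a regular expression matched 
against the whole remaining path, "<>" stands for the context itself, 
and only the error catch-all is implicitly appended.  The adding-slash 
row is left out, since as written it would match before the catch-all.  
The names `load` and `resolve` are hypothetical.

```python
import csv
import io
import re

RULES_CSV = '''"<>","http://example.com/home",302,
"","http://example.com/home-dir","302",""
"sub/folders/allowed","http://example.com/flat.html","302",""
"sub/folders/allowed.*","http://example.com/flatter.html","302",""
"blog/(.*)/","http://example.com/blog/post/$1/","302",""
'''

# Assumed implicit tail: just the error catch-all from earlier in this mail.
IMPLICIT_TAIL = '".*","http://purl.org/admin/error.html",404,\n'

def load(text):
    """Parse rules.csv text into (src, target, status); empty status -> 302."""
    rules = []
    for row in csv.reader(io.StringIO(text)):
        if row:
            status = row[2].strip() if len(row) > 2 else ""
            rules.append((row[0], row[1], int(status) if status else 302))
    return rules

def resolve(path, context, rules):
    """Return (target, status) from the first row matching path, or None."""
    if path == context:
        rest = None                       # ".../foo" itself: only "<>" applies
    elif path.startswith(context + "/"):
        rest = path[len(context) + 1:]    # "" for ".../foo/"
    else:
        raise ValueError("path is outside this context")
    for src, target, status in rules:
        if src == "<>":
            if rest is None:
                return target, status
            continue
        if rest is None:
            continue
        m = re.fullmatch(src, rest)       # implicit ^...$ anchoring
        if m:
            # Substitute $1, $2, ... with the captured groups.
            filled = re.sub(r"\$(\d+)", lambda g: m.group(int(g.group(1))), target)
            return filled, status
    return None

RULES = load(RULES_CSV + IMPLICIT_TAIL)
for p in ["/foo", "/foo/", "/foo/sub/folders/allowed",
          "/foo/sub/folders/allowed/bar", "/foo/blog/wibble/",
          "/foo/blog/wibble/woot/", "/foo/blog/wibble/woot",
          "/foo/myblog/stuff"]:
    print(p, "->", resolve(p, "/foo", RULES))
```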

All the best,

Norman

-- 
Norman Gray  :  https://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK

Received on Thursday, 3 March 2016 13:01:13 UTC