
Re: Problems and Opportunities at purl.org

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Mon, 29 Feb 2016 17:57:43 +0000
Message-ID: <CAPRnXtk7H_H4Xdp3BCGfR9h0yGJH=yejvp70VeswVy8+wP3m5Q@mail.gmail.com>
To: Shane McCarron <shane@aptest.com>
Cc: Permanent Identifier CG <public-perma-id@w3.org>
I think a dual approach makes sense - I don't want to dismiss the existing
.htaccess files, some of which might not fit whatever format we decide
on for those CSV files.


I would prefer the auto-updated section to be the one that is marked out, e.g.

## DO NOT MODIFY section below - auto-generated from rules.csv ##
RewriteEngine On
RewriteRule ...

## END autogenerated  ##

RewriteRule .* ...SomethingWeird [L,funny=true]
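
A minimal sketch of how a generator could refresh only the marked
section and leave everything else in the .htaccess alone. The function
names and the prefix-based marker matching are assumptions for
illustration, not what any existing script does:

```python
# Marker lines follow the example above; matching by prefix is an assumption
# so that small spacing differences in hand-edited files still match.
BEGIN_PREFIX = "## DO NOT MODIFY section below"
END_PREFIX = "## END autogenerated"
BEGIN_LINE = "## DO NOT MODIFY section below - auto-generated from rules.csv ##"
END_LINE = "## END autogenerated ##"


def find_marker(lines, prefix, start=0):
    """Return the index of the first line at or after start that begins with prefix, or -1."""
    for index in range(start, len(lines)):
        if lines[index].startswith(prefix):
            return index
    return -1


def update_generated_section(htaccess_text, generated_rules):
    """Replace the marked section with generated_rules, keeping all other lines."""
    lines = htaccess_text.splitlines()
    block = [BEGIN_LINE] + list(generated_rules) + [END_LINE]
    begin = find_marker(lines, BEGIN_PREFIX)
    if begin == -1:
        # No marked section yet: put it first so hand-written rules stay below it.
        merged = block + ([""] + lines if lines else [])
    else:
        end = find_marker(lines, END_PREFIX, begin + 1)
        if end == -1:
            # Refuse to guess where the section ends rather than clobber manual rules.
            raise ValueError("begin marker without matching end marker")
        merged = lines[:begin] + block + lines[end + 1:]
    return "\n".join(merged) + "\n"
```

Raising on a missing end marker is a deliberate choice here: a
half-marked file is more likely a manual editing mistake than something
safe to overwrite.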




There are no post-hooks on GitHub, so you would have to do the processing
at deployment time (how is it done currently? Just git pull in the
right folder? A cron job?)

You can have a GitHub web-hook to trigger something "elsewhere", or
have a Travis job with commit rights (using some Travis secrets
feature).  (It would obviously need to ignore its own commits.)
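
As a rough sketch of the "ignore its own commits" check: the job could
tag its commits with a marker and skip any push that carries it. The
marker string and the function name are hypothetical, and how the commit
message and changed file list are obtained depends on the hook or CI
system used:

```python
# Hypothetical marker the automated job would put in its own commit messages.
BOT_MARKER = "[htaccess-autogen]"


def should_regenerate(commit_message, changed_files):
    """Decide whether a pushed commit should trigger .htaccess regeneration."""
    if BOT_MARKER in commit_message:
        # The commit was made by the generator itself; reacting would loop forever.
        return False
    # Only regenerate when some rules.csv was touched.
    return any(path.split("/")[-1] == "rules.csv" for path in changed_files)
```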

On 29 February 2016 at 17:45, Shane McCarron <shane@aptest.com> wrote:
> In general I don't *hate* the idea of permitting the use of CSV files to
> drive the creation / updating of the .htaccess files.  But I would prefer
> this to be an option.  I think my mental model was that this was a one time
> migration from purl.org - after that we would just use .htaccess files as we
> have been.  But I appreciate the thought that this might be overly onerous
> for some significant number of potential users.  Editing those things is not
> for the meek!
>
> What would people think about a rule set like:
>
> 1. If there is a .htaccess file in a directory, that file can have sections
> in it that are demarked and will never be automatically modified.
> 2. If there is a rules.csv file in a directory, that file contains mapping
> rules that will update the (non-demarked) parts of the .htaccess file in the
> directory (creating the file if necessary)
>
> I haven't tried to implement this sort of github post-push processing magic
> on branches / pull requests before.  Is that even possible?
>
>
> On Mon, Feb 29, 2016 at 11:32 AM, Stian Soiland-Reyes
> <soiland-reyes@cs.manchester.ac.uk> wrote:
>>
>> On 29 February 2016 at 16:35, Shane McCarron <shane@aptest.com> wrote:
>> > The only downside to a huge top level .htaccess is the difficulty of
>> > editing / maintaining it.  Otherwise I am not concerned.  Apache .htaccess
>> > processing is efficient enough for these purposes imho.
>>
>> I guess you meant to reply to the list, so I've CCed it in.
>>
>> Another issue then is that if we are to allow editing a CSV file to
>> re-generate .htaccess (rather than a one-off move), then we have to be
>> extra careful that there aren't any other modifications to the
>> top-level .htaccess.
>>
>> I was picturing we could move to a model where you have a folder, for
>> example
>> https://github.com/perma-id/w3id.org/blob/master/cwl/
>> and then, instead of the current .htaccess there, you could have a CSV
>> file like
>>
>> https://gist.github.com/stain/c2d668b11b66948b5991
>>
>> It should be quite easy to generate the corresponding .htaccess from
>> such files - they can have some headers to warn you:
>>
>> ## DO NOT EDIT
>> RewriteEngine On
>> ## END DO NOT EDIT
>>
>>
>> I think we can still do regular expressions if they start with ^
>> (which I think is fair enough),
>>
>> and the src paths are relative to the folder you are in, so in that
>> example the one with "context" in src basically means
>> https://w3id.org/cwl/context
>>
>> The special case then is the folder itself, so either . or the empty string.
>>
>>
>>
>>
>> The Very Advanced Edition can allow full paths like /cwl/context  -
>> where the prefix from the current directory MUST match.  (or we can
>> say this is the required format, even).  This does however not work on
>> the regular expression side - as RewriteRules in a folder are relative
>> to their location (naturally). It's probably better to have a limited
>> number of options, so it's easy to validate the CSV files before
>> trying to generate the .htaccess.
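
To make the rules above concrete, here is a hedged sketch of validating
CSV rows and turning them into RewriteRule lines. Only the "src" column
is named in this thread; the "dest" column, the fixed 302 status and the
function names are assumptions. Python's re module is only used as an
approximate syntax check, since Apache uses PCRE:

```python
import csv
import io
import re


def rule_for(src, dest, status="302"):
    """Build one RewriteRule line for a folder-relative src (status is an assumption)."""
    src, dest = src.strip(), dest.strip()
    if src.startswith("/") or ".." in src or any(ch.isspace() for ch in src):
        raise ValueError(f"src must be a simple folder-relative path: {src!r}")
    if not dest or any(ch.isspace() for ch in dest):
        raise ValueError(f"dest must be a non-empty URL without spaces: {dest!r}")
    if src in ("", "."):
        # The folder itself: in a per-directory .htaccess the remaining path is empty.
        pattern = "^$"
    elif src.startswith("^"):
        # Regular expression, passed through after a rough syntax check.
        try:
            re.compile(src)
        except re.error as error:
            raise ValueError(f"invalid regular expression {src!r}: {error}")
        pattern = src
    else:
        # Literal path: escape it and anchor both ends.
        pattern = "^" + re.escape(src) + "$"
    return f"RewriteRule {pattern} {dest} [R={status},L]"


def rules_from_csv(csv_text):
    """Convert CSV text with src and dest columns into RewriteRule lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [rule_for(row["src"], row["dest"]) for row in reader]
```

Rejecting absolute paths outright corresponds to the "limited number of
options" variant; the prefix-matching variant would need one more check.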
>>
>>
>>
>> > On Mon, Feb 29, 2016 at 10:04 AM, Stian Soiland-Reyes
>> > <soiland-reyes@cs.manchester.ac.uk> wrote:
>> >>
>> >> I started
>> >> https://github.com/stain/w3id-csv
>> >>
>> >> It's quite a simple start, but it uses a CSV file like
>> >>
>> >> https://github.com/stain/w3id-csv/blob/master/purl_example.csv
>> >> which matches the schema David Wood mentioned.
>> >>
>> >> and then generates a bunch of .htaccess files.
>> >>
>> >> You can test it on a dummy install of Apache httpd with Docker - see
>> >> the README.
>> >>
>> >>
>> >> Obviously this script is quite naive for now, in that it makes a folder
>> >> for every purl.org entry, which (in addition to making loads of files)
>> >> would be a bit wrong (e.g. the purl /fred/soup.html  would make the
>> >> fred/soup.html/.htaccess which would mean an intermediate HTTP
>> >> redirect from soup.html to soup.html/) -- and I've not gone through
>> >> the different types yet to do subtree matching or the correct HTTP
>> >> redirection status code.
>> >>
>> >> So one simple improvement would be to check if the path ends with a /
>> >> in purl.org or not - and then group those entries within the parent
>> >> path so there would be a bigger .htaccess.  However I think we want to
>> >> avoid a single large top-level .htaccess for registrations like
>> >> http://purl.org/pav  without a trailing / ?
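
A small sketch of that grouping idea, assuming the input is just a list
of purl paths (the function name is hypothetical): entries ending in /
go into their own folder with an empty src, everything else goes into
the parent folder's rule list.

```python
import posixpath
from collections import defaultdict


def group_by_folder(purl_paths):
    """Map each folder to the folder-relative src entries its .htaccess would hold."""
    groups = defaultdict(list)
    for path in purl_paths:
        trimmed = path.lstrip("/")
        if trimmed == "" or trimmed.endswith("/"):
            # Registered with a trailing slash: the rule lives in that folder itself.
            folder, src = trimmed.rstrip("/"), ""
        else:
            # No trailing slash: group the entry under its parent folder.
            folder, src = posixpath.split(trimmed)
        groups[folder].append(src)
    return dict(groups)
```

With this, /fred/soup.html lands in folder "fred" with src "soup.html",
while /pav ends up under the empty folder name, i.e. the top-level
.htaccess, which is exactly the case in question.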
>> >>
>> >>
>> >> As for conflicts, this should be modified to only replace its "own"
>> >> files by having a magic "#header".
>> >>
>> >> We also talked about having a "native" CSV file approach for w3id.org
>> >> - so this could be modified then to have a better file format that we
>> >> can convert the purl.org dump into.
>> >>
>> >>
>> >>
>> >>
>> >> On 29 February 2016 at 12:29, Stian Soiland-Reyes
>> >> <soiland-reyes@cs.manchester.ac.uk> wrote:
>> >> > Yeah, let's get this going.
>> >> >
>> >> > So looking at the purl database schema we don't really need the group
>> >> > and user stuff to start with (although that could be added to the
>> >> > README).
>> >> >
>> >> > the purls table itself should be sufficient to start. We can find the
>> >> > different "type" values in the purl.org source code I think?
>> >> >
>> >> >
>> >> >
>> >> > On 29 February 2016 at 11:58, Norman Gray <norman@astro.gla.ac.uk>
>> >> > wrote:
>> >> >>
>> >> >> Greetings, all.
>> >> >>
>> >> >> A little while ago (and this message is a reply to
>> >> >> <https://lists.w3.org/Archives/Public/public-perma-id/2015Dec/0001.html>,
>> >> >> to resuscitate the thread), there was some interest expressed in a
>> >> >> purl.org successor.  That thread ended on a positive note, with David
>> >> >> Wood and some others having access to the schema, and OCLC apparently
>> >> >> keen on passing forward the current repository.
>> >> >>
>> >> >> I was asked about purl.org by a colleague today, and this reminded me
>> >> >> about last November/December's thread: is there any news about purl.org
>> >> >> or the broader preservation plan, that can be passed on?  Or is there
>> >> >> any way that I or others could help with this?
>> >> >>
>> >> >>
>> >> >> All the best,
>> >> >>
>> >> >> Norman
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Norman Gray  :  https://nxg.me.uk
>> >> >> SUPA School of Physics and Astronomy, University of Glasgow, UK
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Stian Soiland-Reyes, eScience Lab
>> >> > School of Computer Science
>> >> > The University of Manchester
>> >> > http://soiland-reyes.com/stian/work/
>> >> > http://orcid.org/0000-0001-9842-9718
>> >
>> >
>> >
>> > --
>> > Shane McCarron
>> > Managing Director, Applied Testing and Technology, Inc.



-- 
Stian Soiland-Reyes, eScience Lab
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/    http://orcid.org/0000-0001-9842-9718
Received on Monday, 29 February 2016 17:58:34 UTC
