Re: Problems and Opportunities at purl.org from Shane McCarron on 2016-02-29 (public-perma-id@w3.org from February 2016)

From: Shane McCarron <shane@aptest.com>
Date: Mon, 29 Feb 2016 12:07:35 -0600
To: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Cc: Permanent Identifier CG <public-perma-id@w3.org>
Message-ID: <CAOk_reE8VM_naNOswWUFA1PKP8uFnvW+fTW599OBMGT-radsPw@mail.gmail.com>
Travis with commit rights seems like the right level of granularity, and I
am good with the DO NOT MODIFY block.  With the caveat that there MUST be
such a block in place in order for rules.csv to be processed at all?  That
way there is no room for error.
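
A minimal sketch of that caveat, assuming the marker lines Stian proposes in the quoted message below. The function name and the idea of passing in already-generated rule lines are illustrative assumptions, not the existing update script. If the block is missing, nothing is processed:

```python
# Marker lines copied from the proposal quoted below.
BEGIN = "## DO NOT MODIFY section below - auto-generated from rules.csv ##"
END = "## END autogenerated  ##"


def replace_generated_block(htaccess_text, generated_lines):
    """Swap the content between the markers; refuse if the block is absent.

    Everything outside the markers (hand-written rules) is kept untouched.
    """
    lines = htaccess_text.splitlines()
    stripped = [line.strip() for line in lines]
    # The block MUST be present exactly once, otherwise refuse to process.
    if stripped.count(BEGIN) != 1 or stripped.count(END) != 1:
        raise ValueError("expected exactly one auto-generated block")
    start = stripped.index(BEGIN)
    end = stripped.index(END)
    if end < start:
        raise ValueError("END marker appears before BEGIN marker")
    new_lines = lines[:start + 1] + list(generated_lines) + lines[end:]
    return "\n".join(new_lines) + "\n"
```

A .htaccess without the markers raises ValueError, so a hand-maintained file is never rewritten by accident.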

As to how it is done now... there is a script that gets run to do an
update, but I don't know how it is triggered.  @davidlehn do you remember?

On Mon, Feb 29, 2016 at 11:57 AM, Stian Soiland-Reyes
<soiland-reyes@cs.manchester.ac.uk> wrote:

> I think a dual approach makes sense - I don't want to dismiss the existing
> .htaccess files, some of which might not fit whatever format we decide
> on for those CSV files.
>
>
> I would prefer the auto-updated section to be the one that is marked out,
> e.g.
>
> ## DO NOT MODIFY section below - auto-generated from rules.csv ##
> RewriteEngine On
> RewriteRule ...
>
> ## END autogenerated  ##
>
> RewriteRule .* ...SomethingWeird [L,funny=true]
>
>
>
>
> There are no post-hooks on GitHub; you would have to do the processing
> at deployment time (how is it done currently? Just a git pull in the
> right folder? A cron job?)
>
> You can have a GitHub web-hook to trigger something "elsewhere", or
> have a Travis job with commit rights (using some Travis secrets
> feature).  (It would obviously need to ignore just its own commits)
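
The "ignore its own commits" check could be sketched as follows. The bot address and the function name are hypothetical, and how the CI job obtains the commit author and the list of changed files is deliberately left out:

```python
import posixpath

# Hypothetical identity used by the automated job for its own commits.
BOT_EMAIL = "w3id-bot@example.org"


def should_regenerate(commit_author_email, changed_files):
    """Decide whether a push should trigger .htaccess regeneration."""
    # Skip commits made by the job itself, otherwise it would loop forever.
    if commit_author_email.strip().lower() == BOT_EMAIL:
        return False
    # Only act when some rules.csv was touched by the commit.
    return any(posixpath.basename(path) == "rules.csv" for path in changed_files)
```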
>
> On 29 February 2016 at 17:45, Shane McCarron <shane@aptest.com> wrote:
> > In general I don't *hate* the idea of permitting the use of CSV files to
> > drive the creation / updating of the .htaccess files.  But I would prefer
> > this to be an option.  I think my mental model was that this was a one-time
> > migration from purl.org - after that we would just use .htaccess files as
> > we have been.  But I appreciate the thought that this might be overly
> > onerous for some significant number of potential users.  Editing those
> > things is not for the meek!
> >
> > What would people think about a rule set like:
> >
> > 1. If there is a .htaccess file in a directory, that file can have
> > sections in it that are demarcated and will never be automatically
> > modified.
> > 2. If there is a rules.csv file in a directory, that file contains
> > mapping rules that will update the (non-demarcated) parts of the
> > .htaccess file in the directory (creating the file if necessary).
> >
> > I haven't tried to implement this sort of GitHub post-push processing
> > magic on branches / pull requests before.  Is that even possible?
> >
> >
> > On Mon, Feb 29, 2016 at 11:32 AM, Stian Soiland-Reyes
> > <soiland-reyes@cs.manchester.ac.uk> wrote:
> >>
> >> On 29 February 2016 at 16:35, Shane McCarron <shane@aptest.com> wrote:
> >> > The only downside to a huge top level .htaccess is the difficulty of
> >> > editing
> >> > / maintaining it.  Otherwise I am not concerned.  Apache .htaccess
> >> > processing is efficient enough for these purposes imho.
> >>
> >> I guess you meant to reply to the list, so I've CCed it in.
> >>
> >> Another issue then is if we are to allow editing a CSV file to
> >> re-generate .htaccess (rather than a one-off move), then we have to
> >> be extra careful that there aren't any other modifications to the
> >> top-level .htaccess.
> >>
> >> I was picturing we could move to a model where you have a folder, for
> >> example
> >> https://github.com/perma-id/w3id.org/blob/master/cwl/
> >> and instead of the current .htaccess there, you could have a CSV file
> >> like
> >>
> >> https://gist.github.com/stain/c2d668b11b66948b5991
> >>
> >> It should be quite easy to generate the corresponding .htaccess from
> >> such files - they can have some headers to warn you:
> >>
> >> ## DO NOT EDIT
> >> RewriteEngine On
> >> ## END DO NOT EDIT
> >>
> >>
> >> I think we can still do regular expressions, if they start with ^
> >> (which I think is fair enough),
> >>
> >> and the src paths are relative to the folder you are in, so in that
> >> example the one with "context" in src basically means
> >> https://w3id.org/cwl/context
> >>
> >> Special case then is for the folder itself, so either . or empty string.
> >>
> >>
> >>
> >>
> >> The Very Advanced Edition can allow full paths like /cwl/context  -
> >> where the prefix from the current directory MUST match.  (or we can
> >> say this is the required format, even).  This does however not work on
> >> the regular expression side - as RewriteRules in a folder are relative
> >> to their location (naturally). It's probably better to have a limited
> >> number of options, so it's easy to validate the CSV files before
> >> trying to generate the .htaccess.
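
The src conventions described above could be validated and turned into rules roughly as below. This is only a sketch: the two-column (src, target) shape, the default folder "cwl", the 302 status and the function name are assumptions, and Python's `re` is used only as a rough syntax check since Apache uses PCRE:

```python
import re


def csv_row_to_rule(src, target, folder="cwl", status=302):
    """Turn one (src, target) CSV row into a per-directory RewriteRule line."""
    if any(ch.isspace() for ch in src + target) or not target:
        raise ValueError("src/target must be non-empty and free of whitespace")
    if src.startswith("^"):
        # Regular expression, relative to the folder the .htaccess lives in.
        re.compile(src)  # rough sanity check; raises re.error if malformed
        pattern = src
    else:
        if src.startswith("/"):
            # Full path: the prefix from the current directory MUST match.
            prefix = "/" + folder
            if src != prefix and not src.startswith(prefix + "/"):
                raise ValueError("full path %r is outside /%s" % (src, folder))
            src = src[len(prefix):].lstrip("/")
        # "." or the empty string means the folder itself.
        pattern = "^$" if src in ("", ".") else "^" + re.escape(src) + "$"
    return "RewriteRule %s %s [R=%d,L]" % (pattern, target, status)
```

For instance, the rows `context` and `/cwl/context` produce the same rule, while `/other/context` is rejected before any .htaccess is written.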
> >>
> >>
> >>
> >> > On Mon, Feb 29, 2016 at 10:04 AM, Stian Soiland-Reyes
> >> > <soiland-reyes@cs.manchester.ac.uk> wrote:
> >> >>
> >> >> I started
> >> >> https://github.com/stain/w3id-csv
> >> >>
> >> >> It's quite a simple start, but it uses a CSV file like
> >> >>
> >> >> https://github.com/stain/w3id-csv/blob/master/purl_example.csv
> >> >> which matches the schema David Wood mentioned.
> >> >>
> >> >> and then generates a bunch of .htaccess files.
> >> >>
> >> >> You can test it on a dummy install of Apache httpd with Docker - see
> >> >> the README.
> >> >>
> >> >>
> >> >> Obviously this script is currently quite naive in that it makes a
> >> >> folder for every purl.org entry, which (in addition to making loads
> >> >> of files) would be a bit wrong: e.g. the purl /fred/soup.html would
> >> >> make the file fred/soup.html/.htaccess, which would mean an
> >> >> intermediate HTTP redirect from soup.html to soup.html/. I've also
> >> >> not gone through the different types yet to do subtree matching or
> >> >> the correct HTTP redirection status code.
> >> >>
> >> >> So one simple improvement would be to check if the path ends with a /
> >> >> in purl.org or not - and then group those entries within the parent
> >> >> path so there would be a bigger .htaccess.  However I think we want
> >> >> to avoid a single large top-level .htaccess for registrations like
> >> >> http://purl.org/pav  without a trailing / ?
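
The grouping idea could look roughly like this. It is a sketch, not the w3id-csv code, and the function name is an assumption. Note that it places top-level entries such as /pav under the "" key, i.e. the top-level .htaccess, which is exactly the case the question above leaves open:

```python
import posixpath
from collections import defaultdict


def group_by_parent(purl_paths):
    """Map each folder to the entry names its .htaccess would have to handle."""
    groups = defaultdict(list)
    for path in purl_paths:
        if path.endswith("/"):
            # Trailing slash: the purl is a folder and gets its own .htaccess;
            # the empty name stands for "the folder itself".
            folder, name = path.strip("/"), ""
        else:
            # No trailing slash: handle the entry in the parent's .htaccess.
            parent, name = posixpath.split(path)
            folder = parent.strip("/")
        groups[folder].append(name)
    return dict(groups)
```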
> >> >>
> >> >>
> >> >> As for conflicts this should be modified to only replace its "own"
> >> >> files by having a magic "#header".
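
A sketch of that ownership check, assuming the magic header is the first line of every generated file. The header text and the function name are made up for illustration:

```python
# Hypothetical magic first line written by the generator.
MAGIC_HEADER = "#header generated by w3id-csv - DO NOT EDIT"


def may_overwrite(existing_text):
    """Return True if the generator is allowed to (re)write this .htaccess.

    existing_text is None when no file exists yet; otherwise the file is
    only considered the generator's "own" if it starts with the magic header.
    """
    if existing_text is None:
        return True
    lines = existing_text.splitlines()
    return bool(lines) and lines[0].strip() == MAGIC_HEADER
```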
> >> >>
> >> >> We also talked about having a "native" CSV file approach for w3id.org
> >> >> - so this could be modified then to have a better file format that we
> >> >> can convert the purl.org dump into.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 29 February 2016 at 12:29, Stian Soiland-Reyes
> >> >> <soiland-reyes@cs.manchester.ac.uk> wrote:
> >> >> > Yeah, let's get this going.
> >> >> >
> >> >> > So looking at the purl database schema we don't really need the
> >> >> > group and user stuff to start with (although that could be added to
> >> >> > the README).
> >> >> >
> >> >> > The purls table itself should be sufficient to start. We can find
> >> >> > the different "type" values in the purl.org source code I think?
> >> >> >
> >> >> >
> >> >> >
> >> >> > On 29 February 2016 at 11:58, Norman Gray <norman@astro.gla.ac.uk>
> >> >> > wrote:
> >> >> >>
> >> >> >> Greetings, all.
> >> >> >>
> >> >> >> A little while ago (and this message is a reply to
> >> >> >> <https://lists.w3.org/Archives/Public/public-perma-id/2015Dec/0001.html>,
> >> >> >> to resuscitate the thread), there was some interest expressed in a
> >> >> >> purl.org successor.  That thread ended on a positive note, with
> >> >> >> David Wood and some others having access to the schema, and OCLC
> >> >> >> apparently keen on passing forward the current repository.
> >> >> >>
> >> >> >> I was asked about purl.org by a colleague today, and this reminded
> >> >> >> me about last November/December's thread: is there any news about
> >> >> >> purl.org or the broader preservation plan, that can be passed on?
> >> >> >> Or is there any way that I or others could help with this?
> >> >> >>
> >> >> >>
> >> >> >> All the best,
> >> >> >>
> >> >> >> Norman
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Norman Gray  :  https://nxg.me.uk
> >> >> >> SUPA School of Physics and Astronomy, University of Glasgow, UK
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Stian Soiland-Reyes, eScience Lab
> >> >> > School of Computer Science
> >> >> > The University of Manchester
> >> >> > http://soiland-reyes.com/stian/work/
> >> >> > http://orcid.org/0000-0001-9842-9718
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Shane McCarron
> >> > Managing Director, Applied Testing and Technology, Inc.



-- 
Shane McCarron
Managing Director, Applied Testing and Technology, Inc.
Received on Monday, 29 February 2016 18:08:06 UTC