
Re: Problems and Opportunities at purl.org

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Mon, 29 Feb 2016 21:06:59 +0000
Message-ID: <CAPRnXtmjSO0NY-oAWV69KuT==Z5QqFw3n8FqfvLf7U1BoRKBxg@mail.gmail.com>
To: Shane McCarron <shane@aptest.com>
Cc: Permanent Identifier CG <public-perma-id@w3.org>
Yeah, that sounds good. rules.csv would be processed only if there is
no .htaccess at all, or if the existing one has a DO NOT MODIFY block.
If the block goes away, so does the auto-generation.
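
The rule above could be sketched in Python 3 roughly like this. It is
only a sketch: the function name regenerate and the exact marker
strings are assumptions (the markers are modeled on the example quoted
further down), and generated_rules stands for whatever text the CSV
step produces.

```python
from typing import Optional

# Marker strings are assumptions, modeled on the example in this thread.
BEGIN = "## DO NOT MODIFY section below - auto-generated from rules.csv ##"
END = "## END autogenerated ##"


def regenerate(existing: Optional[str], generated_rules: str) -> Optional[str]:
    """Return the new .htaccess text, or None if the file must be left alone.

    existing is the current .htaccess content, or None if there is no file.
    """
    block = f"{BEGIN}\n{generated_rules.rstrip()}\n{END}"
    if existing is None:
        # No .htaccess at all: create one that consists only of the block.
        return block + "\n"
    start = existing.find(BEGIN)
    end = existing.find(END, start + len(BEGIN)) if start != -1 else -1
    if start == -1 or end == -1:
        # Hand-written file without a marked block: never touch it.
        return None
    # Everything outside the markers is kept exactly as it was.
    return existing[:start] + block + existing[end + len(END):]
```

A None result means "skip this folder", so removing the block from a
.htaccess switches auto-generation off for that folder.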

If I start hacking on this, what is available on the machine that can
run the hook? I used Python 3, is that OK? Is that maintainable by the
rest of you, or do you prefer something else?


As for what happens with in-place replacing, we can write to a
temporary file and do an atomic move at the end; that's also best in
case the machine goes down midway through the update. (Sadly it's not
possible to do an atomic swap, so you would still be exposed for half a
millisecond between deleting .htaccess and linking .htaccess.12981982
to .htaccess, but a git update would have a similar glitch of a few
milliseconds.)
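
A minimal sketch of the temporary-file approach, assuming Python 3 and
that the temporary file lives in the same directory as the target. The
name write_atomically and the 0o644 mode are assumptions. Note that
os.replace renames over an existing target in a single step on POSIX
systems, which may avoid the delete-then-link gap altogether.

```python
import os
import tempfile


def write_atomically(path: str, text: str) -> None:
    """Write text to path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory, so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".htaccess.")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # mkstemp creates the file readable by its owner only; the web
        # server needs read access (0o644 is an assumption).
        os.chmod(tmp, 0o644)
        # Rename over the target; readers see the old or the new file.
        os.replace(tmp, path)
    except BaseException:
        # Something failed before the rename: leave the old file intact.
        os.unlink(tmp)
        raise
```

If the process dies before os.replace, the old .htaccess is still in
place and only a stray temporary file is left behind.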

Or, if it's committing back, it could be two separate steps, which
is probably good in case someone breaks their CSV file: the old
.htaccess then remains, yet the other .htaccess files can still be
updated.
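
To illustrate the "one broken CSV does not block the others" idea, here
is a rough sketch. The CSV layout is an assumption: a header row with
a src column (as in the gist mentioned below) and a hypothetical
target column. The R=302 flag is also just a placeholder, since the
right redirect status per purl type has not been worked out yet.

```python
import csv
import io
import re
from typing import Dict, Tuple


def rules_to_htaccess(csv_text: str) -> str:
    """Turn one rules.csv (columns src,target: an assumption) into rules."""
    lines = ["RewriteEngine On"]
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=2):  # line 1 is the header
        src, target = row.get("src"), row.get("target")
        if src is None or not target:
            raise ValueError(f"line {number}: need both src and target")
        # Entries starting with ^ are taken as regular expressions;
        # anything else is a literal path relative to the folder.
        pattern = src if src.startswith("^") else "^" + re.escape(src) + "/?$"
        lines.append(f"RewriteRule {pattern} {target} [R=302,L]")
    return "\n".join(lines) + "\n"


def generate_all(sources: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Generate per-folder rules; a broken CSV only affects its own folder."""
    generated: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for folder, csv_text in sources.items():
        try:
            generated[folder] = rules_to_htaccess(csv_text)
        except (ValueError, csv.Error) as problem:
            # Keep the old .htaccess for this folder and carry on.
            errors[folder] = str(problem)
    return generated, errors
```

Only the folders in the first returned dict would be written back; the
folders in the second keep their previous .htaccess until the CSV is
fixed.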

On 29 February 2016 at 18:07, Shane McCarron <shane@aptest.com> wrote:
> travis with commit rights seems like the right level of granularity, and I
> am good with the DO NOT MODIFY block.  With the caveat that there MUST be
> such a block in place in order for rules.csv to be processed at all?  That
> way there is no room for error.
>
> As to how it is done now.... there is a script that gets run to do an
> update.  But I don't know how it is triggered.  @davidlehn do you remember?
>
> On Mon, Feb 29, 2016 at 11:57 AM, Stian Soiland-Reyes
> <soiland-reyes@cs.manchester.ac.uk> wrote:
>>
>> I think a dual approach is best - I don't want to dismiss the existing
>> .htaccess files, some of which might not fit whatever format we decide
>> on for those CSV files.
>>
>>
>> I would prefer the auto-updated section to be the one that is marked out,
>> e.g.
>>
>> ## DO NOT MODIFY section below - auto-generated from rules.csv ##
>> RewriteEngine On
>> RewriteRule ...
>>
>> ## END autogenerated  ##
>>
>> RewriteRule .* ...SomethingWeird [L,funny=true]
>>
>>
>>
>>
>> There are no post-hooks on GitHub, so you would have to do the
>> processing at deployment time (how is it done currently? Just git pull
>> in the right folder? cronjob?)
>>
>> You can have a GitHub web-hook to trigger something "elsewhere", or
>> have a Travis job with commit rights (using some Travis secrets
>> feature).  (It would obviously need to ignore just its own commits)
>>
>> On 29 February 2016 at 17:45, Shane McCarron <shane@aptest.com> wrote:
>> > In general I don't *hate* the idea of permitting the use of CSV files to
>> > drive the creation / updating of the .htaccess files.  But I would prefer
>> > this to be an option.  I think my mental model was that this was a
>> > one-time migration from purl.org - after that we would just use .htaccess
>> > files as we have been.  But I appreciate the thought that this might be
>> > overly onerous for some significant number of potential users.  Editing
>> > those things is not for the meek!
>> >
>> > What would people think about a rule set like:
>> >
>> > 1. If there is a .htaccess file in a directory, that file can have
>> > sections in it that are demarcated and will never be automatically
>> > modified.
>> > 2. If there is a rules.csv file in a directory, that file contains mapping
>> > rules that will update the (non-demarcated) parts of the .htaccess file in
>> > the directory (creating the file if necessary).
>> >
>> > I haven't tried to implement this sort of GitHub post-push processing
>> > magic on branches / pull requests before.  Is that even possible?
>> >
>> >
>> > On Mon, Feb 29, 2016 at 11:32 AM, Stian Soiland-Reyes
>> > <soiland-reyes@cs.manchester.ac.uk> wrote:
>> >>
>> >> On 29 February 2016 at 16:35, Shane McCarron <shane@aptest.com> wrote:
>> >> > The only downside to a huge top level .htaccess is the difficulty of
>> >> > editing / maintaining it.  Otherwise I am not concerned.  Apache
>> >> > .htaccess processing is efficient enough for these purposes imho.
>> >>
>> >> I guess you meant to reply to the list, so I've CCed it in.
>> >>
>> >> Another issue then is that if we are to allow editing a CSV file to
>> >> re-generate .htaccess (rather than a one-off move), then we have to be
>> >> extra careful that there aren't any other modifications to the
>> >> top-level .htaccess.
>> >>
>> >> I was picturing we could move to a model where you have a folder; for
>> >> example, look at
>> >> https://github.com/perma-id/w3id.org/blob/master/cwl/
>> >> Instead of the current .htaccess there, you could have a CSV file
>> >> like
>> >>
>> >> https://gist.github.com/stain/c2d668b11b66948b5991
>> >>
>> >> It should be quite easy to generate the corresponding .htaccess from
>> >> such files - they can have some headers to warn you:
>> >>
>> >> ## DO NOT EDIT
>> >> RewriteEngine On
>> >> ## END DO NOT EDIT
>> >>
>> >>
>> >> I think we can still do regular expressions, if they start with ^
>> >> (which I think is fair enough),
>> >>
>> >> and the src paths are relative to the folder you are in, so on that
>> >> example the one with "context" in src basically means
>> >> https://w3id.org/cwl/context
>> >>
>> >> Special case then is for the folder itself, so either . or empty
>> >> string.
>> >>
>> >>
>> >>
>> >>
>> >> The Very Advanced Edition can allow full paths like /cwl/context  -
>> >> where the prefix from the current directory MUST match.  (or we can
>> >> say this is the required format, even).  This does however not work on
>> >> the regular expression side - as RewriteRules in a folder are relative
>> >> to their location (naturally). It's probably better to have a limited
>> >> number of options, so it's easy to validate the CSV files before
>> >> trying to generate the .htaccess.
>> >>
>> >>
>> >>
>> >> > On Mon, Feb 29, 2016 at 10:04 AM, Stian Soiland-Reyes
>> >> > <soiland-reyes@cs.manchester.ac.uk> wrote:
>> >> >>
>> >> >> I started
>> >> >> https://github.com/stain/w3id-csv
>> >> >>
>> >> >> It's quite a simple start, but it uses a CSV file like
>> >> >>
>> >> >> https://github.com/stain/w3id-csv/blob/master/purl_example.csv
>> >> >> which matches the schema David Wood mentioned.
>> >> >>
>> >> >> and then generates a bunch of .htaccess files.
>> >> >>
>> >> >> You can test it on a dummy install of Apache httpd with Docker - see
>> >> >> the README.
>> >> >>
>> >> >>
>> >> >> Obviously this script is quite naive for now, in that it makes a folder
>> >> >> for every purl.org entry, which (in addition to making loads of files)
>> >> >> would be a bit wrong (e.g. the purl /fred/soup.html would make
>> >> >> fred/soup.html/.htaccess, which would mean an intermediate HTTP
>> >> >> redirect from soup.html to soup.html/).  I've also not gone through
>> >> >> the different types yet to do subtree matching or the correct HTTP
>> >> >> redirection status code.
>> >> >>
>> >> >> So one simple improvement would be to check if the path ends with a /
>> >> >> in purl.org or not - and then group those entries within the parent
>> >> >> path so there would be a bigger .htaccess.  However I think we want to
>> >> >> avoid a single large top-level .htaccess for registrations like
>> >> >> http://purl.org/pav  without a trailing / ?
>> >> >>
>> >> >>
>> >> >> As for conflicts this should be modified to only replace its "own"
>> >> >> files by having a magic "#header".
>> >> >>
>> >> >> We also talked about having a "native" CSV file approach for w3id.org
>> >> >> - so this could be modified then to have a better file format that we
>> >> >> can convert the purl.org dump into.
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On 29 February 2016 at 12:29, Stian Soiland-Reyes
>> >> >> <soiland-reyes@cs.manchester.ac.uk> wrote:
>> >> >> > Yeah, let's get this going.
>> >> >> >
>> >> >> > So looking at the purl database schema we don't really need the group
>> >> >> > and user stuff to start with (although that could be added to the
>> >> >> > README).
>> >> >> >
>> >> >> > The purls table itself should be sufficient to start. We can find the
>> >> >> > different "type" values in the purl.org source code I think?
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On 29 February 2016 at 11:58, Norman Gray <norman@astro.gla.ac.uk>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Greetings, all.
>> >> >> >>
>> >> >> >> A little while ago (and this message is a reply to
>> >> >> >> <https://lists.w3.org/Archives/Public/public-perma-id/2015Dec/0001.html>,
>> >> >> >> to resuscitate the thread), there was some interest expressed in a
>> >> >> >> purl.org successor.  That thread ended on a positive note, with David
>> >> >> >> Wood and some others having access to the schema, and OCLC apparently
>> >> >> >> keen on passing forward the current repository.
>> >> >> >>
>> >> >> >> I was asked about purl.org by a colleague today, and this reminded me
>> >> >> >> about last November/December's thread: is there any news about
>> >> >> >> purl.org or the broader preservation plan, that can be passed on?  Or
>> >> >> >> is there any way that I or others could help with this?
>> >> >> >>
>> >> >> >>
>> >> >> >> All the best,
>> >> >> >>
>> >> >> >> Norman
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Norman Gray  :  https://nxg.me.uk
>> >> >> >> SUPA School of Physics and Astronomy, University of Glasgow, UK
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Stian Soiland-Reyes, eScience Lab
>> >> >> > School of Computer Science
>> >> >> > The University of Manchester
>> >> >> > http://soiland-reyes.com/stian/work/
>> >> >> > http://orcid.org/0000-0001-9842-9718
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Stian Soiland-Reyes, eScience Lab
>> >> >> School of Computer Science
>> >> >> The University of Manchester
>> >> >> http://soiland-reyes.com/stian/work/
>> >> >> http://orcid.org/0000-0001-9842-9718
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Shane McCarron
>> >> > Managing Director, Applied Testing and Technology, Inc.
>> >>
>> >>
>> >>
>> >> --
>> >> Stian Soiland-Reyes, eScience Lab
>> >> School of Computer Science
>> >> The University of Manchester
>> >> http://soiland-reyes.com/stian/work/
>> >> http://orcid.org/0000-0001-9842-9718
>> >
>> >
>> >
>> >
>> > --
>> > Shane McCarron
>> > Managing Director, Applied Testing and Technology, Inc.
>>
>>
>>
>> --
>> Stian Soiland-Reyes, eScience Lab
>> School of Computer Science
>> The University of Manchester
>> http://soiland-reyes.com/stian/work/
>> http://orcid.org/0000-0001-9842-9718
>
>
>
>
> --
> Shane McCarron
> Managing Director, Applied Testing and Technology, Inc.



-- 
Stian Soiland-Reyes, eScience Lab
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/    http://orcid.org/0000-0001-9842-9718
Received on Monday, 29 February 2016 21:07:52 UTC
