Re: Problems and Opportunities at purl.org from Stian Soiland-Reyes on 2016-02-29 (public-perma-id@w3.org from February 2016)

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Mon, 29 Feb 2016 17:32:40 +0000
To: Shane McCarron <shane@aptest.com>
Cc: Permanent Identifier CG <public-perma-id@w3.org>
Message-ID: <CAPRnXt=9jwVpSD818tt3MEdzRa-KreEiAkgGJ5ZuQ9R=kX9CJA@mail.gmail.com>
On 29 February 2016 at 16:35, Shane McCarron <shane@aptest.com> wrote:
> The only downside to a huge top level .htaccess is the difficulty of editing
> / maintaining it.  Otherwise I am not concerned.  Apache .htaccess
> processing is efficient enough for these purposes imho.

I guess you meant to reply to the list, so I've CCed it in.

Another issue: if we allow editing a CSV file to regenerate the
.htaccess (rather than doing a one-off move), then we have to be
extra careful that there aren't any other modifications to the
top-level .htaccess.

I was picturing we could move to a model where each registration has a
folder. For example, look at
https://github.com/perma-id/w3id.org/blob/master/cwl/
Instead of the current .htaccess there, you could have a CSV file like

https://gist.github.com/stain/c2d668b11b66948b5991

It should be quite easy to generate the corresponding .htaccess from
such files. The generated file can carry header comments to warn
against manual edits:

## DO NOT EDIT
RewriteEngine On
## END DO NOT EDIT
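
A minimal Python sketch of how a generator could use such markers to
leave manual edits alone, regenerating only the lines between them.
The function name replace_generated and the exact marker handling are
assumptions for illustration, not something the w3id-csv script is
known to do:

```python
from typing import List

BEGIN = "## DO NOT EDIT"
END = "## END DO NOT EDIT"


def replace_generated(existing: str, rules: List[str]) -> str:
    """Return the .htaccess text with only the marked block regenerated."""
    block = [BEGIN, "RewriteEngine On"] + list(rules) + [END]
    lines = existing.splitlines()
    if BEGIN in lines and END in lines and lines.index(BEGIN) < lines.index(END):
        start, stop = lines.index(BEGIN), lines.index(END)
        # Swap the old generated block for the new one; lines outside
        # the markers (manual edits) are kept as they were.
        lines[start:stop + 1] = block
    else:
        # No generated block yet: append one after any manual lines.
        lines.extend(block)
    return "\n".join(lines) + "\n"
```

Anything outside the two marker lines survives a regeneration, which
is the property we would need if hand-written rules and generated
rules share one file.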


I think we can still do regular expressions, if they start with ^
(which I think is fair enough).

The src paths are relative to the folder you are in, so in that
example the one with "context" in src basically means
https://w3id.org/cwl/context

The special case is then the folder itself, written as either . or the
empty string.
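
The mapping described above could be sketched roughly like this in
Python. The column names src and dest, the function names, and the
302 status are assumptions for illustration; the real CSV layout in
the gist may differ:

```python
import csv
import io
import re


def rule_for(src, dest, status=302):
    """Build one RewriteRule line for a folder-relative src (assumed format)."""
    src = src.strip()
    if src.startswith("^"):
        # Already a regular expression: pass it through unchanged.
        pattern = src
    elif src in ("", "."):
        # The folder itself: inside a per-directory .htaccess the
        # remaining path is empty.
        pattern = "^$"
    else:
        # Plain relative path: match it literally and in full.
        pattern = "^" + re.escape(src) + "$"
    return "RewriteRule {} {} [R={},L]".format(pattern, dest, status)


def htaccess_from_csv(text):
    """Turn CSV text with src,dest columns into .htaccess content."""
    rows = csv.DictReader(io.StringIO(text))
    lines = ["## DO NOT EDIT", "RewriteEngine On"]
    lines += [rule_for(row["src"], row["dest"]) for row in rows]
    lines.append("## END DO NOT EDIT")
    return "\n".join(lines) + "\n"
```

With a row like context,https://example.org/context.json placed in
the cwl folder, this would emit a rule matching ^context$, i.e. the
request for https://w3id.org/cwl/context.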




The Very Advanced Edition could allow full paths like /cwl/context,
where the prefix from the current directory MUST match (or we could
even say this is the required format). However, this does not work on
the regular expression side, as RewriteRules in a folder are relative
to their location (naturally). It's probably better to have a limited
number of options, so it's easy to validate the CSV files before
trying to generate the .htaccess.
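
As a rough sketch of that validation step in Python: a full path is
accepted only if it starts with the folder it lives in, and is then
reduced to the folder-relative form. The function name normalize_src
and the choice to raise ValueError are assumptions for illustration:

```python
def normalize_src(folder, src):
    """Return src relative to folder (e.g. 'cwl'), or raise ValueError."""
    src = src.strip()
    if src.startswith("^") or src in ("", "."):
        # Regular expressions and the folder itself are left as they are.
        return src
    if src.startswith("/"):
        prefix = "/" + folder.strip("/")
        if src.rstrip("/") == prefix:
            # A full path naming the folder itself.
            return "."
        if not src.startswith(prefix + "/"):
            raise ValueError(
                "{!r} is outside folder {!r}".format(src, folder))
        return src[len(prefix) + 1:]
    # Already a relative path.
    return src
```

Running every row through a check like this before writing any
.htaccess would reject, say, /other/context in the cwl folder up
front rather than producing a rule that can never match.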



> On Mon, Feb 29, 2016 at 10:04 AM, Stian Soiland-Reyes
> <soiland-reyes@cs.manchester.ac.uk> wrote:
>>
>> I started
>> https://github.com/stain/w3id-csv
>>
>> It's quite a simple start, but it uses a CSV file like
>>
>> https://github.com/stain/w3id-csv/blob/master/purl_example.csv
>> which matches the schema David Wood mentioned.
>>
>> and then generates a bunch of .htaccess files.
>>
>> You can test it on a dummy install of Apache httpd with Docker - see the
>> README.
>>
>>
>> Obviously this script is quite naive for now in that it makes a
>> folder for every purl.org entry, which (in addition to making loads
>> of files) would be a bit wrong: e.g. the purl /fred/soup.html would
>> make fred/soup.html/.htaccess, which would mean an intermediate HTTP
>> redirect from soup.html to soup.html/. I've also not gone through
>> the different types yet to do subtree matching or the correct HTTP
>> redirection status code.
>>
>> So one simple improvement would be to check if the path ends with a /
>> in purl.org or not - and then group those entries within the parent
>> path so there would be a bigger .htaccess.  However I think we want to
>> avoid a single large top-level .htaccess for registrations like
>> http://purl.org/pav  without a trailing / ?
>>
>>
>> As for conflicts, this should be modified to only replace its "own"
>> files, identified by a magic "#header".
>>
>> We also talked about having a "native" CSV file approach for w3id.org
>> - so this could be modified then to have a better file format that we
>> can convert the purl.org dump into.
>>
>>
>>
>>
>> On 29 February 2016 at 12:29, Stian Soiland-Reyes
>> <soiland-reyes@cs.manchester.ac.uk> wrote:
>> > Yeah, let's get this going.
>> >
>> > So looking at the purl database schema we don't really need the group
>> > and user stuff to start with (although that could be added to the
>> > README).
>> >
>> > the purls table itself should be sufficient to start. We can find the
>> > different "type" values in the purl.org source code I think?
>> >
>> >
>> >
>> > On 29 February 2016 at 11:58, Norman Gray <norman@astro.gla.ac.uk>
>> > wrote:
>> >>
>> >> Greetings, all.
>> >>
>> >> A little while ago (and this message is a reply to
>> >>
>> >> <https://lists.w3.org/Archives/Public/public-perma-id/2015Dec/0001.html>, to
>> >> resuscitate the thread), there was some interest expressed in a
>> >> purl.org
>> >> successor.  That thread ended on a positive note, with David Wood and
>> >> some
>> >> others having access to the schema, and OCLC apparently keen on passing
>> >> forward the current repository.
>> >>
>> >> I was asked about purl.org by a colleague today, and this reminded me
>> >> about
>> >> last November/December's thread: is there any news about purl.org or
>> >> the
>> >> broader preservation plan, that can be passed on?  Or is there any way
>> >> that
>> >> I or others could help with this?
>> >>
>> >>
>> >> All the best,
>> >>
>> >> Norman
>> >>
>> >>
>> >> --
>> >> Norman Gray  :  https://nxg.me.uk
>> >> SUPA School of Physics and Astronomy, University of Glasgow, UK
>> >>
>> >
>> >
>> >
>> > --
>> > Stian Soiland-Reyes, eScience Lab
>> > School of Computer Science
>> > The University of Manchester
>> > http://soiland-reyes.com/stian/work/
>> > http://orcid.org/0000-0001-9842-9718
>>
>>
>>
>> --
>> Stian Soiland-Reyes, eScience Lab
>> School of Computer Science
>> The University of Manchester
>> http://soiland-reyes.com/stian/work/
>> http://orcid.org/0000-0001-9842-9718
>>
>
>
>
> --
> Shane McCarron
> Managing Director, Applied Testing and Technology, Inc.



-- 
Stian Soiland-Reyes, eScience Lab
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/    http://orcid.org/0000-0001-9842-9718
Received on Monday, 29 February 2016 17:33:29 UTC