Re: Limitations of the service?

From: Car, Nicholas (L&W, Dutton Park) <Nicholas.Car@csiro.au>
Date: Mon, 9 Jul 2018 21:01:28 +0000
To: "David I. Lehn" <dil@lehn.org>, Pierre van Houtryve <pierre.vanhoutryve@gmail.com>
CC: Permanent Identifier CG <public-perma-id@w3.org>
Message-ID: <1AFD46A8-3DD4-490A-8EA1-C4D111E1A69E@csiro.au>
We developed a tool called the Persistent ID Service (PIDSvc) [1, 2] over six years ago to handle a range of PID use cases that Apache-style redirects weren't great at. One of those use cases was large numbers of 1:1 redirects or proxy matches where no pattern could be used.

The PIDSvc uses a PostgreSQL database for match storage, so it scales as well as any normal Postgres installation, meaning you can easily cater for millions (or even billions!) of matches. Because the matches are just rows in Postgres, you can also load them from CSV, and we've done that many times. We've used the tool in operational settings for the Australian government and elsewhere for all six years since its release, so it's very stable.
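To illustrate the "matches as rows" idea (this is not PIDSvc's actual schema; the table, column names, and paths below are invented), a CSV of source-path to target-URL pairs can be bulk-loaded and then resolved with a single indexed lookup. With Postgres you would typically use COPY; the standard-library sqlite3 module keeps this sketch self-contained:

```python
import csv
import io
import sqlite3

# Hypothetical mapping table; the real PIDSvc schema will differ.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mapping (source TEXT PRIMARY KEY, target TEXT)")

# In practice this would be open("inventory.csv"); inline data keeps it runnable.
csv_data = io.StringIO(
    "source,target\n"
    "/museum/item/1,https://example.org/collection/1\n"
    "/museum/item/2,https://example.org/collection/2\n"
)
rows = [(r["source"], r["target"]) for r in csv.DictReader(csv_data)]
db.executemany("INSERT INTO mapping VALUES (?, ?)", rows)

# Resolving a PID is then one indexed lookup, however many rows exist.
(target,) = db.execute(
    "SELECT target FROM mapping WHERE source = ?", ("/museum/item/2",)
).fetchone()
print(target)  # https://example.org/collection/2
```

The point of the database-backed approach is that lookup cost stays flat as the number of matches grows, which is where a flat .htaccess file struggles.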

We're very happy to help you look into this tool if you think it would be appropriate for your use case.

Regards, 

Nick

[1] https://www.seegrid.csiro.au/wiki/Siss/PIDService - tool's home online including documentation
[2] https://mssanz.org.au/modsim2015/C8/golodoniuc.pdf - descriptive paper

Nicholas Car
Senior Experimental Scientist
CSIRO Land & Water
41 Boggo Road, Dutton Park, QLD 4102, Australia
E nicholas.car@csiro.au M 0477 560 177 P 07 3833 5632




On 9/7/18, 8:47 pm, "David I. Lehn" <dil@lehn.org> wrote:

    On Thu, Jul 5, 2018 at 2:19 PM, Pierre van Houtryve
    <pierre.vanhoutryve@gmail.com> wrote:
    > ...
    > We don't have a page for the project yet, but in short, the goal of the
    > project would be to help institutions such as museums generate Permanent
    > URLs from existing excel sheets (exported to .csv).
    > These often contain the whole inventory of the museum, so they're pretty
    > large with maybe 10k to 20k lines.
    > Our tool would take these sheets as input and generate the .htaccess file,
    > then it'd make a pull request on the w3id GitHub repository for the user.
    >
    > What we need to know before starting to develop this project is if you are
    > okay with this kind of application.
    > Some of us worry that this could be qualified as 'abusing' the repo or
    > 'GitHub' as a whole, what do you think about this?
    >
    
    It's not really "abuse", but it is a bit outside of what we're
    currently doing and may not be as maintainable as w3id maintainers or
    the organizations using it would like.  As mentioned, we do currently
    manually approve things and spot check for issues.  No one will
    actually review thousands of redirection rules!
    
    For the case of something like a museum with thousands of items, I
    would think the URLs have a regular pattern?  It's far easier to
    maintain a simple w3id.org wildcard rule than rules for each item.
    Many of the use cases right now are just mapping some wildcard path to
    a target URL.  That keeps the w3id.org rules very simple and the
    target host can do whatever mapping it needs to with the incoming
    requests.  Without seeing the actual data, it's hard to say what the
    best approach is.
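As an example of the wildcard approach David describes (the path and target host here are invented for illustration, not a real w3id.org entry), a single mod_rewrite rule can forward an entire museum namespace and leave the per-item mapping to the institution's own server:

```apache
# Hypothetical w3id.org rule: forward everything under /example-museum/
# to the institution's own resolver, which handles the per-item lookup.
RewriteEngine on
RewriteRule ^example-museum/(.*)$ https://resolver.example-museum.org/$1 [R=302,L]
```

One rule like this replaces thousands of 1:1 lines, as long as the item identifiers follow a regular pattern.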
    
    As a general question, why do you want the complexity to be on
    w3id.org rather than on the target servers?
    
    -dave
    
    > ...
    > Le jeu. 5 juil. 2018 à 19:51, David I. Lehn <dil@lehn.org> a écrit :
    >>
    >> On Thu, Jul 5, 2018 at 8:25 AM, Pierre van Houtryve
    >> <pierre.vanhoutryve@gmail.com> wrote:
    >> > Hello,
    >> >
    >> > We're a group of developers tasked with making a tool to automate pull
    >> > requests on the github repo of w3id.org (The requests must still be
    >> > triggered manually by the user, but we handle the interaction with
    >> > GitHub.
    >> > Our job is to make it user-friendly and add features such as import from
    >> > csv)
    >> >
    >> > Before we begin our project, we need to ask you a few questions.
    >> >
    >> > First of all, would you accept these pull requests? The pull requests
    >> > would
    >> > come from the GitHub account of the user, but the body/content of the
    >> > pull
    >> > request would be computer generated.
    >> >
    >> > Also, is there a limit to the size of the .htaccess file, or can they
    >> > get as
    >> > big as the client needs them to be (10-20k lines maybe) ?
    >> >
    >> > Thank you,
    >> >
    >> > The Resolver team of Open Summer of Code 2018.
    >>
    >> Is there more info available on this project?  It's unclear what you
    >> are trying to do.  What use case requires a 10k+ line .htaccess file?
    >> What input would generate that?  A concern is that we basically have a
    >> few people that approve updates by hand at the moment.  Mostly all use
    >> cases are fairly simple at the moment too.  The complex .htaccess
    >> files are mostly due to type negotiation.
    >>
    >> As far as csv input, I think various people have had thoughts on that
    >> sort of direction but no one has fleshed out the ideas.  Many of the
    >> current simple use cases could be put in a csv/json/yaml/toml file
    >> that gets converted to a .htaccess file.  I imagine that sort of thing
    >> would be integrated into w3id setup itself vs something external that
    >> generates PRs.
    >>
    >> -dave
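The csv-to-.htaccess conversion sketched in the quoted mail could be very small. Here is one possible shape (the CSV column names and the choice of RedirectTemp are assumptions for illustration, not w3id policy):

```python
import csv
import io

# Hypothetical CSV layout: one (path, target) pair per row.
csv_data = io.StringIO(
    "path,target\n"
    "item/1,https://example.org/collection/1\n"
    "item/2,https://example.org/collection/2\n"
)

# Emit one temporary-redirect directive per row; a 10-20k line sheet
# would produce a 10-20k line .htaccess file, which is the concern above.
lines = [
    f'RedirectTemp "/{row["path"]}" "{row["target"]}"'
    for row in csv.DictReader(csv_data)
]
print("\n".join(lines))
```

This makes the trade-off concrete: the conversion itself is trivial, but the output grows linearly with the inventory, which is why a wildcard rule plus server-side mapping is usually preferred.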
    
    

Received on Monday, 9 July 2018 21:02:29 UTC

This archive was generated by hypermail 2.3.1 : Monday, 9 July 2018 21:02:30 UTC