Re: stitching together APIs

On Oct 3, 2012, at 8:09 AM, Rion Dooley <deardooley@gmail.com> wrote:


On Oct 2, 2012, at 9:06 PM, David Skinner <deskinner@lbl.gov> wrote:



On Tue, Oct 2, 2012 at 1:13 PM, Rion Dooley <deardooley@gmail.com> wrote:

> I agree with David's general point, though I'm wary of issues creeping in
> from different OSes impacting the evaluation of the API itself. Perhaps using
> something like CloudFoundry or even a VM with a pretty vanilla stack would
> serve the same purpose and level the playing field some. Also, are we
> assuming the API is running on the HPC system or as a hosted service?
>
>
Sure. There is something I like about the simplicity of the tarball,
configure, make, run all being local, but I also see your points about client
support. I think it's safe to say that the target platform is ultimately a
Linux node sitting close to the batch and file systems of a big Linux machine.
That's something you can reliably mimic on, say, an Apple laptop or a server
node. Am I missing other platforms that we'd want to support easy test
drives on?

As for hosting, so far the assumption has mostly been that it runs at the
center. Some places like LLNL would have big issues with hosting-out access
to the big iron. I'd let them speak to that, but that's my guess. BTW,
Globus Online has a hosted approach for data movement, an associated REST
API, etc. My feeling is that for execution and job control the HPC-system-based
approach has some heavy upsides. What do others think?


I see your point about LLNL and other high security situations, but it's
fair to consider the many centers where admins simply won't allow services
like these to run on their head nodes. Also, are we assuming it runs as
root, or in user space?


So bear in mind that the API and the implementation are independent. The
discussion with LLNL thus far has been fully about the former. For the
implementation it will come down to something that runs as root, IMO, but
that may be sshd or a gatekeeper. The key, I think, is not asking HPC center
staff to install something new that runs as root.

I often form opinions too quickly but try to be quick to adjust them too.
The issue with no root seems twofold: 1) a userland process communicating
off some high port will be hard to make persistent and will be subject to
network controls and firewalls; 2) in the current APIs and implementations
there is significant center-level knowledge about machine names, the brand
of batch queue they use, etc. Having that config done once and well seems
preferable.



Alternatively we (NERSC, TACC, etc.) could be the cloud test/dev space for
the API. Sufficiently stubbed out, it would be hard for people to make
trouble with and it would be a zero-step install. Think of those CMS demo
pages that let you log in to test drive. Anything that makes it easier for
people to get a taste of HPC on the web I am all for considering. I suggest
that if we go with Cloud Foundry, NERSC/TACC, or whatever, we step
through what we're asking newcomers to do in order to try it out. This will
appeal both to the application/user folks who are interested in HPC web
interface options and to the facilities/center people who are going to
evaluate whether they can live with it.


One of the benefits of running on Cloud Foundry or any other PaaS would be
that the API services could be deployed to any "cloud" as well as to the
desktop.


Tell me more. I may not get this part. If the tarball we produce is solid
like gzip (configure; make), it could go on any laptop, server, cloud, or
HPC system that is unixy.


One of the icebergs that sank the grid ship was difficulty in getting
software and services up and running. Highly layered middleware, hard
installs, pages of XML configs, etc. I am against all that stuff.


Amen. So at the very least we'll need a clean web app with a simple
wizard/form to configure the scheduler, account types, auth mechanism,
default data protocol and file system, admin accounts, and some performance
preferences. What else have I missed?
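
Just to make that list concrete, something on the order of the following,
where every field name is invented for discussion rather than anything NEWT
or AGAVE actually reads:

    # Illustrative only -- all field names are made up for discussion.
    center_config = {
        "scheduler": "slurm",                    # or pbs, sge, lsf, ...
        "auth_mechanism": "pam",                 # or ldap, myproxy, ...
        "default_data_protocol": "sftp",
        "default_filesystem": "/global/scratch",
        "admin_accounts": ["admin1", "admin2"],
        "max_concurrent_transfers": 4,           # a performance preference
    }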


As for names, NERSC Web Toolkit has no great value attached to it. We
discussed with LLNL and others that it could be the Nice and Easy Web
Toolkit, or whatever else. Again if the name can be changed to make the API
non-territorial in order to increase adoption I am all for it. I have no
idea what it means but I think Globus was a good choice. Big. Round.
Inclusive.


The auth, data, jobs, and metadata services seem to be a good starting
> place. We might also want some information services such as system
> discovery and monitoring. Given that this is meant to drive web apps…and
> hopefully future ones, perhaps supporting event and pub/sub services would
> also be helpful. Lastly, is the api in charge of monitoring itself or are
> we assuming that's a production detail the centers would implement
> themselves? One of the things we've done with AGAVE is provide both real
> time and historical uptime reports for our users. This service is deployed
> outside the api, and lets us know the ongoing stability of our services and
> the systems and services we depend on. We find that it also helps build
> trust with our users. I'm not sure that this service is really in the scope
> of the API, but it's one of those things that, until we had it, we never
> knew we were missing. What are other people's thoughts on this?
>
>
I'd go for monitoring as a core topic. Job status already is (GET on jobid),
as is queue monitoring (GET on system). Monitoring is something my group
does a lot of (app perf, FS perf, power/env monitoring, etc.), so I know
that scope creep is very possible here. What about these topics?
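
To keep the scope concrete, roughly the two calls I mean, with the URL
shapes as placeholders for discussion rather than NEWT's or AGAVE's actual
routes:

    # URL shapes are placeholders, not real NEWT/AGAVE endpoints.
    import requests

    BASE = "https://hpc.example.org/api"

    def job_status(jobid):
        # GET on a job id -> current state of that job
        return requests.get(f"{BASE}/jobs/{jobid}").json()

    def queue_status(system):
        # GET on a system -> queue backlog, run limits, node counts, etc.
        return requests.get(f"{BASE}/systems/{system}/queue").json()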


I suppose this depends on what the purpose of the API is. If it's a target
for developers to build apps that let end users access the resources and
conduct science from the web, then providing sysadmin tools might be
overkill. I'm not sure our next chemistry gateway would see much value in
having access to power consumption stats on each job. It would make for
some great visualizations, though. The question seems to be who our
target audience is.


I do caution against creep in monitoring topics. So while I laid out some
topics, they are not high priority. Getting power numbers is very hard. We
at NERSC have 0.5 FTE working on that and it's very custom to the facility.
The moment it gets easy, though, I would go for it. Energy metrics are the
next big thing that centers will address in performance and charging.


system monitoring: uptime, core count, #people logged in, date of
deployment, pub/sub on outages to steer workflow automation (back off when
an outage is announced)
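
As a rough sketch of the "back off" idea, using plain polling as a stand-in
for real pub/sub (the endpoint and the "outage" field are made up for
illustration):

    # Polling stand-in for pub/sub; the "outage" field is a made-up example.
    import time
    import requests

    def submit_when_healthy(submit_fn, system, base="https://hpc.example.org/api"):
        while True:
            status = requests.get(f"{base}/systems/{system}/status").json()
            if not status.get("outage"):
                return submit_fn()
            time.sleep(300)  # outage announced: back off and retry later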


This all seems like good info. Do we envision supporting single or
multi-tenancy? If running on the HPC system, do we foresee API sessions
being tied to system sessions?


What's a system session? I see an API session as tied to auth/deauth against
the base URL. For cross-site and other reasons the session would be tied
to the server.
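
Concretely, I picture something like this, where the endpoint names are
invented and only the shape matters:

    # Endpoint names are invented; the point is the session follows the base URL.
    import requests

    BASE = "https://hpc.example.org/api"
    session = requests.Session()

    # auth establishes the session (token or cookie held by the Session object)
    session.post(f"{BASE}/auth", data={"username": "testuser", "password": "..."})

    jobs = session.get(f"{BASE}/jobs").json()   # reuses the same session

    session.delete(f"{BASE}/auth")              # deauth ends it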


self monitoring: introspection on the number of sessions and recent API
activity; could be admin-only.



FS monitoring: df for the web

data transfer monitoring: maybe this is GO's territory, not ours?


There are some significant auth challenges to doing this in the general
case and if we want to ship something as a deployed solution, we need to
support sftp, ftp, fops, gridftp, and irods out of the box so users can
access data how they see fit.


So are you thinking third-party transfers? Our model currently is GET/POST
of file data from/to a host, delivered to the browser by means unknown to the
client. The data movement between HPC and server is pluggable.
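
By pluggable I mean roughly this shape; the class and method names are
placeholders, not anything we've written:

    # Sketch of a pluggable server<->HPC transfer layer; names are placeholders.
    import shutil
    from abc import ABC, abstractmethod

    class TransferBackend(ABC):
        @abstractmethod
        def fetch(self, remote_path: str, local_path: str): ...

        @abstractmethod
        def store(self, local_path: str, remote_path: str): ...

    class LocalCopyBackend(TransferBackend):
        """What the stubbed-out fake center would use: plain local copies."""
        def fetch(self, remote_path, local_path):
            shutil.copy(remote_path, local_path)

        def store(self, local_path, remote_path):
            shutil.copy(local_path, remote_path)

    # A GridFTP, sftp, or GO-backed class would implement the same two methods.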


...one more topic...

At the risk of bloating the set of tasks ahead, I am leaning towards the
notion that task queues may also be a core concept. That gets us into the
wild and woolly space of workflows, but relying on the HPC batch queue
system to delineate a set of steps to be done is failing, IMO, at our site
at least. They don't scale and their latency is too high. There are big
wins for providing assistance to science teams who have 10^6 "things" they
need to do "M at a time" and currently have no great solutions except
writing their own control loops. So while I see a pressing need there, I am
not 100% sure that NEWT/AGAVE etc. is the right place for it.
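
The control loop those teams keep rewriting is roughly the following (a
stdlib-only sketch of "at most M at a time", not a proposal for the API
surface itself):

    # Run a large backlog of independent tasks, at most M at a time.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_backlog(tasks, worker, M=8):
        """tasks: iterable of work items; worker: callable applied to each."""
        results = []
        with ThreadPoolExecutor(max_workers=M) as pool:
            futures = [pool.submit(worker, t) for t in tasks]
            for f in as_completed(futures):
                results.append(f.result())
        return results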



Were you thinking of pulling from an existing workflow project to implement
this? Do you have a preference?


Not a strong preference, but I'm interested in scalability. Image processing
workloads are starting to look more like quant trading. Things like RabbitMQ
and ZeroMQ seem to have this addressed. I would steer clear of complex
workflow solutions and stick to managing steps at scale. As always, keep
looking frequently at "what is the problem we are trying to solve?"
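
For example, ZeroMQ's push/pull pattern already fans steps out to however
many workers you attach; a minimal pyzmq sketch, with arbitrary port and
message shape:

    # Minimal pyzmq push/pull fan-out; ports and message shape are arbitrary.
    import zmq

    def ventilator(work_items, port=5557):
        push = zmq.Context.instance().socket(zmq.PUSH)
        push.bind(f"tcp://*:{port}")
        for item in work_items:
            push.send_json({"step": item})

    def worker(port=5557):
        pull = zmq.Context.instance().socket(zmq.PULL)
        pull.connect(f"tcp://localhost:{port}")
        while True:
            msg = pull.recv_json()
            # ...run the step described by msg...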

AGAVE is all queue-driven on the backend, so if you wanted to reuse the
preprocessing mechanism from our IO and Data services that chains together a
series of data transforms to process data as it passes in and out of our
Data Store, I'm happy to donate the code. While I don't think it's exactly
what you're describing, it gets us part of the way there and it's built off
the Quartz framework, so the technology is well established. It might be a
starting point. Another option might be looking at Airavata, Apache ODE,
Spring, jBPM, etc., given that they already have well-defined mechanisms for
doing this.


I will need to read up on these. Not a strong area for me.

Can we invite some TACC folks out here for a few days of discussion?  We
could go to TACC as well/instead. SC is basically a no-go this year for DOE.

What kind of interoperability are we targeting with these services? Are we
striving for standard near-compliance (given that full probably isn't
reasonable), or usability?


Usable near-compliance? Not sure, really. So far it's mostly been "let's try
to make similar URLs". Part of the motivation for the W3C group is to
broaden that discussion and bring some more web-standards expertise to bear
on the choices we're making. The basic grammar and ordering of the strings
in the URLs we use is one such case.
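
For example, just settling on one ordering would help; these are made-up
candidates, not paths anyone has agreed to:

    GET /api/v1/systems/{system}/jobs/{jobid}    (system first, then resource)
    GET /api/v1/jobs/{system}/{jobid}            (resource first, then system)
    GET /api/v1/files/{system}/{path}            (same question for file access)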

 I think if we can start sharing code and development, if that's what
people want, it will bring more rigor to that sentiment.

-David

Let's keep the conversation going. I am available to chat pretty much
anytime at 510-486-4748 if you have ideas or what I said here is unclear.

Cheers,

David

https://foundation.iplantcollaborative.org/monitor/history
>
>  --
> Rion
> @deardooley
>
> On Oct 2, 2012, at 1:24 PM, David Skinner <deskinner@lbl.gov> wrote:
>
>
>
> On Tue, Oct 2, 2012 at 11:11 AM, Annette Greiner <amgreiner@lbl.gov> wrote:
>
>> Hi folks,
>> To frame the discussion for the October 11 conference call, I've started
>> thinking about how to go about putting together a first draft of a standard
>> API. It seems to me that it would be logical to simply blend the two APIs
>> we currently have, NEWT and the iPlant API (Agave). There's a lot they have
>> in common, though of course they have different terms for things. I would
>> suggest we choose our terms based on four principles:
>> coherence: terms in the API should have grammatical commonality with
>> other terms of similar function in the API
>> clarity: terms should be unambiguous
>> memorability: terms should be easy to associate mentally with their
>> meaning in the API
>> cross-center generalizability: terms should make sense in the context of
>> any HPC center
>>
>>
> Good points. One step toward the last one is to make a fake HPC center
> stubbed out in the software itself. This serves two purposes. 1) you get to
> try the software or develop on your laptop without touching the guts of
> your HPC center. 2) It provides a common meeting ground for all of us as a
> plain vanilla idealization of an HPC center. To be a little more specific I
> am suggesting that auth, data, and job functions should have stub
> implementations that operate locally and while ineffectual they should be
> processed in a way that mimics a real HPC center.
>
> auth: just use an install-time configured password with a test user
> data: just move local files on disk
> jobs: just run the command (fork/exec).
> KVP store: use a couch or mongo local instance.
>
> Once we have that stub implementation down and packaged people can
> download and try the API without herculean efforts.
>
> We'll also need to discuss the scope of the standard API. How much should
>> it cover? Clearly, centers should be free to do their own implementations;
>> we are just defining a set of REST calls that can be re-used across
>> implementations. But what functions should be left out of the standard? I'm
>> thinking here of functions that are not specific to HPC. One example is the
>> iPlant PostIt, which generates disposable URLs. I think that's a great
>> service to offer people, but I would suggest we leave it out of a standard
>> for HPC, since it isn't a function that arises from the HPC context. The
>> iPlant Apps and Profile features strike me similarly. NEWT has a liststore
>> feature that could also be seen as a non-HPC aspect of that API.
>>
>>
> The guiding model for NEWT thus far has been to stick to the core things
> you see in HPC center documentation. How do I log in, how do I move files,
> how do I run things. We don't need to be rigid about that but having a
> guiding principle with a decent level of simplicity seems prudent.
>
> We've also advocated an exception mechanism whereby you can step outside
> the API and do whatever you like. That provides some demarcation as to
> where the API stops and where custom machinery begins.
>
> -David
>
> What do other people think? How should we define what is in/out of the
>> spec?
>> -Annette
>> --
>> Annette Greiner
>> Outreach, Software, and Programming Group
>> NERSC, LBNL
>> amgreiner@lbl.gov
>> 510-495-2935

Received on Wednesday, 3 October 2012 16:20:25 UTC