Re: Appropriate use of HTTP status codes for application health checks from Willy Tarreau on 2017-02-27 (ietf-http-wg@w3.org from January to March 2017)

From: Willy Tarreau <w@1wt.eu>
Date: Mon, 27 Feb 2017 07:19:37 +0100
To: Amos Jeffries <squid3@treenet.co.nz>
Cc: ietf-http-wg@w3.org
Message-ID: <20170227061937.GA5797@1wt.eu>

On Mon, Feb 27, 2017 at 05:38:49PM +1300, Amos Jeffries wrote:
> On 23/02/2017 11:24 p.m., Willy Tarreau wrote:
> > Hi Amos,
> > 
> > On Thu, Feb 23, 2017 at 10:53:07PM +1300, Amos Jeffries wrote:
> >> IMHO a more efficient way for a polling system is to use 204 as "All
> >> okay", and 200 as "some problem(s)". No bandwidth wasted with payload on
> >> the common Up status, and ability to deliver details about the outage on
> >> the Down status.
> > 
> > In fact it's common to see health check applications return 5xx for a
> > very simple reason, the front equipment performing the check (often a
> > load balancer) has to deal with these situations anyway, and most use
> > cases just want to return "completely up" or "completely dead". But I
> > agree that when you want to support the gray area in between, it's much
> > better to support intermediary codes. FWIW haproxy also supports a
> > special case of 404 to mean "closing soon, no more requests please" so
> > that admins can simply touch/rm a file in a docroot. That's just to say
> > that there are many valid use cases and that common sense adapted to what
> > components *reliably* support is often the best here.
> > 
> 
> For an individual health-check you are right. But that is not the
> use-case Matt has.
> 
> The use-case in question is for the response coming from some aggregator
> process, which uses health-checks as its input/data. One status code
> summarizing the situation of N endpoints.  No 4xx or 5xx is going to be
> adequate for that, simply because of what the 400 and 500 defaults mean
> to the general HTTP ecosystem.

I totally get your point but I see a big difference between what would
be perfect and what components can do. For example for over a decade
haproxy was not able to consider anything but a status code, and because
of this there have been many people who implemented 500 as a response to
aggregated tests just for this (now it's more flexible). And I've had to
deal with other products which could only use this as well.

Also, even for an aggregated test, you may end up with real 5xx errors
because of timeouts or failure to deal with unexpected responses, so
the LB still has to deal with this case normally.

So I'd summarize it like this when seen from the front component:

   - 200 => status is OK
   - <something> => status is faulty (partially or totally)
   - 5xx => a technical error appeared during the processing

Given that 5xx has to be dealt with anyway, if there is no need for a
clear distinction between a failure in the health check component and a
faulty test, 5xx will work fine. If a distinction is needed (e.g. all
responses are logged), then something else would be better (including
some 2xx as you proposed).
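
The three-way mapping above can be sketched from the front component's
side roughly as follows. This is only an illustration, not how haproxy or
any other product implements it: the names Health and classify, the
faulty_codes parameter (standing in for the "<something>" code, which the
summary leaves open) and the distinguish flag are all hypothetical.
Treating any other unexpected status as a checker failure is likewise an
assumption.

```python
from enum import Enum
from typing import AbstractSet


class Health(Enum):
    OK = "ok"                    # 200: status is OK
    FAULTY = "faulty"            # agreed code: partially or totally faulty
    CHECK_ERROR = "check_error"  # 5xx: the check itself failed


def classify(status: int,
             faulty_codes: AbstractSet[int],
             distinguish: bool = True) -> Health:
    """Map a health-check status code to a state.

    faulty_codes is whatever the deployment agreed on for "faulty"
    (hypothetical; no particular value is prescribed here).
    With distinguish=False, a checker failure is folded into FAULTY,
    which is the "5xx will work fine" case.
    """
    if status == 200:
        return Health.OK
    if status in faulty_codes:
        return Health.FAULTY
    # 5xx, or anything unexpected (assumption): treated as a failure of
    # the health check component rather than of the tested service.
    return Health.CHECK_ERROR if distinguish else Health.FAULTY
```

For instance, with an arbitrarily chosen faulty code of 202,
classify(503, {202}) yields Health.CHECK_ERROR, while
classify(503, {202}, distinguish=False) yields Health.FAULTY.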

Cheers,
willy

Received on Monday, 27 February 2017 06:20:48 UTC