Re: Appropriate use of HTTP status codes for application health checks

On 23/02/2017 1:54 p.m., matt wrote:
> Hello,
>
> My colleagues and I are involved in a debate about the proper usage
> of HTTP return codes for application health pages.
>
> For instance, you have a /health page that returns JSON listing your
> application’s dependencies as either “Up” or “Down”
> 

The action being performed as far as HTTP semantics are concerned is not
a health check - it is simply "fetch".

As such, the status code refers to the "JSON file" thing, not its
contents. HTTP does not care what that "JSON file" thing means to your
application; it's just some opaque bytes to be located and delivered.


> 
> Some suggest that it is acceptable for your /health page to return an
> unassigned 5xx or 503 if the /health page returns successfully, but

HTTP defines that any 5xx code a client does not recognize is to be
treated as equivalent to the 500 status.
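
As a rough illustration of that rule (RFC 7231 section 6: an unrecognized
status code is treated as the x00 code of its class), here is a minimal
Python sketch. The function name `effective_status` is made up for this
example, and "recognized" is approximated by whatever Python's
`http.HTTPStatus` enum happens to list; a real client has its own table.

```python
from http import HTTPStatus


def effective_status(code: int) -> int:
    """Return the status a generic client would act on for `code`."""
    try:
        HTTPStatus(code)  # raises ValueError if the code is not in the enum
        return code
    except ValueError:
        # Unknown code: fall back to the x00 code of the same class.
        return (code // 100) * 100
```

So an unassigned code such as 599 ends up handled exactly like a 500,
while a code the client knows, such as 503, keeps its own meaning.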

For HTTP agents outside your application, a 5xx status usually means a
retry with a different server is required, until a 2xx/3xx status is
found or no alternative servers can be identified.

So a 5xx status "working" in terms of your application health check is
actually an exceptional event depending on a narrow set of circumstances:
 no other server/IPs available, and
 no alternative routes across the network to reach it.
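
To make the retry behaviour concrete, here is a small Python sketch of
what such an agent might do. It is only an assumption about a typical
failover client, not any particular product; `first_usable_server` and
the `fetch_status` callback are hypothetical names for this example.

```python
from typing import Callable, Iterable, Optional


def first_usable_server(
    servers: Iterable[str],
    fetch_status: Callable[[str], int],
) -> Optional[str]:
    """Try each server in turn; return the first one answering 2xx/3xx."""
    for server in servers:
        try:
            status = fetch_status(server)
        except OSError:
            # Unreachable server: treat it like a failure and move on.
            continue
        if 200 <= status < 400:
            return server
        # Any other status, e.g. a 503 used to signal "Down": keep looking.
    return None
```

With a client like this in the path, the JSON body sent along with a 503
is never looked at unless that server is the only candidate left.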


> the page results indicate the application is not healthy. Spring Boot
> <https://github.com/spring-projects/spring-boot/wiki/Spring-Boot-1.1-Release-Notes#healthindicators>
> has done this. Although I have reservations about 503 since your
> request for the page was handled successfully.
> 

Rightly so.


> Others contend that your /health page should always return a 200
> regardless of whether the page result is indicative of application
> health or not.
> 
> As a layman I can see the argument for both sides, and it seems both
> practices have been used in the past. I perused the RFCs but I don’t
> feel like I found the ‘silver bullet’ answer on this.
> 

The "always 200" rule is a bit strict. All 2xx and 3xx statuses mean a
successful *fetch*, with various grades of meaning to that success.


IMHO a more efficient way for a polling system is to use 204 as "All
okay", and 200 as "some problem(s)". No bandwidth is wasted on a payload
for the common Up status, and the Down status can still deliver details
about the outage.
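
A minimal Python sketch of that 204/200 scheme, framework-free so it can
be dropped into whatever server you use. The names `health_response` and
`is_healthy`, and the shape of the `dependencies` mapping (name to "Up" or
"Down", as in the question), are assumptions made for this example.

```python
import json
from typing import Dict, Optional, Tuple


def health_response(dependencies: Dict[str, str]) -> Tuple[int, bytes]:
    """Build (status, body) for /health under the 204/200 scheme."""
    down = {name: state for name, state in dependencies.items() if state != "Up"}
    if not down:
        return 204, b""  # all okay: no payload at all
    # Some problem(s): successful fetch, body lists what is not Up.
    return 200, json.dumps(down).encode("utf-8")


def is_healthy(status: int) -> Optional[bool]:
    """Poller side: True for 204, False for 200, None for anything else."""
    if status == 204:
        return True
    if status == 200:
        return False
    return None  # the fetch itself did not go as expected
```

The poller only needs the status line to decide Up or Down, and reads
the body only when it gets a 200.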

Amos

Received on Thursday, 23 February 2017 09:53:55 UTC