livez/readyz HTTP status codes from Andrew Timmes on 2025-04-29 (ietf-http-wg@w3.org from April to June 2025)

From: Andrew Timmes <andrew.timmes@gmail.com>
Date: Tue, 29 Apr 2025 14:18:24 -0400
To: ietf-http-wg@w3.org
Message-ID: <CADf83uBgB5JJY-P=EOw+HifaGj6eVMmLRGjDB=8e244OVs8F3w@mail.gmail.com>

Hi folks,

I work on the reliability of distributed systems, which functionally these
days means working on/with/around Kubernetes. K8s lets you define readiness
and liveness checks (probes) for HTTP services, which should return a 2xx
response code to indicate a ready/live state and an error response
otherwise (the kubelet counts any status from 200 through 399 as success
and anything else as failure).

Most implementations I've seen of livez/readyz checks will either time out
or actively return a 500 or 503 if the service's internal logic has
determined that it hasn't met some prerequisite condition. However,
differentiating between "my liveness check returned a 503 status code
because the service hasn't finished initializing some dependency" and "my
liveness check returned a 503 because there's a proxy in the request path
that can't talk to the upstream service because it's being CPU throttled"
is tricky without digging into additional information (log/response
fields, etc.).
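
To make the distinction concrete, here is a minimal Python sketch of one
workaround available today: the service tags its own "not ready" 503 so
that a client can tell it apart from a 503 generated somewhere else in the
request path. The header name X-Health-Source, the "application" value, the
JSON body shape, and both function names are assumptions made up for
illustration; none of them come from Kubernetes or any HTTP specification.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical header name, invented for this sketch; not part of any standard.
HEALTH_SOURCE_HEADER = "X-Health-Source"


def readiness_response(dependencies_ready: dict[str, bool]) -> tuple[int, dict[str, str], bytes]:
    """Build (status, headers, body) for a /readyz request."""
    pending = sorted(name for name, ok in dependencies_ready.items() if not ok)
    status = 503 if pending else 200
    body = json.dumps({"ready": not pending, "pending": pending}).encode()
    headers = {
        "Content-Type": "application/json",
        # Marks the response as produced by the service itself.
        HEALTH_SOURCE_HEADER: "application",
    }
    return status, headers, body


def classify_failure(status: int, headers: dict[str, str]) -> str:
    """Guess whether a probe response is an active or a passive failure."""
    if 200 <= status < 400:
        return "ok"
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get(HEALTH_SOURCE_HEADER.lower()) == "application":
        return "active"   # the service answered and declared itself not ready
    return "passive"      # no marker: likely a proxy or other intermediary


class ReadyzHandler(BaseHTTPRequestHandler):
    # Hypothetical dependency state; a real service would update this itself.
    dependencies_ready = {"database": False}

    def do_GET(self):
        if self.path != "/readyz":
            self.send_error(404)
            return
        status, headers, body = readiness_response(self.dependencies_ready)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To try it locally (assumed port):
#   from http.server import HTTPServer
#   HTTPServer(("", 8080), ReadyzHandler).serve_forever()
```

This only helps clients that know the convention, and it says nothing about
timeouts, which never produce a response to classify. That gap is the reason
for the question that follows.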

Given the ubiquity of k8s (and the concept of livez/readyz checks even in
non-k8s orchestrators), could there be value in standardizing on specific
HTTP status codes for "Not Ready"/"Not Live" to differentiate between
active/passive failures across client libraries? (I imagine there's a high
bar for adding new codes to the standard; I'd be interested to hear what
the criteria are for what would make a new code worth adding to the spec.)

Long-time reader, first-time caller here - apologies for any breaches in
etiquette or protocol, and thanks in advance for any wisdom y'all can
dispense.
-Andrew Timmes

Received on Thursday, 22 May 2025 07:10:07 UTC