- From: Andrew Timmes <andrew.timmes@gmail.com>
- Date: Tue, 29 Apr 2025 14:18:24 -0400
- To: ietf-http-wg@w3.org
- Message-ID: <CADf83uBgB5JJY-P=EOw+HifaGj6eVMmLRGjDB=8e244OVs8F3w@mail.gmail.com>
Hi folks,

I work on the reliability of distributed systems, which these days functionally means working on/with/around Kubernetes. Kubernetes lets you define readiness and liveness checks for HTTP services, which should return a 2xx response code to indicate a ready/live state and an error response otherwise (Kubernetes itself treats any status from 200 through 399 as success). Most implementations I've seen of livez/readyz checks will either time out or actively return a 500 or 503 if the service's internal logic has determined that it hasn't met some prerequisite condition.

However, telling "my liveness check returned a 503 because it hasn't finished initializing some dependency" apart from "my liveness check returned a 503 because there's a proxy in the request path that can't talk to the upstream service because it's being CPU throttled" is tricky without digging into additional information (log/response fields, etc.).

Given the ubiquity of k8s (and the concept of livez/readyz checks even in non-k8s orchestrators), could there be value in standardizing on specific HTTP status codes for "Not Ready"/"Not Live", so that client libraries can differentiate between active and passive failures? (I imagine there's a high bar for adding new codes to the standard; I'd be interested to hear what criteria would make a new code worth adding to the spec.)

Long-time reader, first-time caller here - apologies for any breaches in etiquette or protocol, and thanks in advance for any wisdom y'all can dispense.

-Andrew Timmes
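P.S. To make the ambiguity concrete, here is a minimal sketch of the kind of readyz handler described above, using only the Python standard library. The `STATE` dictionary, the `readyz_response` helper, the `/readyz` path, and the `"source"` response field are all hypothetical names chosen for illustration, not part of Kubernetes or any standard. The point is that the 503 below is an active, application-level answer, yet on the wire it carries the same status code a proxy might send when it can't reach the upstream.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state; a real service would track its own dependencies.
STATE = {"dependencies_ready": False}


def readyz_response(state):
    """Return (status_code, body) for a readiness check."""
    if state["dependencies_ready"]:
        return 200, {"status": "ready"}
    # Active failure: the service itself decided it is not ready.
    # A proxy that cannot reach the upstream may answer 503 as well,
    # so the status code alone does not say who produced it. The
    # "source" field is an illustrative assumption, not a standard.
    return 503, {"status": "not ready", "source": "application"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/readyz":
            self.send_error(404)
            return
        code, body = readyz_response(STATE)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8080):
    """Run the health endpoint until interrupted."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. A probe that only looks at the status code sees a 503 either way; only a client that parses the body (or the logs) can tell the application's answer from a proxy's.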
Received on Thursday, 22 May 2025 07:10:07 UTC