Re: livez/readyz HTTP status codes from Lisa Dusseault on 2025-05-22 (ietf-http-wg@w3.org from April to June 2025)

From: Lisa Dusseault <lisa.dusseault@gmail.com>
Date: Thu, 22 May 2025 09:19:59 -0700
To: Andrew Timmes <andrew.timmes@gmail.com>
Cc: ietf-http-wg@w3.org
Message-ID: <CAEi+uC68Svbgi8Yr3kB+6fxnv73kH9b_KF-y5rKts6j7W09OtQ@mail.gmail.com>

Using an error response body is an extensible way of doing this:


   >>Response

      HTTP/1.1 423 Locked
      Content-Type: application/xml; charset="utf-8"
      Content-Length: xxxx

      <?xml version="1.0" encoding="utf-8" ?>
      <D:error xmlns:D="DAV:">
        <D:lock-token-submitted>
          <D:href>/workspace/webdav/</D:href>
        </D:lock-token-submitted>
      </D:error>
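
To show how a client might act on a body like the one above, here is a minimal Python sketch that pulls the condition element and its hrefs out of a DAV:error document. The function name parse_dav_error and the SAMPLE constant are made up for illustration; only the standard library's xml.etree.ElementTree is used.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"

# The response body from the example above, as bytes off the wire.
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8" ?>\n'
    b'<D:error xmlns:D="DAV:">\n'
    b'  <D:lock-token-submitted>\n'
    b'    <D:href>/workspace/webdav/</D:href>\n'
    b'  </D:lock-token-submitted>\n'
    b'</D:error>\n'
)


def parse_dav_error(body: bytes):
    """Return a list of (condition_name, hrefs) pairs from a DAV:error body.

    Hypothetical helper: each direct child of DAV:error is treated as one
    error condition, and any DAV:href elements beneath it are collected.
    """
    root = ET.fromstring(body)
    if root.tag != f"{{{DAV_NS}}}error":
        raise ValueError("not a DAV:error document")
    conditions = []
    for child in root:
        # ElementTree tags look like "{DAV:}lock-token-submitted";
        # keep only the local name after the namespace.
        name = child.tag.split("}", 1)[-1]
        hrefs = [h.text for h in child.iter(f"{{{DAV_NS}}}href")]
        conditions.append((name, hrefs))
    return conditions


if __name__ == "__main__":
    print(parse_dav_error(SAMPLE))
```

The status code stays a coarse signal, and the element name carries the specific condition, so new conditions can be added without new status codes.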


I've also deployed many Web service APIs that return an "error" dict
as JSON in the body of a 400, 403, or 500 response.  They're really
great for debugging (on the service side or on the API user side).
This approach works with or without standardizing certain error
identifiers.


Lisa


On Thu, May 22, 2025 at 12:14 AM Andrew Timmes <andrew.timmes@gmail.com>
wrote:

> Hi folks,
>
> I work on the reliability of distributed systems, which functionally these
> days means working on/with/around Kubernetes. K8s microservices allow you
> to define readiness and liveness checks for HTTP services, which should
> return a 2xx response code to indicate a ready/live state and a non-2xx
> response otherwise.
>
> Most implementations I've seen of livez/readyz checks will either time
> out, or actively return either a 500 or 503 if a service's internal logic
> has determined that a service hasn't met some prerequisite condition.
> However, differentiating between "my liveness check returned a 503 status
> code because it hasn't finished initializing some dependency" vs. "my
> liveness check returned a 503 because there's a proxy in the request path
> that can't talk to the upstream service because it's being CPU throttled"
> is tricky to determine without digging into additional information
> (log/response fields, etc).
>
> Given the ubiquity of k8s (and the concept of livez/readyz checks even in
> non-k8s orchestrators), could there be value in standardizing on specific
> HTTP status codes for "Not Ready"/"Not Live" to differentiate between
> active/passive failures across client libraries? (I imagine there's a high
> bar for adding new codes to the standard; I'd be interested to hear what
> the criteria are for what would make a new code worth adding to the spec.)
>
> Long-time reader, first-time caller here - apologies for any breaches in
> etiquette or protocol, and thanks in advance for any wisdom y'all can
> dispense.
> -Andrew Timmes
>

Received on Thursday, 22 May 2025 16:20:15 UTC