Re: Question on HTTP 408 from Michael Sweet on 2014-06-04 (ietf-http-wg@w3.org from April to June 2014)

From: Michael Sweet <msweet@apple.com>
Date: Wed, 04 Jun 2014 17:00:52 -0400
To: Willy Tarreau <w@1wt.eu>
Cc: "William Chan (?????????)" <willchan@chromium.org>, HTTP Working Group <ietf-http-wg@w3.org>, Matt Menke <mmenke@chromium.org>
Message-id: <4E1378B3-F2EB-4047-BBD9-819696724138@apple.com>
Willy,

I can go along with a 408 (and only a 408) response being sent by a server right before a connection is torn down due to inactivity.  Document this as "the server MAY send a 408 response prior to closing a connection and a client SHOULD use the presence of a 408 response as an indication that the request should be retried" or something along those lines, with the discussion below distilled to the critical bits.  We should also be clear that this problem is most pronounced when using small inactivity timeouts.


On Jun 4, 2014, at 4:01 PM, Willy Tarreau <w@1wt.eu> wrote:

> Hey William,
> 
> On Wed, Jun 04, 2014 at 12:50:59PM -0700, William Chan (?????????) wrote:
>> Thanks everyone for chiming in. At this point, I think I should make some
>> clarifying points and inject / re-emphasize some personal opinion here. You
>> can probably ignore this email unless you're interested in all the details.
>> 
>> Clarifications:
>> 1) There is fundamentally a race between the client sending a request and
>> the server closing its end of the connection, due to the network latency.
>> 2) If the client does not read from the socket before/during sending a
>> request, then the time gap for the "race" may include not only the network
>> latency, but also the time until the client transmits the request and
>> starts reading from the socket.
>> 3) When the client receives FIN or a socket error when reading from the
>> socket, it usually cannot tell the root cause (server timing out the
>> connection, server application crash, etc).
>> 4) Clients have different heuristics for retrying on such errors. Here's
>> Chromium's (
>> https://code.google.com/p/chromium/codesearch#chromium/src/net/http/http_network_transaction.cc&q=http_network_transaction.cc&sq=package:chromium&l=1479
>> ):
>> 
>> bool HttpNetworkTransaction::ShouldResendRequest() const {  bool
>> connection_is_proven = stream_->IsConnectionReused();  bool
>> has_received_headers = GetResponseHeaders() != NULL;  // NOTE: we
>> resend a request only if we reused a keep-alive connection.  // This
>> automatically prevents an infinite resend loop because we'll run  //
>> out of the cached keep-alive connections eventually.  if
>> (connection_is_proven && !has_received_headers)    return true;
>> return false;}
>> 
>> 
>> As one can see, we opt not to retry on an unused connection
>> (HttpStream::IsConnectionReused() is actually misnamed, it also includes
>> sockets that were ever idle, like our preconnected sockets). This is
>> because the HTTP request may have caused a server error (like the server
>> application to crash) and it would not be nice to re-send that request.
>> However, for idle HTTP connections, there's a reasonable likelihood due to
>> the idleness that we hit one of those races, so we will retry the
>> transaction on a new connection. I guess I should show the situation that
>> leads us to call this function...which is on relevant socket errors (
>> https://code.google.com/p/chromium/codesearch#chromium/src/net/http/http_network_transaction.cc&q=http_network_transaction.cc&sq=package:chromium&l=1411
>> ):
>> 
>> int HttpNetworkTransaction::HandleIOError(int error) {  // Because the
>> peer may request renegotiation with client authentication at  // any
>> time, check and handle client authentication errors.
>> HandleClientAuthError(error);  switch (error) {    // If we try to
>> reuse a connection that the server is in the process of    // closing,
>> we may end up successfully writing out our request (or a    // portion
>> of our request) only to find a connection error when we try to    //
>> read from (or finish writing to) the socket.    case
>> ERR_CONNECTION_RESET:    case ERR_CONNECTION_CLOSED:    case
>> ERR_CONNECTION_ABORTED:    // There can be a race between the socket
>> pool checking checking whether a    // socket is still connected,
>> receiving the FIN, and sending/reading data    // on a reused socket.
>> If we receive the FIN between the connectedness    // check and
>> writing/reading from the socket, we may first learn the socket    //
>> is disconnected when we get a ERR_SOCKET_NOT_CONNECTED.  This will
>> most    // likely happen when trying to retrieve its IP address.    //
>> See http://crbug.com/105824 for more details.    case
>> ERR_SOCKET_NOT_CONNECTED:    // If a socket is closed on its initial
>> request, HttpStreamParser returns    // ERR_EMPTY_RESPONSE. This may
>> still be close/reuse race if the socket was    // preconnected but
>> failed to be used before the server timed it out.    case
>> ERR_EMPTY_RESPONSE:      if (ShouldResendRequest()) {
>> net_log_.AddEventWithNetErrorCode(
>> NetLog::TYPE_HTTP_TRANSACTION_RESTART_AFTER_ERROR, error);
>> ResetConnectionAndRequestForResend();        error = OK;      }
>> break;
>> 
>> 
>> 5) HTTP is typically request/response, so Chromium makes the assumption
>> that it is a server bug if there is data in the socket before we try to use
>> it, and will discard the socket and try to find another to use. However,
>> recently we discovered that for our preconnected sockets and SPDY / HTTP/2,
>> the server always wrote the SETTINGS frame, therefore we weren't leveraging
>> preconnect properly for SPDY / HTTP/2. So we *recently* decided to allow
>> sockets with readable data if the socket hadn't been used yet (
>> https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/client_socket_pool_base.cc&q=client_socket_pool_base.cc&sq=package:chromium&l=651
>> ):
>> 
>> bool ClientSocketPoolBaseHelper
>> <https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/client_socket_pool_base.h&cl=GROK&ct=xref_jump_to_def&l=156&gsn=ClientSocketPoolBaseHelper>::IdleSocket
>> <https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/client_socket_pool_base.h&cl=GROK&ct=xref_jump_to_def&l=339&gsn=IdleSocket>::IsUsable
>> <https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/client_socket_pool_base.cc&ct=xref_usages&gs=cpp:net::internal::class-ClientSocketPoolBaseHelper::class-IdleSocket::IsUsable()-const@chromium/../../net/socket/client_socket_pool_base.cc%257Cdef&l=651&gsn=IsUsable>()
>> const {  if (socket
>> <https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/client_socket_pool_base.h&cl=GROK&ct=xref_jump_to_def&l=357&gsn=socket>->WasEverUsed
>> <https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/stream_socket.h&cl=GROK&ct=xref_jump_to_def&l=76&gsn=WasEverUsed>())
>>   return socket
>> <https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/client_socket_pool_base.h&cl=GROK&ct=xref_jump_to_def&l=357&gsn=socket>->IsConnectedAndIdle
>> <https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/stream_socket.h&cl=GROK&ct=xref_jump_to_def&l=54&gsn=IsConnectedAndIdle>();
>> return socket <https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/client_socket_pool_base.h&cl=GROK&ct=xref_jump_to_def&l=357&gsn=socket>->IsConnected
>> <https://code.google.com/p/chromium/codesearch#chromium/src/net/socket/stream_socket.h&cl=GROK&ct=xref_jump_to_def&l=49&gsn=IsConnected>();}
>> 
>> 
>> Previously, we always checked IsConnectedAndIdle(), but now we just check
>> IsConnected() for unused sockets. One could imagine tightening this check
>> so that we only allowed readable data if the socket was detected to be a
>> SPDY / HTTP/2 connection.
>> 
>> 6) Armed with the previous knowledge, we can see what happens when Chromium
>> did *not* support 408s, and when 408 responses are sent by the server
>> *before* any request is sent by the client. With the recent change to allow
>> using unused sockets that have readable data, we passed on these sockets
>> (with 408 response data) to the HTTP transaction code. Since Chromium did
>> not support 408s, this led to 408s being treated as transaction errors, not
>> socket IO errors, so we did not retry these transactions. Due to the recent
>> change as described in (5), this meant that users saw more errors when
>> browsing websites with servers that sent these 408 responses on idle
>> connection timeouts. Chromium (still unreleased in Google Chrome release
>> channels) is supporting 408s now (
>> https://src.chromium.org/viewvc/chrome?revision=274760&view=revision).
>> 
>> Personal opinion:
>> 1) Sending unprompted 408 responses when gracefully terminating a
>> connection clearly does provide a signal for clients to explicitly know
>> it's safe to retry. Explicit knowledge of retry safety is a good thing.
>> 2) Sending unprompted HTTP responses is...very weird. This is non-obvious
>> to most HTTP implementers. Judging from the previous responses so far on
>> this thread, I think many people agree with this analysis.
>> 
>> That said, I personally think (1) is more important than (2) here. Yes, (2)
>> is ugly. But (1) makes life better for users, so deal with it? In my
>> opinion, the spec should explicitly acknowledge this case for 408s. I don't
>> really care to argue what the original intent of 408 was, I care about
>> coming to a consensus on how we should treat it and making it clear in the
>> spec.
> 
> Thanks for the in-depth analysis of the reasons explaining why there was
> a recent increase in this report, especially when Chrome+pre-connect+haproxy
> were combined.
> 
> I agree with the points you made. I'd go further, do you think we could
> improve the client's ability to safely retry if a 408 was also emitted
> on persistent connections *after* a request/response was already served ?
> Haproxy explicitly disables sending of 408 here and I've not seen a server
> emitting a 408 in this condition either. I've read that Firefox is already
> fine with this, and apparently now Chromium is OK with it. But we need to
> know if we're improving the client's ability to safely reuse the connection
> or if there is no chance to gain anything (eg: due to the vast majority of
> deployed servers not doing it anyway).
> 
> I agree we need to find a consensus and better document these corner cases.
> Most commonly deployed clients, servers and intermediaries are on this list,
> so it should not be too hard to know what to change where :-)
> 
> Best regards,
> Willy
> 
> 

_________________________________________________________
Michael Sweet, Senior Printing System Engineer, PWG Chair
Attachments

application/pkcs7-signature attachment: smime.p7s
Received on Wednesday, 4 June 2014 21:01:25 UTC