Re: Misc review notes for draft-18 p1 from Willy Tarreau on 2012-01-26 (ietf-http-wg@w3.org from January to March 2012)

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 26 Jan 2012 20:36:20 +0100
To: Julian Reschke <julian.reschke@gmx.de>
Cc: ietf-http-wg@w3.org
Message-ID: <20120126193619.GA13716@1wt.eu>
Hi Julian,

On Thu, Jan 26, 2012 at 07:15:05PM +0100, Julian Reschke wrote:
> >2.1. Client/Server Messaging, page 11
> >
> >>   Note that 1xx responses (Section 7.1 of [Part2]) are not final;
> >>   therefore, a server can send zero or more 1xx responses, followed by
> >>   exactly one final response (with any other status code).
> >
> >This parts falls here quite out of context in my opinion. Neither
> >responses nor status core nor messaging has been defined yet and all
> >of a sudden we get this. I suggest we move this to P2 7.1 and replace
> >it with a small note such as :
> >
> >   Note that sometimes a server may send multiple responses, see Section
> >   7.1 of [Part2] for more details about interim responses.
> 
> We did that totally on purpose, see 
> <http://trac.tools.ietf.org/wg/httpbis/trac/ticket/300>.

OK, but don't you think that moving the same text to its place and having
just an explicit reference as I proposed would satisfy the ticket ? The
text itself is fine, it's just that when I read the beginning of the spec,
it suddenly speaks a language that was not even introduced earlier, which
makes one think this was added late or is a leftover of some old text.


> >2.4. Intermediaries, page 13
> >
> >Context :
> >>            >              >              >              >
> >>       UA =========== A =========== B =========== C =========== O
> >>                  <              <              <              <
> >...
> >
> >>   For example, B might be receiving
> >>   requests from many clients other than A, and/or forwarding requests
> >>   to servers other than C, at the same time that it is handling A's
> >>   request.
> >
> >I'd underline that there is no single path between a UA and an 
> >intermediary,
> >and that sometimes direct and indirect communications are possible. It 
> >helps
> >remind people that rewriting URLs along the path is not always a good idea.
> >I'd suggest this then :
> >
> >     For example, B might be receiving requests from many clients other 
> >     than A
> >     including UA/C/O, and/or forwarding requests to servers other than C, 
> >     at
> >     the same time that it is handling A's request.
> 
> UA I see, but C and O?

Both are common cases when B is a load balancer and C is an accelerator
(eg: cache). Servers are also clients of other servers in many places,
and then they reach them via the LB. For the same reason, some caches
will reach some origin server farms via the LB on misses.

> >2.7.1. http URI scheme
> >
> >>    If the host identifier is provided as an IP literal or IPv4 address,
> >
> >I did not find a clear definition of the term "IP literal". Also, does it
> >cover the bracketed format of IPv6 ?
> 
> I think we need to ref 
> <http://greenbytes.de/tech/webdav/rfc3986.html#rfc.section.3.2.2> here.

Ah perfect, thanks for the link, indeed I guessed right.

(...)
> >>   When a server listening only for HTTP request messages, or processing
> >>   what appears from the start-line to be an HTTP request message,
> >>   receives a sequence of octets that does not match the HTTP-message
> >
> >Wouldn't "does not *exactly* match" be better ? I'm used to find
> >crappy requests in my logs which are blocked but which some not-so-lazy
> >implementations would let pass (eg: multiple SP).
> 
> "match" means "match"; I don't think there's any ambiguity here...

There's no ambiguity, it's just to emphasize on the need to perform
strict matching. A large number of HTTP parsers are much too lazy,
causing nightmares when trying to filter undesired communications,
or even to define new protocol extensions. For instance on my old
Apache 1.3 here :

    $ telnet www 60080
    Connected to www.
    Escape character is '^]'.
    HEAD     /           HTTP/1.1     ergeargoaejgoiejgaoeg
    Host:   ,,,,
    Invalid/header name: blah
    
    HTTP/1.1 200 OK
    Date: Thu, 26 Jan 2012 19:07:02 GMT
    Server: Apache
    Last-Modified: Mon, 01 Jun 2009 16:47:12 GMT
    ETag: "47038-3ad7-46b4c2d81a400"
    Accept-Ranges: bytes
    Content-Length: 15063
    Connection: close
    Content-Type: text/html
    
    Connection closed by foreign host.

"SP" is *one* SP, still multiple SPs are accepted in the request
line. Same for forbidden chars in the header name. And I'm not
specifically targeting Apache here, I just took the first example
I had handy, it's far from being alone. It looks like strchr(),
strtok(), sscanf() or split() depending on the language and
implementation are common ways to parse requests. This is part
of what caused all the mess in the hybi WG, delaying it by one
year trying to find solutions against various implementations.


> >>   grammar aside from the robustness exceptions listed above, the server
> >>   MUST respond with an HTTP/1.1 400 (Bad Request) response.
> >
> >I would also suggest that clients and proxies protect themselves against
> >malformed response messages, which are problematic in shared hosting
> >environments. This could be summarized like this :
> >
> >     In general, any agent which receives a malformed message MUST NOT try
> >     to fix it if there is any possibility that any other implementation
> >     along the chain understands it differently. In such conditions, the
> >     message MUST be rejected.
> 
> -0.5.
> 
> - it's a requirement hard to test for, and
> 
> - it's not going to be implemented by browsers.

I agree on your last point, so maybe we could refine this to target
intermediaries only. The reason is that (as an implementor) I'm
regularly asked to implement ugly tricks to help fix issues that
are caused by application bugs. Among the funny requests "could you
please fix haproxy to ignore multiple empty lines after the headers"
or "it would help me if haproxy could only focus on the first
occurrence of header X or Y and ignore all other ones because
a bug in my application repeats them with empty contents".

I've always rejected such dangerous crap and only accepted to
offer options to temporarily relax some checks but I think that
all the web would benefit from a reminder on these dangerous
practices which may lead to smuggling attacks.


> >4.1. Types of Request Target
> >
> >>Note: The "no rewrite" rule prevents the proxy from changing the
> >
> >I did not find reference to this "no rewrite" rule.
> 
> It's the rule above the note.
>
> -> <http://trac.tools.ietf.org/wg/httpbis/trac/changeset/1517>

Ah OK thanks for the fix.

(...)
> >Rule 3 might be difficult to apply in massively hosted environments, as
> >I easily imagine that there could be a large "vhosts" directory with
> >all the hosts roots presented by their names there. The server would
> >then simply try to "cd $host" to check for the host's validity, which
> >might seem appropriate at first. But using a host of ".." or a host
> >containing a slash would have dramatic effects.
> >
> >I don't know what recommendation we could add here because we can't
> >add boring long sentences, but avoiding such simple traps would be
> >nice. Maybe we should just add :
> >
> >     For instance, a host should never be ".." nor contain a slash.
> 
> Are those allowed in a host name anyway?

Should not but the characters are allowed, so the name at least
matches the basic regex '[a-zA-Z0-9.-]+' that simple implementations
would at least check. I'm not saying I've seen that in field, but I
see how easy it is to get trapped. Perhaps something like this might
be enough :

    Note: multi-homed web servers are encouraged to perform very strict
          checks on the host name syntax and validity, and not simply
          rely on the existence of a directory name to accept it.


> >8.4. TE
> >
> >>   The presence of the keyword "trailers" indicates that the client is
> >>   willing to accept trailer fields in a chunked transfer-coding, as
> >
> >Is it only limited to the client ? Nowhere it's said that a server cannot
> >advertise "TE: trailers" in responses so that a client knows it can emit
> >chunked-encoded messages with trailers in further requests (eg: backups
> >with SHA1 at the end). Replace "client" with "sender" maybe ?
> 
> We seem to be confused about who can set TE anyway:
> 
> "The "TE" header field indicates what extension transfer-codings it is 
> willing to accept in the response, and whether or not it is willing to 
> accept trailer fields in a chunked transfer-coding."
> 
> We need to state who "it" is...

Yes I noted this one too and forgot to report it. How about this :

   A client may use the "TE" header field in a request to indicate what
   extension transfer-codings it is willing to accept in the response, and
   whether or not it is willing to accept trailer fields in a chunked
   transfer-coding.

   The "TE" header's value consists ...

> >A.1.2 Keep-Alive Connections
(...)
> >-   the client ought stop sending the header),
> >+   the client ought stop using this header in further communications with
> >+   the server),
> 
> "...ought to stop using this header field in further ..."?

Perfect.

> >...
> >That's all for me now, I'll probably have other comments later.
> >...
> 
> Thanks a lot for that; I tried to comment where I had some confidence on 
> the resolution.
> 
> We probably need to figure out a way to manage the feedback better; 
> maybe recommend sending smaller chunks with meaningful subject lines, so 
> threading works properly?

That's what I intended to do first, because I only had minor issues
that could be grouped together, but I'm talking too much. I was planning
on reporting complex ones in individual mails as well as at least split
the mails by part.

Probably that in a few days I'll resend the parts without any response
in individual mails.

Thanks,
Willy
Received on Thursday, 26 January 2012 19:37:01 UTC