Question about characters in HTTP headers

In RFC 2616, we have:

       CHAR           = <any US-ASCII character (octets 0 - 127)>
       TEXT           = <any OCTET except CTLs, but including LWS>
       quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
       qdtext         = <any TEXT except <">>
       quoted-pair    = "\" CHAR
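
To make the quoted rules concrete, here is a minimal Python sketch of the
octet classes.  The function names are invented for this illustration.
The CTL definition (octets 0 - 31 and DEL) is taken from RFC 2616
section 2.2, and LWS is approximated per octet as SP or HT, so line
folding is not modelled.

```python
# Octet classes from the RFC 2616 rules quoted above.
# All helper names here are hypothetical, chosen for this sketch only.

def is_char(o):
    # CHAR: any US-ASCII character (octets 0 - 127)
    return 0 <= o <= 127

def is_ctl(o):
    # CTL: octets 0 - 31 and DEL (127), per RFC 2616 section 2.2
    return 0 <= o <= 31 or o == 127

def is_text(o):
    # TEXT: any OCTET except CTLs, but including LWS.
    # Assumption: LWS is approximated as HT (9) or SP (32) per octet.
    return 0 <= o <= 255 and (not is_ctl(o) or o == 9)

def is_qdtext(o):
    # qdtext: any TEXT except the double quote (octet 34)
    return is_text(o) and o != 34

def valid_after_backslash(o):
    # quoted-pair is a backslash followed by CHAR,
    # so only octets 0 - 127 may follow the backslash
    return is_char(o)
```

Under these assumptions, octet 0xE9 passes is_qdtext() but fails
valid_after_backslash(), while octet 0x01 does the opposite.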

I have three questions:

1.  This leads to the curious observation that octets 128 - 255 are
    _valid_ in comments, text, quoted strings and so forth.  But they are
    _not valid_ after "\" inside a quoted-string.  (They are valid after
    "\" inside comments!)

    Is this intentional, that octets 128 - 255 are allowed in text,
    including inside quoted-string, and allowed after "\" in comments
    but not in quoted-string?

2. Control chars (those in CTL) are permitted by the syntax after "\"
   in quoted-string.  It seems odd to allow control chars in the
   headers at all.  (It's even odder to allow ASCII control chars but
   refuse octets 128 - 255 after "\" in quoted-string.)  Is this intentional?

3. Although other ASCII control chars are permitted after "\", a lone
   CR is not allowed.  HTTP client/server code I have looked at in detail
   (Apache, Squid, Mozilla) accepts lone CRs and treats them as LWS
   in many contexts, albeit inconsistently.

   Would it not make sense to formalise this, even if it's just in the
   "Tolerant Applications" section?  Then the rule for accepting LF
   without CR could be simplified: tolerant applications might treat
   LF as the line terminator, and CR as equivalent to LWS (some real ones
   do that).

   Alternatively it could be made a SHOULD or even MUST that programs
   reject lone CRs, because of security implications: some proxies treat

      "Authorization" <CR> ":"

   as a header different from Authorization, and don't apply the rules
   for proxies when this header is present, yet pass it on to origin
   servers which then (non-compliantly) interpret the header as
   equivalent to Authorization.  It would be good to indicate that
   programs should not accept messages containing embedded CRs like that.

   Rejection is already implied by the grammar, yet every program I looked at
   accepts embedded lone CRs without complaint, and may or may not
   treat them as LWS in various contexts.  Apache is interesting in
   that it treats CR as LWS-equivalent nearly everywhere, but not
   between the header name and ":", where it only allows SP and HT.
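
The two options in question 3 can be sketched in Python as follows.  This
is only an illustration, not code from Apache, Squid or Mozilla: the
function names, the sample host and the sample credentials are made up.
The separator set is the one RFC 2616 section 2.2 gives for "token".

```python
CR = 13

# separators from RFC 2616 section 2.2, including SP and HT
SEPARATORS = b'()<>@,;:\\"/[]?={} \t'

def has_lone_cr(head):
    # Strict option: report any CR that is not immediately followed by LF.
    for i, octet in enumerate(head):
        if octet == CR and head[i + 1:i + 2] != b"\n":
            return True
    return False

def is_token(name):
    # token: one or more CHARs, excluding CTLs and separators
    return len(name) > 0 and all(
        32 < o < 127 and o not in SEPARATORS for o in name
    )

def split_lines_tolerant(head):
    # Tolerant option: LF alone ends a line; the CR of a CRLF pair is
    # dropped, and any other CR is treated like SP (LWS-equivalent).
    lines = head.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final LF
    return [line.rstrip(b"\r").replace(b"\r", b" ") for line in lines]

# Made-up message head with a lone CR between field name and ":"
raw = b"Host: example.org\r\nAuthorization\r: Basic abc\r\n"
```

Here has_lone_cr(raw) is true, so a strict implementation would reject
the message, and is_token(b"Authorization\r") is false.  The tolerant
splitter instead turns the second line into "Authorization : Basic abc",
and it is then up to the caller whether to accept the space before ":".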

-- Jamie

Received on Monday, 15 March 2004 12:42:23 UTC