Re: what constitutes an "invalid" content-length

I've written two large-scale web crawlers that processed billions of links.
Since my principle was to err on the side of inclusiveness, I ignored
Content-Length entirely and wrote the code needed to deal with whatever I
got, including the occasional GET that returned an endless stream of
smiley emoji, null bytes, or whatever.
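
For anyone wondering what "deal with whatever I got" can look like, here is a
minimal sketch in Python. It is not the crawler code itself. The function name
read_capped and the max_bytes limit are made up for illustration. The idea is
to read until EOF or a hard cap, whichever comes first, and never consult the
declared Content-Length.

```python
import io


def read_capped(stream, max_bytes, chunk_size=65536):
    """Read from a file-like object until EOF or max_bytes, whichever is first.

    Hypothetical helper: any declared Content-Length is deliberately ignored.
    Returns (data, truncated), where truncated is True if the stream still
    had data left once the cap was reached.
    """
    chunks = []
    total = 0
    while total < max_bytes:
        # Never ask for more than the remaining budget.
        chunk = stream.read(min(chunk_size, max_bytes - total))
        if not chunk:
            # Clean EOF before the cap: the whole body was read.
            return b"".join(chunks), False
        chunks.append(chunk)
        total += len(chunk)
    # Cap reached: peek one byte to see whether the sender had more to say.
    truncated = bool(stream.read(1))
    return b"".join(chunks), truncated


if __name__ == "__main__":
    body, cut = read_capped(io.BytesIO(b"\0" * 5000), max_bytes=1000)
    print(len(body), cut)
```

On a real socket the final one-byte peek can block, so a read timeout would be
needed as well. That part is left out here.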

On Jul 12, 2016 6:36 AM, "Adrien de Croy" <adrien@qbik.com> wrote:

> Hi all
>
> just dealing with a site that sends more payload data than is indicated in
> the Content-Length header.
>
> If the browser connects directly, the page loads fine. If it goes via the
> proxy, the proxy truncates the body to the advertised length and the client
> doesn't display the page (of course the affected file is the .css file).
>
> RFC 7230 sections 3.3.2 (Content-Length), 3.3.3 (Message body length),
> and 3.3.4 (Handling incomplete messages) only contemplate issues around
> Content-Length specifying more bytes than are received, not fewer.
>
> I guess one could argue that a wrong C-L value is "invalid", but it's not
> clear whether "invalid" in this context simply means that the value doesn't
> parse or is otherwise non-compliant with the ABNF.
>
> So, it's not clear what the browser and/or proxy response should be.  If
> we deem a wrong value to be "invalid" (s3.3.3 para 4), a client is supposed
> to discard the response.  This isn't happening.
>
> For the proxy, it only sees that the content length is wrong once it
> receives too many bytes.  By this stage, the horse has bolted so it cannot
> really comply either.
>
> I would expect it's in everyone's best interest if sites that have broken
> framing are forced to be fixed.  This won't happen if browsers "just work"
> for the site.
>
> Is there a special behaviour we should agree on for such cases?
>
> Regards
>
> Adrien de Croy
>

Received on Tuesday, 12 July 2016 21:54:17 UTC