- From: Willy Tarreau <w@1wt.eu>
- Date: Fri, 23 May 2025 08:27:36 +0200
- To: Melvin Carvalho <melvincarvalho@gmail.com>
- Cc: Demi Marie Obenour <demiobenour@gmail.com>, Watson Ladd <watsonbladd@gmail.com>, ietf-http-wg@w3.org
On Fri, May 23, 2025 at 04:34:32AM +0200, Melvin Carvalho wrote:

> > That said, I do think there is a significant problem with building proof
> > of work into HTTP, and that is that while it forces clients to consume
> > additional CPU time, it does *not* force them to run a full-fledged
> > browser. Scrapers strongly prefer to *not* run full browsers, as running
> > a full browser significantly increases memory requirements. See
> > https://old.reddit.com/r/selfhosted/comments/1jy6mug/fail2ban_400_sendmail_blocks_in_12_hours/mmxxd1v/
> > for where I got this information.
>
> It might look like this for a stateless version (as opposed to
> challenge/response):
>
> Header: X-Hashcash
>
> Fields:
>
> - ver: Hashcash version (currently "1")
> - ts: Timestamp truncated to the current minute (UTC ISO8601)
> - method: HTTP method (e.g., GET)
> - path: Request target path (e.g., /api/data)
> - origin: Hostname of the server (e.g., example.com)
> - nonce: Random client-generated value
> - suffix: Value incremented by the client to meet PoW difficulty
> - algo: Hash algorithm (e.g., sha256)
>
> 4.2. Example Header
>
> X-Hashcash: ver=1; ts=20250523T042200Z; method=GET;
>             path=/api/data; origin=example.com;
>             nonce=abc123; suffix=42829; algo=sha256
>
> 5. Server Validation
>
> The server MUST:
>
> 1. Reconstruct the string:
>    "ver=1;ts=...;method=...;path=...;origin=...;nonce=...;suffix=..."
> 2. Hash it using the declared algorithm.
> 3. Check that the hash has the required number of leading zero bits (e.g., 22).
> 4. Ensure ts is within a reasonable clock skew (e.g., ±2 minutes).

I've been involved in doing comparable things to protect web sites in the
past. My observations were:

- Not all requests are suitable for delivering preliminary work to be done;
  some browsers just send "Accept: */*" for any request, so you don't know
  whether they expect to download an image, a CSS file, etc., and it's
  difficult to reliably submit some work. If done on any request using only
  a header, that might possibly be different.
- There are huge discrepancies in processing power between clients. I've
  seen some that were processing in 5 ms what took half a second for
  another one, thus significantly limiting either the site's accessibility
  or the level of protection.
- In order to address the point above, using timestamps to make sure the
  response doesn't come back too fast is necessary to increase the
  difficulty. Then you realize that the client could simply wait before
  responding and keep a low difficulty, and that in the end delaying
  processing could be sufficient in many cases (it still allows the same
  client machine to be used for other sites in parallel, though).
- Targeted attacks are not rare at all. You imagine that your locally
  written, randomly obfuscated code is safe against industrialisation, but
  no: some attackers use various JS engines that are just as capable as
  browsers, and you can see in the logs that they're definitely attacking
  your protection, so this will continue to be done.
- Once the challenge is cracked, it can easily be reused by other bots in
  parallel until it triggers your rate-limiting protection, but that
  doesn't stop scrapers, who don't need to steal everything at once; it's
  only useful against brute force.
- Some attacks came from bots running in real browsers.
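For concreteness, here is a minimal sketch in Python of how the quoted scheme could be exercised. It is not part of the proposal: the `mine()` and `validate()` helpers, the header parsing, and the use of the 22-bit difficulty and ±2 minute skew as defaults are illustrative assumptions drawn from the text above.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

DIFFICULTY_BITS = 22             # example difficulty from the quoted text
MAX_SKEW = timedelta(minutes=2)  # example clock skew from the quoted text


def _base_string(ver, ts, method, path, origin, nonce, suffix):
    # Field order and ';' separator follow the string layout quoted in section 5.
    return f"ver={ver};ts={ts};method={method};path={path};origin={origin};nonce={nonce};suffix={suffix}"


def _leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits


def mine(method, path, origin, algo="sha256"):
    """Client side: increment 'suffix' until the hash meets the difficulty.
    At 22 bits this averages roughly 4 million hashes, which is slow in pure Python."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M00Z")  # truncated to the minute
    nonce = secrets.token_hex(8)
    suffix = 0
    while True:
        s = _base_string("1", ts, method, path, origin, nonce, suffix)
        if _leading_zero_bits(hashlib.new(algo, s.encode()).digest()) >= DIFFICULTY_BITS:
            return (f"ver=1; ts={ts}; method={method}; path={path}; "
                    f"origin={origin}; nonce={nonce}; suffix={suffix}; algo={algo}")
        suffix += 1


def validate(header_value, method, path, origin):
    """Server side: reconstruct the string, re-hash it, check difficulty and skew."""
    fields = dict(item.strip().split("=", 1) for item in header_value.split(";"))
    ts = datetime.strptime(fields["ts"], "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    if abs(datetime.now(timezone.utc) - ts) > MAX_SKEW:
        return False
    if (fields["method"], fields["path"], fields["origin"]) != (method, path, origin):
        return False
    s = _base_string(fields["ver"], fields["ts"], method, path, origin,
                     fields["nonce"], fields["suffix"])
    return _leading_zero_bits(hashlib.new(fields["algo"], s.encode()).digest()) >= DIFFICULTY_BITS


if __name__ == "__main__":
    hdr = mine("GET", "/api/data", "example.com")
    print(hdr, validate(hdr, "GET", "/api/data", "example.com"))
```

The same search loop runs orders of magnitude faster in optimized native code than on a constrained client, which is exactly the discrepancy described in the second observation above.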
Based on these points, one of my concerns is that standardizing such a
mechanism at the HTTP level would be an even higher incentive to build
services whose purpose is to offload this processing to infected devices
and browsers across the world, and very likely this would even be
implemented in certain proxies to offload it from the browser itself
(i.e. "use my access point and do not discharge your battery").

Thus I tend to think that instead it could be useful to document what works
and what doesn't work, so that everyone implements different solutions,
because *these* cost more to implement. But nowadays CPU and network are
almost free for attackers (OK, they rent them, but prices are low enough
that many attacks already exist).

I agree with Demi Marie that inflicting *some* cost can be useful, but as
soon as the cost is high enough, it blocks legitimate users and it's
someone else who starts to pay for the CPU.

I've found that other methods were sometimes more effective, such as always
failing after a delay with no indication of blocking for brute-force
attacks, and randomly returning junk to web scrapers that is difficult to
differentiate from valid content (e.g. old content, ads, text made of
swapped words, etc.). *This* does add a significant cost to the attack
because it quickly requires human intervention.

Just my two cents,
Willy
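As a purely illustrative sketch of the "fail slowly and return junk" idea mentioned above (not something from this mail; the scraper heuristic and the decoy generator are placeholders), a toy stdlib server might look like this:

```python
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder vocabulary for decoy pages; a real setup would recycle old or templated content.
DECOY_WORDS = ["archive", "catalog", "notice", "update", "summary", "release", "bulletin"]


def looks_like_scraper(handler):
    # Placeholder heuristic; real deployments would rely on rates, reputation, fingerprints, etc.
    return "python-requests" in handler.headers.get("User-Agent", "").lower()


def decoy_page():
    # Plausible-looking but worthless content: shuffled words, served with a normal 200.
    return "<html><body><p>" + " ".join(random.choices(DECOY_WORDS, k=200)) + "</p></body></html>"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if looks_like_scraper(self):
            time.sleep(random.uniform(2.0, 5.0))  # fail slowly, never signal blocking
            body = decoy_page().encode()
        else:
            body = b"<html><body>real content</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

The response is indistinguishable from a slow but successful request, so deciding that the content is worthless requires a human to look at it, which is where the cost to the attacker comes from.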
Received on Friday, 23 May 2025 06:27:44 UTC