- From: Willy Tarreau <w@1wt.eu>
- Date: Fri, 23 May 2025 08:27:36 +0200
- To: Melvin Carvalho <melvincarvalho@gmail.com>
- Cc: Demi Marie Obenour <demiobenour@gmail.com>, Watson Ladd <watsonbladd@gmail.com>, ietf-http-wg@w3.org
On Fri, May 23, 2025 at 04:34:32AM +0200, Melvin Carvalho wrote:

> > That said, I do think there is a significant problem with building proof
> > of work into HTTP, and that is that while it forces clients to consume
> > additional CPU time, it does *not* force them to run a full-fledged
> > browser. Scrapers strongly prefer to *not* run full browsers, as running
> > a full browser significantly increases memory requirements. See
> > https://old.reddit.com/r/selfhosted/comments/1jy6mug/fail2ban_400_sendmail_blocks_in_12_hours/mmxxd1v/
> > for where I got this information.
>
> It might look like this for a stateless version (as opposed to
> challenge/response):
>
> Header: X-Hashcash
>
> Fields:
>
> - ver: Hashcash version (currently "1")
> - ts: Timestamp truncated to the current minute (UTC ISO8601)
> - method: HTTP method (e.g., GET)
> - path: Request target path (e.g., /api/data)
> - origin: Hostname of the server (e.g., example.com)
> - nonce: Random client-generated value
> - suffix: Value incremented by the client to meet PoW difficulty
> - algo: Hash algorithm (e.g., sha256)
>
> 4.2. Example Header
>
> X-Hashcash: ver=1; ts=20250523T042200Z; method=GET;
>             path=/api/data; origin=example.com;
>             nonce=abc123; suffix=42829; algo=sha256
>
> 5. Server Validation
>
> The server MUST:
>
> 1. Reconstruct the string:
>    "ver=1;ts=...;method=...;path=...;origin=...;nonce=...;suffix=..."
> 2. Hash it using the declared algorithm.
> 3. Check that the hash has the required number of leading zero bits (e.g., 22).
> 4. Ensure ts is within a reasonable clock skew (e.g., ±2 minutes).

I've been involved in doing comparable things to protect web sites in the
past. My observations were:

- Not all requests are suitable for delivering preliminary work to be done;
  some browsers just send "Accept: */*" for any request, so you don't know
  whether they expect to download an image, a CSS file, etc., and it's
  difficult to reliably submit some work. If done on any request using only
  a header, that might possibly be different.
- There are huge discrepancies in processing power between clients. I've
  seen some that were processing in 5 ms what took half a second for
  another one, thus significantly limiting either the site's accessibility
  or the level of protection.
- In order to address the point above, using timestamps to make sure the
  response doesn't come back too fast is necessary to increase the
  difficulty. Then you realize that the client could simply wait before
  responding and keep a low difficulty, and that in the end delaying
  processing could be sufficient in many cases (it still allows the same
  client machine to be used for other sites in parallel, though).
- Targeted attacks are not rare at all. You imagine that your locally
  written, randomly obfuscated code is safe against industrialisation, but
  no: some attackers use various JS engines that are just as capable as
  browsers, and you can see in the logs that they're definitely attacking
  your protection, so this will continue to be done.
- Once the challenge is cracked, it can easily be reused by other bots in
  parallel until it triggers your rate-limiting protection, but that
  doesn't stop scrapers, who don't need to steal everything at once; it's
  only useful against brute force.
- Some attacks came from bots running in real browsers.
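For concreteness, here is a minimal sketch in Python of how the quoted scheme could be exercised. It is not part of the proposal: the `mine()` and `validate()` helpers, the header parsing, and the use of the 22-bit difficulty and ±2 minute skew as defaults are illustrative assumptions drawn from the text above.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

DIFFICULTY_BITS = 22             # example difficulty from the quoted text
MAX_SKEW = timedelta(minutes=2)  # example clock skew from the quoted text


def _base_string(ver, ts, method, path, origin, nonce, suffix):
    # Field order and ';' separator follow the string layout quoted in section 5.
    return f"ver={ver};ts={ts};method={method};path={path};origin={origin};nonce={nonce};suffix={suffix}"


def _leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits


def mine(method, path, origin, algo="sha256"):
    """Client side: increment 'suffix' until the hash meets the difficulty.
    At 22 bits this averages roughly 4 million hashes, which is slow in pure Python."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M00Z")  # truncated to the minute
    nonce = secrets.token_hex(8)
    suffix = 0
    while True:
        s = _base_string("1", ts, method, path, origin, nonce, suffix)
        if _leading_zero_bits(hashlib.new(algo, s.encode()).digest()) >= DIFFICULTY_BITS:
            return (f"ver=1; ts={ts}; method={method}; path={path}; "
                    f"origin={origin}; nonce={nonce}; suffix={suffix}; algo={algo}")
        suffix += 1


def validate(header_value, method, path, origin):
    """Server side: reconstruct the string, re-hash it, check difficulty and skew."""
    fields = dict(item.strip().split("=", 1) for item in header_value.split(";"))
    ts = datetime.strptime(fields["ts"], "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    if abs(datetime.now(timezone.utc) - ts) > MAX_SKEW:
        return False
    if (fields["method"], fields["path"], fields["origin"]) != (method, path, origin):
        return False
    s = _base_string(fields["ver"], fields["ts"], method, path, origin,
                     fields["nonce"], fields["suffix"])
    return _leading_zero_bits(hashlib.new(fields["algo"], s.encode()).digest()) >= DIFFICULTY_BITS


if __name__ == "__main__":
    hdr = mine("GET", "/api/data", "example.com")
    print(hdr, validate(hdr, "GET", "/api/data", "example.com"))
```

The same search loop runs orders of magnitude faster in optimized native code than on a constrained client, which is exactly the discrepancy described in the second observation above.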
Based on these points, one of my concerns is that standardizing such a
mechanism at the HTTP level would be an even higher incentive to build
services whose purpose is to offload this processing to infected devices
and browsers across the world, and very likely this would even be
implemented in certain proxies to offload it from the browser itself
(i.e. "use my access point and do not discharge your battery").

Thus I tend to think that instead it could be useful to document what works
and what doesn't work, so that everyone implements different solutions,
because *these* cost more to implement. But nowadays CPU and network are
almost free for attackers (OK, they rent them, but prices are low enough
that many attacks already exist).

I agree with Demi Marie that inflicting *some* cost can be useful, but as
soon as the cost is high enough, it blocks legitimate users and it's
someone else who starts to pay for the CPU.

I've found that other methods were sometimes more effective, such as always
failing after a delay with no indication of blocking for brute-force
attacks, and randomly returning junk to web scrapers that is difficult to
differentiate from valid content (e.g. old content, ads, text made of
swapped words, etc.). *This* does add a significant cost to the attack
because it quickly requires human intervention.

Just my two cents,
Willy
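As a purely illustrative sketch of the "fail slowly and return junk" idea mentioned above (not something from this mail; the scraper heuristic and the decoy generator are placeholders), a toy stdlib server might look like this:

```python
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder vocabulary for decoy pages; a real setup would recycle old or templated content.
DECOY_WORDS = ["archive", "catalog", "notice", "update", "summary", "release", "bulletin"]


def looks_like_scraper(handler):
    # Placeholder heuristic; real deployments would rely on rates, reputation, fingerprints, etc.
    return "python-requests" in handler.headers.get("User-Agent", "").lower()


def decoy_page():
    # Plausible-looking but worthless content: shuffled words, served with a normal 200.
    return "<html><body><p>" + " ".join(random.choices(DECOY_WORDS, k=200)) + "</p></body></html>"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if looks_like_scraper(self):
            time.sleep(random.uniform(2.0, 5.0))  # fail slowly, never signal blocking
            body = decoy_page().encode()
        else:
            body = b"<html><body>real content</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

The response is indistinguishable from a slow but successful request, so deciding that the content is worthless requires a human to look at it, which is where the cost to the attacker comes from.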
Received on Friday, 23 May 2025 06:27:44 UTC