- From: Mike Samuel <msamuel@google.com>
- Date: Thu, 16 Feb 2017 11:45:29 -0500
- To: uri@w3.org
I'm preparing a blogpost on the dangers of using simple regular expression matches to triage URLs, especially when blacklisting. I'm hoping that the sheer weight of corner cases will convince readers that there's no "one last tweak" that fixes all the issues, so it's better not to start down that path. Could people chime in with corner cases that I've missed, even ignoring the differences between https://url.spec.whatwg.org/#url-writing and RFC 3986 URI references?

cheers,
mike

---- Draft below. ----

I often get asked to review code that grants privileges to certain URLs. A large chunk of the web is based on the same-origin policy, so this is inevitable, but I've noticed a few particularly bad ways of doing this. One of the worst is "let's identify a set of bad sites and describe it using regular expressions." Much ink has been spilt on the reasons to whitelist instead of blacklist, but I'd like to focus on why, even when blacklisting is appropriate, regular expressions are not.

Even competent developers tend to do a terrible job of writing regexps to match URLs. For example, when trying to come up with a blacklist filter that prevents access to evil.org, someone might write

    ^https?://evil[.]org/.*

which looks seductively simple and seems to succinctly express what we don't want.

An effective blacklist, though, needs to recognize many, many ways to express equivalent URLs. There are many ways to defeat the simple regex above, causing it to spuriously not match a string that is effectively equivalent to "http://evil.org/":

1. Protocol-relative or mixed-case protocols: //evil.org/ or HTTP://evil.org/
2. Uppercase hostnames: http://EVIL.org/
3. Absolute domain names (the trailing dot disables domain suffix search): http://evil.org./
4. Explicit ports: http://evil.org:80/ or https://evil.org:443/
5. Missing path: http://evil.org
6. Numeric hosts: http://1.2.3.4/
7. Spaces around the attribute value. Most HTML attributes that contain URLs are specified as a "valid non-empty URL potentially surrounded by spaces", so " http://evil.org/ " might be mis-classified as a path-relative URL.
8. Newlines, which don't match the regex dot: "http://evil.org/\n/."
9. Subdomains: http://super.evil.org/
10. Authorities: http://@evil.org/
11. Open redirectors: http://tinyurl.com/2a5r7oh

If you try to come up with a regex that handles all these cases you get something like

    (?s)^[\t\n\x0C\r ]*(?i:https?)://([^:/#?@]*@)?(?i:([^/#?:]*[.])?evil[.]org[.]?)(:[^/#?:]*)?([/#?].*)

which is probably still wrong, and which lacks the expressiveness, succinctness, and maintainability of the original.

You can reduce the attack surface by trying to "canonicalize" URLs, but fully canonicalizing a network address requires multiple network round trips, leaks information, often introduces a tarpitting vulnerability, and the result stays valid only until the target domain expires or the target acquires a new trampoline domain.
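To make the evasions concrete, here's a rough Python sketch (illustrative only: it assumes Python's re module semantics and simply replays the variant spellings from the list above against the two patterns in the draft):

    import re

    # The seductively simple blacklist pattern from the draft.
    NAIVE = re.compile(r'^https?://evil[.]org/.*')

    # The hardened pattern from the end of the draft.
    HARDENED = re.compile(
        r'(?s)^[\t\n\x0C\r ]*(?i:https?)://([^:/#?@]*@)?'
        r'(?i:([^/#?:]*[.])?evil[.]org[.]?)(:[^/#?:]*)?([/#?].*)')

    # Variant spellings that are effectively references to evil.org.
    # (The numeric-host and open-redirector cases are omitted because
    # they never mention evil.org by name, and the newline case depends
    # on how the match is anchored.)
    EVASIONS = [
        '//evil.org/',             # protocol-relative
        'HTTP://evil.org/',        # mixed-case protocol
        'http://EVIL.org/',        # uppercase hostname
        'http://evil.org./',       # absolute domain name (trailing dot)
        'http://evil.org:80/',     # explicit port
        'http://evil.org',         # missing path
        ' http://evil.org/ ',      # spaces around the attribute value
        'http://super.evil.org/',  # subdomain
        'http://@evil.org/',       # empty userinfo before the host
    ]

    # The one spelling the naive pattern was written for.
    assert NAIVE.match('http://evil.org/')

    for url in EVASIONS:
        # Every variant slips past the naive blacklist...
        assert NAIVE.match(url) is None
        # ...while the hardened pattern catches most of them, but still
        # misses the protocol-relative and path-less spellings.
        if url in ('//evil.org/', 'http://evil.org'):
            assert HARDENED.match(url) is None
        else:
            assert HARDENED.match(url)

Even the hardened pattern lets a couple of the spellings through, which is part of why I say it's probably still wrong.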