Pitfalls in URI matching

I'm preparing a blogpost on the dangers of using simple regular
expression matches to triage URLs, especially when blacklisting.

I'm hoping that the sheer weight of corner cases will convince readers
that there's no "one last tweak" that fixes all the issues so it's
better not to start down that path.

Could people chime in with corner cases that I've missed?

That's even ignoring the difference between
https://url.spec.whatwg.org/#url-writing and RFC 3986 URI references.

cheers,
mike

----
Draft below.
----
I often get asked to review code that grants privileges to certain
URLs.  A large chunk of the web is based on the same-origin policy, so
this is inevitable, but I've noticed a few particularly bad ways of
doing this.

One of the worst is "let's identify a set of bad sites and describe it
using regular expressions."  Much ink has been spilt on the reasons to
white-list instead of black-list, but I'd like to focus on why, even
when black-listing is appropriate, regular expressions are not.

Even competent developers tend to do a terrible job of writing regexps
to match URLs.  For example, when trying to come up with a blacklist
filter that prevents access to evil.org, someone might write

    ^https?://evil[.]org/.*

which looks seductively simple, and seems to succinctly express what we don't
want.
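
For concreteness, here is a minimal sketch of that filter, assuming
Python and full-string matching (re.fullmatch), which is how many
filtering APIs behave; the helper name is made up for illustration.

    import re

    # Hypothetical wrapper around the naive pattern above.
    NAIVE_BLACKLIST = re.compile(r"^https?://evil[.]org/.*")

    def is_blacklisted(url):
        """True only if the whole URL matches the naive pattern."""
        return NAIVE_BLACKLIST.fullmatch(url) is not None

    # The obvious spelling is caught...
    assert is_blacklisted("http://evil.org/payload")
    # ...but none of the corner cases below is.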

An effective blacklist needs to recognize the many, many ways to
express equivalent URLs.

There are plenty of ways to defeat the simple regex above, causing it
to spuriously fail to match a string that is effectively equivalent to
"http://evil.org/"; each of the following slips past it, as the sketch
after the list demonstrates.

1. Protocol-relative or mixed-case protocols: //evil.org/, HTTP://evil.org/
2. Uppercase hostnames: http://EVIL.org/
3. Absolute domain names (trailing dot disables domain suffix search):
    http://evil.org./
4. Explicit ports: http://evil.org:80/, https://evil.org:443/
5. Missing path: http://evil.org
6. Numeric hosts: http://1.2.3.4/
7. Spaces around the attribute value.
   Most HTML attributes that contain URLs are defined to take a
   "valid non-empty URL potentially surrounded by spaces",
   so " http://evil.org/ " might be mis-classified as a path-relative URL.
8. Newlines, which regex dot does not match by default: "http://evil.org/\n/"
9. Subdomains: http://super.evil.org/
10. Authorities: http://@evil.org/
11. Open redirectors: http://tinyurl.com/2a5r7oh
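
Continuing the sketch above, every one of these forms can end up at
evil.org (the numeric host assumes evil.org resolves there, and the
tinyurl link stands in for any open redirector), yet none of them is
caught by the naive pattern:

    bypasses = [
        "//evil.org/",                 # protocol-relative
        "HTTP://evil.org/",            # mixed-case scheme
        "http://EVIL.org/",            # uppercase host
        "http://evil.org./",           # absolute domain name
        "http://evil.org:80/",         # explicit default port
        "http://evil.org",             # missing path
        "http://1.2.3.4/",             # numeric host
        " http://evil.org/ ",          # surrounding spaces
        "http://evil.org/\n/",         # newline in the path
        "http://super.evil.org/",      # subdomain
        "http://@evil.org/",           # empty userinfo in the authority
        "http://tinyurl.com/2a5r7oh",  # open redirector
    ]

    for url in bypasses:
        assert not is_blacklisted(url), url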

If you try to come up with a regex to handle all these cases, you get
something like this:

    (?s)^[\t\n\x0C\r ]*(?i:https?)://([^:/#?@]*@)?(?i:([^/#?:]*[.])?evil[.]org[.]?)(:[^/#?:]*)?([/#?].*)

which is probably still wrong, and lacks the expressiveness,
succinctness, and maintainability of the original.
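
Sticking with the Python sketch and full-string matching, the hardened
pattern does catch most of the syntactic variants above, but several of
them still slip through, which is rather the point:

    import re

    HARDENED = re.compile(
        r"(?s)^[\t\n\x0C\r ]*(?i:https?)://([^:/#?@]*@)?"
        r"(?i:([^/#?:]*[.])?evil[.]org[.]?)(:[^/#?:]*)?([/#?].*)")

    # Case, port, trailing-dot, whitespace, subdomain and userinfo
    # variants are now recognized...
    for url in ["HTTP://evil.org/", "http://evil.org.:80/x",
                " http://EVIL.org/ ", "http://super.evil.org/",
                "http://@evil.org/"]:
        assert HARDENED.fullmatch(url), url

    # ...but these still get through, and no regex can handle the last two.
    for url in ["//evil.org/", "http://evil.org",
                "http://1.2.3.4/", "http://tinyurl.com/2a5r7oh"]:
        assert not HARDENED.fullmatch(url), url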

You can reduce the attack surface by trying to "canonicalize" URLs, but
fully canonicalizing a network address requires multiple network round trips,
leaks information, often introduces a tarpitting vulnerability, and is only
valid for as long as it takes the target domain to expire or the target to
acquire a new trampoline domain.
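
Purely syntactic canonicalization is the cheap part; a sketch using
Python's urllib (with a made-up helper name) might look like the
following, and it narrows things down without touching the network,
but it can say nothing about numeric hosts, aliased domains, or open
redirectors:

    from urllib.parse import urlsplit, urlunsplit

    DEFAULT_PORTS = {"http": 80, "https": 443}

    def canonicalize(url):
        """Lower-case the scheme and host, strip surrounding whitespace
        and the trailing dot, and drop default ports."""
        parts = urlsplit(url.strip())
        scheme = parts.scheme.lower()
        host = (parts.hostname or "").lower().rstrip(".")
        port = parts.port
        if port is None or port == DEFAULT_PORTS.get(scheme):
            netloc = host
        else:
            netloc = "%s:%d" % (host, port)
        return urlunsplit((scheme, netloc, parts.path or "/",
                           parts.query, parts.fragment))

    print(canonicalize(" HTTP://EVIL.org.:80 "))  # http://evil.org/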
