Re: Canonicalization of IRIs in security contexts

Hello Yaron,

I have just entered your issue into our tracker as ticket #45 
(http://trac.tools.ietf.org/wg/iri/trac/ticket/45). This is mainly so 
that it is not forgotten.

Eventually this should be decomposed into more fine-grained issues (and 
a few high-level issues, such as whether and to what extent we want to 
deal with these concerns in the main IRI spec, or whether we had better 
put them into a separate draft).

Regards,   Martin.

On 2010/04/06 8:32, Yaron Goland wrote:
> Of late I've been worrying about the use of URIs/IRIs in security contexts. So I wrote up a paper that explores some of the issues and have included it below. I shared this paper with Ted Hardie, Larry Masinter and Dave Thaler. We were mostly discussing who should actually own worrying about this problem. Ted suggested that NewPrep (assuming it gets created as a WG) should own this. Larry just asked that we move this discussion to the IRI mailing list as the IRI WG is now worrying about security considerations. So here is the paper.
>
> Thoughts?
>
>                                  Thanks,
>
>                                                  Yaron
> Secure Comparison of URIs and IRIs in security token environments
> Current purpose of this document
> The purpose of this paper is to motivate that a problem exists with URI canonicalization in the context of security token environments and that this problem needs to be resolved.
>
> This paper does not contain nor attempt to contain an exhaustive collection of URI canonicalization issues. Rather it contains what is hoped to be a sufficiently large collection of canonicalization issues to motivate the need for a solution.
> Problem Description
> This paper looks at issues related to using URIs in secure ways in security token based access control systems. Examples of such systems include WS-*, SAML-P and OAuth WRAP. In such systems a variety of participants in the security infrastructure are identified by URIs. For example, requesters of security tokens are sometimes identified with URIs. The issuers of security tokens and the relying parties who are intended to consume security tokens are frequently identified by URIs. Claims in security tokens often have their types defined using URIs and the values of the claims can also be URIs.
>
> The most common operation on URIs in a security token context is a straightforward comparison. For example, a relying party is consuming a security token. The relying party will want to look up the name of the issuer of the security token, which can be a URI, in their local database and find the keying material associated with that issuer. The relying party will then use the keying material to verify that the security token is valid. This pattern requires a simple comparison of the submitted URIs with recorded URIs.
>
> As outlined in the rest of this document there are a number of decisions that a canonicalizer can make when canonicalizing URIs for comparison purposes. For example, some URI canonicalizers will strip out fragments so that http://example.com/foo#1234 and http://example.com/foo will be treated as equal. Similar treatment is also provided for userinfo, e.g. http://joe:password@example.com/foo will be treated the same as http://example.com/foo. And all of this is before even beginning to think through Unicode issues such as how to deal with case insensitive environments.
>
> The reason these inconsistencies matter is that they open up potential security holes. For example, the Foo corporation has paid money to the example.com corporation for access to the stuff service. The Foo corporation allows its employees to create accounts on the stuff service, so user Joe could get the account http://example.com/stuff/FooCorp/joe and user Jane could get http://example.com/stuff/FooCorp/Jane. It turns out, however, that Foo Corp's canonicalizer honors fragments for comparison purposes. So Jack, who is a malicious employee of Foo Corp, asks to create an account at example.com with the name joe#stuff. Foo Corp's URI logic checks its records for accounts it has created with stuff and sees that there is no account with the name joe#stuff so, in its records, it associates the account joe#stuff with Jack and will only issue tokens good for use with http://example.com/stuff/FooCorp/joe#stuff to Jack.
>
> Jack, the attacker, goes to the security token service at Foo Corp and asks for a security token good for http://example.com/stuff/FooCorp/joe#stuff. FooCorp is happy to issue the token since Jack is the legitimate owner (in Foo Corp's eyes) of the joe#stuff account. Jack then submits the security token in a request to http://example.com/stuff/FooCorp/joe.
>
> But example.com uses a URI canonicalizer that, for the purposes of checking equality, ignores fragments. So when example.com looks in the security token to see if the requester has permission from Foo Corp to access the given account, it successfully matches the URI in the security token, http://example.com/stuff/FooCorp/joe#stuff, with the request-URI http://example.com/stuff/FooCorp/joe.
>
> Leveraging the inconsistencies in the canonicalizers used by Foo Corp and example.com, Jack is able to successfully launch an elevation of privilege attack.
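
The mismatch in the attack above can be sketched in a few lines of Python. This is only an illustration: the two functions are hypothetical stand-ins for the canonicalizers at Foo Corp and example.com, not anything either party is known to run.

```python
from urllib.parse import urldefrag

def issuer_key(uri: str) -> str:
    """Hypothetical Foo Corp rule: the fragment is part of the identity."""
    return uri

def relying_party_key(uri: str) -> str:
    """Hypothetical example.com rule: drop the fragment before comparing."""
    return urldefrag(uri).url

joe = "http://example.com/stuff/FooCorp/joe"
jack = "http://example.com/stuff/FooCorp/joe#stuff"

# Foo Corp sees two distinct accounts, so it issues Jack a token.
print(issuer_key(joe) == issuer_key(jack))                # False
# example.com sees one account, so Jack's token opens Joe's account.
print(relying_party_key(joe) == relying_party_key(jack))  # True
```

Either rule is defensible on its own; the hole only appears because the two parties disagree.
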
> What's up with the colors and the weird SCUXXX identifiers?
> I track requirements using unique identifiers. So each requirement gets an identifier of the form SCUXXX where XXX are three alphabetic letters. There is no meaning to each identifier. I just generate them as I need them. I use a dedicated style for the requirements both to highlight them and also to make it easy to generate a table of them automatically at the end of the doc.
> Relative URIs
> Is it possible to have meaningful URI comparisons involving relative URIs or do we require that all URIs are fully qualified before being submitted to the canonicalization algorithm?
>
>
> SCUAAA - A secure URI canonicalization profile MUST define if it allows relative URIs.
>
> Hostname or URI resolution
> Some systems (specifically Java) used to follow the rule that if two host names resolved to the same IP then the host names were considered equal. But with the introduction of virtual hosting and dynamic IP addresses this method of comparison cannot be relied upon.
>
> In addition, a comparison mechanism which relies on the ability to resolve identifiers like host names to other identifiers like IP addresses inherently leaks information about security decisions to outsiders, since these kinds of queries are often publicly viewable (e.g. someone could track DNS traffic and from that determine who an entity was likely getting security tokens from or being asked to generate security tokens for). So are there security issues in requiring name resolution as part of the canonicalization algorithm?
>
> And, if a canonicalization algorithm does require some kind of network access to work, how does it function in network restricted or offline contexts?
>
>
> SCUAAB - A secure URI canonicalization profile MUST define if it requires network access in order to canonicalize a URI.
>
>
> SCUAAS - A secure URI canonicalization profile MUST define if it compares host name values to host name values or if it requires the host name to first be resolved to an IP address or some other underlying identifier as part of the canonicalization process.
>
> Fragment components
> Some URI formats include fragment identifiers. These are typically handles to locations within a resource and are used for local reference. A classic example is the use of fragments in HTTP URLs where a URL of the form http://foo.com/blah.html#ick means "retrieve the resource http://foo.com/blah.html and, once it has arrived locally, find the HTML anchor named 'ick' and display that".
>
> So, for example, when a user clicks on the link http://foo.com/blah.html#baz a browser will check its cache by doing a URI comparison for http://foo.com/blah.html and if the resource is present in the cache a match is declared.
>
>
> SCUAAC - A secure URI canonicalization profile MUST define how URI fragments are to be treated as part of the canonicalization process.
>
> Query components
> Similar to fragments, there is the question of whether http://foo.com/blah and http://foo.com/blah? are equal or different.
>
>
> SCUAAR - A secure URI canonicalization profile MUST define how query components of URIs are to be treated as part of the canonicalization process.
>
> But what about the values in a query component? Should http://foo.com/blah?ick=bick&foo=bar be considered equal to http://foo.com/blah?foo=bar&ick=bick?
>
>
> SCUAAY - A secure URI canonicalization profile MUST define if it will allow for the re-ordering of query argument values and if so, how.
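
A small Python sketch of the two query questions above. The function names are made up for illustration, and sorting the parsed pairs is just one possible re-ordering rule, not a recommendation.

```python
from urllib.parse import parse_qsl, urlsplit

def query_as_written(uri: str) -> str:
    """Compare the query component exactly as it appears."""
    return urlsplit(uri).query

def query_sorted(uri: str) -> list:
    """Hypothetical rule: compare the sorted list of (name, value) pairs.

    Note that parse_qsl also percent-decodes names and values, which is
    itself a canonicalization decision.
    """
    return sorted(parse_qsl(urlsplit(uri).query, keep_blank_values=True))

a = "http://foo.com/blah?ick=bick&foo=bar"
b = "http://foo.com/blah?foo=bar&ick=bick"

print(query_as_written(a) == query_as_written(b))  # False
print(query_sorted(a) == query_sorted(b))          # True

# urlsplit reports an empty query both for "blah" and for "blah?", so a
# comparison built on it has silently decided that the two are equal.
print(urlsplit("http://foo.com/blah").query == urlsplit("http://foo.com/blah?").query)
```
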
>
> URI Scheme names
> RFC 3986 defines URI schemes as being case insensitive and in section 6.2.2.1 specifies that scheme names should be normalized to lower case characters. But separately it specifies that percent-encoded characters should be normalized to upper case characters. Do we want this inconsistency?
>
>
> SCUAAF - A secure URI canonicalization profile MUST define how URI scheme names are to be normalized (e.g. to upper or lower case?)
>
> Host names
>
> SCUAAM - A secure URI canonicalization profile MUST define how URI host names are to be normalized (e.g. to upper or lower case characters?)
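
For scheme and host names, a minimal Python sketch of the lower-case choice, assuming ASCII-only host names. The function name is hypothetical; userinfo and path are deliberately left untouched because they may be case sensitive.

```python
from urllib.parse import urlsplit, urlunsplit

def lowercase_scheme_and_host(uri: str) -> str:
    """Lower-case the scheme and the host[:port] part, nothing else."""
    parts = urlsplit(uri)
    # Split off any userinfo so that it is not lower-cased with the host.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )

print(lowercase_scheme_and_host("HTTP://Joe@Example.COM:8080/Foo"))
# http://Joe@example.com:8080/Foo
```
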
>
> Userinfo
> RFC 3986 defines the userinfo production that allows arbitrary data about the user of the URI to be placed before @ signs in URIs. For example: http://joe:jane:jack:yo@example.com/bar has the value "joe:jane:jack:yo" as its userinfo. When canonicalizing a URI in a security context, should the userinfo be left in? Some URI comparison services, for example, treat http://joe:ick@example.com and http://example.com as being equal.
>
>
> SCUABD - A secure URI canonicalization profile MUST specify what is to happen to any userinfo included in a URI during the canonicalization process.
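
A short Python sketch of the "drop the userinfo" choice, using the examples from the text. strip_userinfo is a hypothetical helper, and dropping is only one of the possible answers to the requirement above.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_userinfo(uri: str) -> str:
    """Hypothetical rule: discard everything before the last "@" in the authority."""
    parts = urlsplit(uri)
    hostport = parts.netloc.rpartition("@")[2]
    return urlunsplit(
        (parts.scheme, hostport, parts.path, parts.query, parts.fragment)
    )

parts = urlsplit("http://joe:jane:jack:yo@example.com/bar")
print(parts.username)  # joe
print(parts.password)  # jane:jack:yo
print(parts.hostname)  # example.com

# With userinfo stripped, the two URIs from the text compare as equal.
print(strip_userinfo("http://joe:ick@example.com") == "http://example.com")  # True
```
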
>
> IPv6 Host Names
> IPv6 names have a wide variety of alternate but semantically identical syntaxes.
>
>
> SCUAAK - A secure URI canonicalization profile MUST define how IPv6 addresses are canonicalized to a standard format.
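
To make the IPv6 point concrete, here is a Python sketch that maps three spellings of one address onto a single text form. The addresses are illustrative values from the documentation prefix, and taking the compressed lower-case form printed by the standard ipaddress module as "the" canonical form is an assumption, not something the text prescribes.

```python
import ipaddress

# Three spellings of the same 128-bit address.
forms = ["2001:db8:0:0:0:0:0:1", "2001:0DB8::1", "2001:db8::0:1"]

# str() of an IPv6Address gives the compressed, lower-case form.
canonical = {str(ipaddress.IPv6Address(f)) for f in forms}
print(canonical)  # {'2001:db8::1'}
```
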
>
> IPv4 Host Names
> The BNF for URIs is ambiguous when it comes to distinguishing IPv4 addresses from registered names. RFC 3986 tries to resolve this ambiguity by arguing that when processing a host name if it matches the IPv4 production IPv4address then it is an IPv4 address otherwise it is a reg-name. But this solution seems on its face unsatisfying as it is likely to be confusing to normal users. Can we really expect a normal user when dealing with a security context to fully grasp that 12.12.12.12 will be treated as an IPv4 address and not as a DNS host name? Maybe IPv4 addresses should just be banned from canonicalization because of the confusion they can cause? Or perhaps domain names that look like IPv4 addresses should be banned? This is similar in spirit to the homograph problem in Unicode.
>
>
> SCUABD - A secure URI canonicalization profile MUST specify how it handles IPv4 addresses and the ambiguities of IPv4 versus reg-names.
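
The "first match wins" rule can be sketched in Python as follows. classify_host is a hypothetical helper, and the standard ipaddress parser is only an approximation of the IPv4address production in RFC 3986.

```python
import ipaddress

def classify_host(host: str) -> str:
    """Return "IPv4address" if the host parses as dotted-quad IPv4, else "reg-name"."""
    try:
        ipaddress.IPv4Address(host)
        return "IPv4address"
    except ValueError:
        return "reg-name"

print(classify_host("12.12.12.12"))   # IPv4address
print(classify_host("12.12.12.256"))  # reg-name: 256 is not a valid octet
print(classify_host("12.12.12.com"))  # reg-name
```

The second case is the kind of surprise the text worries about: a string that looks like an address to a normal user falls through to being treated as a registered name.
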
>
> DNS versus non-DNS names
> RFC 3986 explicitly allows for the idea that host names might not be DNS names (or IP addresses). But no mechanism is provided to explicitly indicate when a host name is not a DNS name. This can lead to potential security issues if the sender of a URI thinks they are referring to a non-DNS name while the receiver of the URI believes that the host name is a DNS name.
>
>
> SCUAAT - A secure URI canonicalization profile MUST define if non-DNS/IP names are allowed as host names.
>
> Punycode versus non-ASCII Host name characters
> RFC 3986 in section 3.2.2 specifically allows for the use of percent-encoded UTF-8 characters in the host name, in addition to the use of IDNA names. This creates an ambiguity for canonicalization since it isn't clear if all host names that involve international characters should be canonicalized to IDNA names or perhaps IDNA names and host names with international characters are considered mutually exclusive?
>
>
> SCUAAU - A secure URI canonicalization profile MUST define the canonicalization relationship of host names with internationalized characters and IDNA names.
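
As an illustration of the ambiguity, the Python sketch below writes one host name in three ways. The host name is made up, and Python's built-in "idna" codec implements the older IDNA 2003 rules, so treat the conversion as an example of the relationship rather than as the mapping a profile should adopt.

```python
from urllib.parse import unquote

percent_form = "b%C3%BCcher.example"   # percent-encoded UTF-8
unicode_form = unquote(percent_form)    # the same name as Unicode text
a_label_form = unicode_form.encode("idna").decode("ascii")  # Punycode form

print(unicode_form)   # bücher.example
print(a_label_form)   # xn--bcher-kva.example

# A byte-wise comparison sees three different host names.
print(len({percent_form, unicode_form, a_label_form}))  # 3
```
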
>
> Path Segment Normalization
> RFC 3986 supports the use of path segment values such as ./ or ../ for relative URLs. Strictly speaking including such path segment values in a fully qualified URI is syntactically illegal but RFC 3986 nevertheless defines an algorithm to remove them (see section 5.2.4 of RFC 3986).
>
>
> SCUAAP - A secure URI canonicalization profile MUST define if "." or ".." characters are allowed as relative references in fully qualified URIs and if so how they are to be canonicalized.
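
For reference, a Python sketch of the dot-segment removal steps from RFC 3986, written from the prose of that algorithm. It works on the path component only and is meant as an illustration, not as a vetted implementation.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a URI path (after RFC 3986, 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]              # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]              # "/../" becomes "/"
            if output:
                output.pop()             # and the previous segment is dropped
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, with its leading "/", to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)

print(remove_dot_segments("/a/b/c/./../../g"))            # /a/g
print(remove_dot_segments("/stuff/FooCorp/jack/../joe"))  # /stuff/FooCorp/joe
```

The second call shows why this matters in a security context: a path that textually names jack ends up naming joe once it is canonicalized.
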
>
> Percent Encoding
>
> SCUAAY - A secure URI canonicalization profile MUST define how to canonicalize percent encoded characters that are not going to be unencoded.
>
> RFC 3986 actually specifies that alphabetic characters in percent encoding (which are required to be in US-ASCII) should be canonicalized to upper case, which is inconsistent with how host names and scheme names are treated.
>
>
> SCUAAZ - A secure URI canonicalization profile MUST define if characters that are percent encoded but do not require percent encoding should be decoded as part of the canonicalization process.
>
> The previous, btw, assumes that we can even tell when a character didn't need encoding. For example, a delimiter character like "/" often needs encoding so if we see one encoded, especially in a scheme we don't explicitly support, it's ambiguous if it was unnecessarily encoded. On the other hand if we see the letter "a" encoded it's highly unlikely that was necessary. But is it guaranteed that it is unnecessary? Section 2.3 of RFC 3986 defines a set of characters it argues should be decoded but is that decoding required in the canonicalization process?
>
>
> SCUABA - A secure URI canonicalization profile MUST define when, if ever, it requires percent encoded characters to be decoded.
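
One possible answer to the percent-encoding requirements, sketched in Python: decode only the unreserved characters of RFC 3986 section 2.3 and upper-case the hex digits of everything that stays encoded. The function name is hypothetical, and whether a profile should decode at all is exactly the open question above.

```python
import re

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def normalize_percent_encoding(text: str) -> str:
    """Decode percent-encoded unreserved characters, upper-case the rest."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return _PCT.sub(fix, text)

print(normalize_percent_encoding("%61%2f%7e"))  # a%2F~
```

The encoded "/" stays encoded because decoding it would change how the path splits into segments.
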
>
> Unicode
>
> SCUABF - I need a stiff drink before I even begin to think about this section. But http://unicode.org/reports/tr36/ makes for some motivational reading. Or for those with a more visual bent - http://www.casabasecurity.com/files/Chris_Weber_Character%20Transformations%20v1.7_IUC33.pdf.
>
> Transcription
> One of the key goals of the URI design was to enable human transcription of URIs. But is this a goal for canonicalization in a secure context? Should secure canonicalization just worry about having an easy to generate machine readable format or is there a requirement that the output of the canonicalization be transcribable?
>
>
> SCUABD - A secure URI canonicalization profile MUST define if transcription of the canonicalized URIs it produces is a goal.
>
> Handling unrecognized schemes
> Is it ever safe for a canonicalizer to canonicalize an unrecognized URI/IRI scheme type? For example, a new URI scheme type IPPY might have a default port of X. Therefore IPPY://foo.com:X and IPPY://foo.com should be treated as equivalent since X is the default port for the IPPY scheme. But a canonicalizer that doesn't know the IPPY scheme also will not know its default port and so cannot safely canonicalize a URI with an unrecognized scheme. Similar issues apply when dealing with default hosts. A canonicalizer dealing with a file URL that didn't know that localhost is a reserved host value and equivalent to an empty host couldn't canonicalize in a reasonable way.
>
>
> SCUABC - A secure URI canonicalization profile MUST specify if the canonicalizer is allowed to canonicalize unrecognized URI schemes and if so, how.
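
The default-port problem can be sketched in Python like this. The port table, the function name, and the port number 4711 standing in for "X" are all assumptions made for the example; refusing unknown schemes is one possible policy, not the only one.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports this hypothetical canonicalizer knows about.
KNOWN_DEFAULT_PORTS = {"http": 80, "https": 443}

def strip_default_port(uri: str) -> str:
    """Drop an explicit port that equals the scheme's default port.

    Raises ValueError for schemes whose default port is unknown, because
    there is then no safe way to decide whether the port is redundant.
    """
    parts = urlsplit(uri)
    if parts.scheme not in KNOWN_DEFAULT_PORTS:
        raise ValueError("unrecognized scheme: " + parts.scheme)
    if parts.port == KNOWN_DEFAULT_PORTS[parts.scheme]:
        parts = parts._replace(netloc=parts.netloc.rsplit(":", 1)[0])
    return urlunsplit(parts)

print(strip_default_port("http://foo.com:80/x"))  # http://foo.com/x

try:
    strip_default_port("ippy://foo.com:4711")
except ValueError as err:
    print(err)  # unrecognized scheme: ippy
```
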
>
> Handling unrecognized IP address types
> RFC 3986 introduces an extension point to enable future changes to the IP address format using the IPvFuture production. But can a canonicalizer safely deal with an IP syntax it doesn't explicitly recognize? The example of IPv6 which has many forms with the same semantic content is instructive as a canonicalizer that encountered an IPv6 address but didn't recognize such addresses could not perform necessary canonicalization.
>
>
> SCUABE - A secure URI canonicalization profile MUST specify if the canonicalizer is allowed to canonicalize unrecognized IP address formats and if so, how.
>
> Handling syntactically illegal URIs
> What happens if a URI that is submitted for canonicalization is syntactically illegal? Do we try to canonicalize around the errors or just reject the URI altogether? This all assumes that the canonicalization profile even requires detecting if the URI is syntactically legal in the first place.
>
>
> SCUABD - A secure URI canonicalization profile MUST specify how it handles URIs that are syntactically illegal.
>
> Which canonicalization profile is being used?
> Can we really have a single canonicalization profile or do we need multiple ones? At a minimum I would imagine that we would have one profile for environments that treat URIs in a case sensitive manner and another for URIs in a case insensitive manner.
>
>
> SCUABH - A secure URI canonicalization profile MUST specify how many different canonicalization profiles it supports.
>
> And if there is more than one canonicalization profile doesn't this place requirements on security token formats and protocols that use the canonicalization mechanism to explicitly define which profile they expect will be used with a particular URI?
>
>
> SCUABI - A secure URI canonicalization profile MUST specify what requirements, if any, it places on formats or protocols that leverage the profile.
>
> Proposed Requirements
> This is where the actual URI canonicalization profile(s) would go.
> Q&A
> This is where we would answer questions about the tradeoffs and design choices about the canonicalization profile(s).
> Appendix
> General Requirements
>
> SCUAAA - A secure URI canonicalization profile MUST define if it allows relative URIs.
>
> SCUAAB - A secure URI canonicalization profile MUST define if it requires network access in order to canonicalize a URI.
>
> SCUAAS - A secure URI canonicalization profile MUST define if it compares host name values to host name values or if it requires the host name to first be resolved to an IP address or some other underlying identifier as part of the canonicalization process.
>
> SCUAAC - A secure URI canonicalization profile MUST define how URI fragments are to be treated as part of the canonicalization process.
>
> SCUAAR - A secure URI canonicalization profile MUST define how query components of URIs are to be treated as part of the canonicalization process.
>
> SCUAAY - A secure URI canonicalization profile MUST define if it will allow for the re-ordering of query argument values and if so, how.
>
> SCUAAF - A secure URI canonicalization profile MUST define how URI scheme names are to be normalized (e.g. to upper or lower case?)
>
> SCUAAM - A secure URI canonicalization profile MUST define how URI host names are to be normalized (e.g. to upper or lower case characters?)
>
> SCUABD - A secure URI canonicalization profile MUST specify what is to happen to any userinfo included in a URI during the canonicalization process.
>
> SCUAAK - A secure URI canonicalization profile MUST define how IPv6 addresses are canonicalized to a standard format.
>
> SCUABD - A secure URI canonicalization profile MUST specify how it handles IPv4 addresses and the ambiguities of IPv4 versus reg-names.
>
> SCUAAT - A secure URI canonicalization profile MUST define if non-DNS/IP names are allowed as host names.
>
> SCUAAU - A secure URI canonicalization profile MUST define the canonicalization relationship of host names with internationalized characters and IDNA names.
>
> SCUAAP - A secure URI canonicalization profile MUST define if "." or ".." characters are allowed as relative references in fully qualified URIs and if so how they are to be canonicalized.
>
> SCUAAY - A secure URI canonicalization profile MUST define how to canonicalize percent encoded characters that are not going to be unencoded.
>
> SCUAAZ - A secure URI canonicalization profile MUST define if characters that are percent encoded but do not require percent encoding should be decoded as part of the canonicalization process.
>
> SCUABA - A secure URI canonicalization profile MUST define when, if ever, it requires percent encoded characters to be decoded.
>
> SCUABD - A secure URI canonicalization profile MUST define if transcription of the canonicalized URIs it produces is a goal.
>
> SCUABC - A secure URI canonicalization profile MUST specify if the canonicalizer is allowed to canonicalize unrecognized URI schemes and if so, how.
>
> SCUABE - A secure URI canonicalization profile MUST specify if the canonicalizer is allowed to canonicalize unrecognized IP address formats and if so, how.
>
> SCUABD - A secure URI canonicalization profile MUST specify how it handles URIs that are syntactically illegal.
>
> SCUABH - A secure URI canonicalization profile MUST specify how many different canonicalization profiles it supports.
>
> SCUABI - A secure URI canonicalization profile MUST specify what requirements, if any, it places on formats or protocols that leverage the profile.
>
> Implementation Requirements
> (No implementation requirements have been defined yet.)
> Open Issues
>
> SCUABF - I need a stiff drink before I even begin to think about this section. But http://unicode.org/reports/tr36/ makes for some motivational reading. Or for those with a more visual bent - http://www.casabasecurity.com/files/Chris_Weber_Character%20Transformations%20v1.7_IUC33.pdf.
>
> Last Used ID
> SCUABI
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Tuesday, 28 September 2010 04:32:20 UTC