CORS Proxy from Henry Story on 2012-07-06 (public-rww@w3.org from July 2012)

From: Henry Story <henry.story@bblfish.net>
Date: Fri, 6 Jul 2012 23:10:30 +0200
To: Read-Write-Web <public-rww@w3.org>
Cc: WebID <public-webid@w3.org>, Joe Presbrey <presbrey@gmail.com>, Mike Jones <mike.jones@manchester.ac.uk>, Romain BLIN <romain.blin@etu.univ-st-etienne.fr>, Julien Subercaze <julien.subercaze@univ-st-etienne.fr>
Message-Id: <7F270728-082E-4405-AD0F-9A0FD02B5673@bblfish.net>

Hi,

I just quickly put together a CORS Proxy [1], inspired by Joe Presbrey's data.fm CORS proxy [2].

But first, what is a CORS proxy?
--------------------------------

A CORS [3] proxy is needed to allow read-write-web pages containing JavaScript agents, written with libraries such as rdflib [4], to fetch remote resources. Pages containing such agents are able to fetch and parse RDF on the web, and thus crawl the web by following their robotic nose. A CORS proxy is needed here because:

1- browsers restrict the sites JavaScript agents can fetch data from to the site the JavaScript itself came from: the famous "same origin policy"
2- CORS allows an exception to the above restriction, if the resource has the proper headers. For a GET request this is the Access-Control-Allow-Origin header
3- most RDF resources on the web are not served with such headers

Hence JavaScript agents running inside web browsers that need to crawl the web need a CORS proxy, so that libraries such as rdflib can make those requests through the proxy. In short: a CORS proxy is a service that forwards the request to the appropriate server and, on receiving the answer, adds the correct headers if none were found.
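To make the two halves of that concrete, here is a minimal sketch in Python (not the Scala code from [1], and not Joe's implementation). The proxy endpoint https://proxy.example/proxy?uri={uri}, the function names, the percent-encoding of the target and the choice of "*" as the added header value are all assumptions made for illustration.

```python
from urllib.parse import quote

# Hypothetical proxy endpoint, used only for illustration.
PROXY_TEMPLATE = "https://proxy.example/proxy?uri={uri}"


def proxied_url(target: str) -> str:
    """Client side: rewrite a target URL so the request goes via the proxy."""
    # Percent-encode every reserved character so the target travels
    # as a single query value (an assumption about the proxy's API).
    return PROXY_TEMPLATE.format(uri=quote(target, safe=""))


def add_cors_header(upstream_headers: dict) -> dict:
    """Proxy side: copy the upstream headers, adding the CORS header
    only if the remote server did not send one itself."""
    headers = dict(upstream_headers)
    # HTTP header names are case-insensitive, so compare in lower case.
    if not any(name.lower() == "access-control-allow-origin" for name in headers):
        headers["Access-Control-Allow-Origin"] = "*"
    return headers
```

A JS agent would then request proxied_url("http://example.org/card#me") instead of the original URL, and the proxy would answer with add_cors_header(...) applied to whatever the remote server sent back.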

Security
--------

So is there a security problem in having a CORS proxy make a GET request for some information on behalf of a JS agent? This is an important question, because if there is one, we would be introducing a security hole with such a proxy.

In order to answer that question we need to explain why browsers have the same origin restriction.

The answer is quite simple, I think. A JavaScript agent running in a browser is using the credentials of the user when it makes requests for resources on the web. One can therefore think of the browser as acting as a secretary for the JavaScript agent: the JS agent does not log in to a web site itself, but instead asks the browser to fetch the information. The browser uses its authentication credentials - the credentials of the browser's user, to be precise - to connect to remote sites and request resources. The remote site may be perfectly fine with the browser user/owner having access to the resource, but not like the idea of the agent in the browser doing so (after all, that could be some JS on some random site the user came across). In order to avoid this danger, the browser sends along with its requests an Origin: header containing the origin of the site where the JavaScript was found. The service receiving such a request must respond with an Access-Control-Allow-Origin header to make clear that it is ok with the JS agent receiving this information. If the browser finds that the web site allows the JS agent to receive the information too, then it will pass the information on to it.
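The browser's decision can be sketched roughly as follows, in Python for readability. This is a simplification under stated assumptions: it ignores preflight requests and same-origin requests, and the function name and parameters are invented for illustration. The credentialed branch reflects the CORS rule that a wildcard is not accepted when the request carries the user's credentials and that the server must also send Access-Control-Allow-Credentials: true.

```python
def script_may_read(script_origin: str, response_headers: dict,
                    with_credentials: bool = False) -> bool:
    """Return True if the browser would hand the response to the script."""
    # Header names are case-insensitive; normalise them before lookup.
    headers = {name.lower(): value.strip()
               for name, value in response_headers.items()}
    allowed = headers.get("access-control-allow-origin")
    if allowed is None:
        # The server said nothing, so the script gets nothing.
        return False
    if with_credentials:
        # With user credentials the server must name the origin exactly
        # and opt in explicitly; "*" is not enough.
        return (allowed == script_origin
                and headers.get("access-control-allow-credentials") == "true")
    return allowed in ("*", script_origin)
```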

This is a bit like what we discussed about a secretary agent on a web site requesting a resource On-Behalf-Of a user in our WebID delegation [5]. Here the browser is the secretary, and the JS agent is the user being acted on behalf of. The difference is that the secretary/browser in this case is well known, and the JS agent is the unknown. Put another way: in the CORS case the server's authorisation policies are geared towards giving the browser owner access, not the JS it is acting on behalf of, whereas in our delegation use case we were imagining the remote resource being authorised not for the secretary directly, but for the agent she was acting on behalf of. (A small shift in focus.)

Anonymous Proxy

So now it should be clear why using the proxy does not create a new security issue. The proxy is not the browser, and so has NOT authenticated as the user when it makes a request. The proxy is making an anonymous request to a remote resource. If the request succeeds, the remote resource is therefore public, and as such it is fine to allow any JS agent to read it. This is particularly true of GET requests. But it should even be true of POST, PUT and DELETE requests: if those are public and allow anonymous usage, then it should be possible for a CORS proxy to make them on behalf of a JavaScript agent. One possible exception is people placing identification cookies in URLs! (But one could argue that's their problem?!) It is quite possible that the CORS proxy might want to authenticate the user in order not to become a vector for denial of service attacks, and it could even give users a history of the requests it made.
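One way for a proxy to make sure its outgoing request really is anonymous is to drop anything credential-bearing before forwarding. A minimal sketch in Python, where the list of headers to strip and the function name are assumptions for illustration, not something taken from either proxy's code:

```python
# Request headers that can carry a user's credentials.
# This list is an assumption and may well be incomplete.
CREDENTIAL_HEADERS = {"cookie", "authorization", "proxy-authorization"}


def anonymous_request_headers(incoming: dict) -> dict:
    """Return the headers to forward upstream, without any credentials."""
    # Compare names in lower case, since header names are case-insensitive.
    return {name: value for name, value in incoming.items()
            if name.lower() not in CREDENTIAL_HEADERS}
```

Note that this does nothing about identifiers embedded in the URL itself.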

Authenticated Proxies

Now let us assume we have CORS proxies that can also authenticate with WebID. If they did so, they would have to follow the same rules as the browser: they should not pass on the information unless the server had allowed it by setting the correct Access-Control-Allow-Origin header, granting the JavaScript access, since the information they were given was not meant for this other JS agent.
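As a rough sketch of that rule in Python: the authenticated proxy only relays the body when the upstream server has named the JS agent's origin. This assumes the proxy forwarded the agent's Origin: header upstream, and it deliberately does not accept a wildcard, mirroring what browsers do for credentialed requests. The function name and the use of None to mean "withhold" are choices made for this illustration.

```python
from typing import Optional


def body_to_relay(js_origin: str, upstream_headers: dict,
                  upstream_body: bytes) -> Optional[bytes]:
    """Return the body the proxy may pass on, or None if it must withhold it."""
    # Header names are case-insensitive; normalise them before lookup.
    headers = {name.lower(): value.strip()
               for name, value in upstream_headers.items()}
    if headers.get("access-control-allow-origin") == js_origin:
        # The server explicitly allowed this JS agent's origin.
        return upstream_body
    # No header, a different origin, or "*": the data was given to the
    # authenticated proxy, not to the JS agent, so do not pass it on.
    return None
```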

But things could get a little more advanced yet: imagine that the proxy authenticated itself on behalf of the user who made the request. The server serving the resource could then verify that the proxy was allowed to act on behalf of the user using the procedure outlined in WebID delegation [5]. Then we would have to deal with WebID delegation plus CORS delegation. The server would know that the proxy was acting on behalf of the user, and that the user was acting on behalf of the JS agent... This may be the use case we had trouble putting our finger on at this week's WebID teleconf...

Improving CORS with LinkedData
------------------------------

Here is an idea to improve CORS: In WebID delegation we found a way to let the server know what relation the secretary had to the agent she acted on behalf of. That Agent she was acting on behalf of could add a

:me cert:secretary myprofile:secr .

relation to his profile. This would help the server know what relation the secretary had to the agent she claimed to be working for.

With CORS this relationship is nowhere made explicit (as far as I can see): there is no way for me to tell a web site that the JavaScript I am using (and that is hosted on my freedom box) is something I would like the server I am connecting to to trust, as opposed to some JavaScript that just started executing itself on some site I came across. This makes me think that
1. it would be useful if JS could be signed so as to have a better identity than just the identity of a whole site
2. we could create a relation such as the cert:secretary one that would allow me in my profile to say for example

:me cert:trustedJS <https://bblfish.net> .

It's really much too vague, but it would make it easier for people to trust my JS.

OTHER TODOS
-----------

There are really quite a lot of detailed questions left open for CORS proxies. Should a CORS proxy return a 203? When? How should it deal with error messages? .... I think we should organise those on a wiki page, and keep the discussion alive. My code at present was put together really quickly, but it made me ask a lot of questions whilst writing it.

Henry

[1] the code is here https://github.com/bblfish/Play20/blob/webid/framework/src/webid/src/main/scala/org/w3/readwriteweb/play/CORSProxy.scala
But I have not placed it online yet.
[2] Joe's proxy is online and available by sending GET requests here: http://data.fm/proxy?uri={uri}
and the code is somewhere here: https://github.com/linkeddata/data.fm
[3] http://www.w3.org/TR/cors/
[4] https://github.com/linkeddata/rdflib.js
[5] http://www.w3.org/wiki/WebID/Delegation

Social Web Architect
http://bblfish.net/

Received on Friday, 6 July 2012 21:11:03 UTC