Re: parsing URI (references) according to RFC 3986

On Mon, Jun 20, 2011 at 12:37 AM, Julian Reschke <julian.reschke@gmx.de> wrote:
> On 2011-06-20 09:21, Adam Barth wrote:
>> I wouldn't worry about file URLs for a while.  They're vastly more
>> complex than all the other kinds of URLs put together.  If we could
>> get interoperability for even just http URLs, I'd be happy.
>
> +1
>
> So, what *is* the set of interop problems here?
>
> 1) Extracting them from a/@href and friends (whitespace treatment)
>
> 2) Handling invalid ASCII characters (SP, "\", "<", ">"...)
>
> 3) Handling non-ASCII characters in query component
>
> 4) Handling non-ASCII characters in authority components
>
> 5) Handling non-ASCII characters everywhere else
>
> Anything else?

The page https://raw.github.com/abarth/url-spec/master/tests/gurl-results/by-browser.txt
lists a bunch of inputs for which browsers provide different outputs.
In the interest of simplifying the problem, I'd ignore the IP-address
related issues for now as those are also somewhat complicated.

From an information-theoretic point of view, all the information in
these tables needs to be included in the spec:

http://code.google.com/p/google-url/source/browse/trunk/src/url_canon_etc.cc#84
http://code.google.com/p/google-url/source/browse/trunk/src/url_canon_host.cc#78
http://code.google.com/p/google-url/source/browse/trunk/src/url_canon_path.cc#77
http://code.google.com/p/google-url/source/browse/trunk/src/url_canon_internal.cc#133

Different browsers have slightly different tables because they've each
tried to reverse engineer each other and didn't get it right because
there isn't a decent spec.  (Note: I don't particularly care
whether the tables in the spec match GURL.  I just care that we end up
with the same tables in every implementation.)

Even just trivial things need to be cleaned up, like:

http://ExAmple.CoM/
http://www.example.com/##asdf
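
As a rough illustration of what is at stake in these two inputs, here
is a minimal sketch using Python's standard urllib.parse.  It shows
only how that one library splits them (lowercased host, and a fragment
that keeps the second "#"); it is not a claim about what any browser
does or what a spec should say.  The helper name describe() is made up
for this example.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return (hostname, path, fragment) as urllib.parse sees them."""
    parts = urlsplit(url)
    # .hostname is the lowercased host; .netloc would keep the original case.
    return parts.hostname, parts.path, parts.fragment


# Mixed-case host: the library reports the host in lowercase.
mixed_case = describe("http://ExAmple.CoM/")

# Double "#": the split happens at the first "#", so the second one
# ends up inside the fragment.
double_hash = describe("http://www.example.com/##asdf")

print(mixed_case)
print(double_hash)
```

Other implementations may disagree on either point, which is exactly
the kind of difference a shared spec would have to settle.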

There's also some ugly stuff like (in JavaScript string-literal notation):

http://www.example.com/?q=\ud800\ud800

getting transformed to
http://www.example.com/?q=%26%2355296%3B%26%2355296%3B that needs to
be explained.
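
To make the shape of that output easier to see, here is a small Python
sketch.  Percent-decoding the observed query gives the text
"&#55296;&#55296;", and 55296 is 0xD800 in decimal, so the output reads
as a decimal numeric character reference per code unit, then
percent-encoded.  The helper ncr_then_percent_encode() is hypothetical
and only reproduces the observed string; it does not explain why an
implementation behaves this way.

```python
from urllib.parse import quote, unquote


def ncr_then_percent_encode(text):
    """Hypothetical helper: write each code unit as a decimal numeric
    character reference (&#N;), then percent-encode the result."""
    refs = "".join(f"&#{ord(ch)};" for ch in text)
    # safe="" so that "&", "#" and ";" are all percent-encoded.
    return quote(refs, safe="")


# The query value from the transformed URL above.
observed = "%26%2355296%3B%26%2355296%3B"

# Percent-decoding reveals two numeric character references.
decoded = unquote(observed)

# Rebuilding from the two lone surrogates gives the same string back.
rebuilt = ncr_then_percent_encode("\ud800\ud800")

print(decoded)
print(rebuilt == observed)
```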

If we can get interop on all the http-scheme test cases in
<http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/>, I'd be
very happy.

Adam

Received on Monday, 20 June 2011 08:04:29 UTC