Re: URL parsing

On Wed, Apr 28, 2010 at 7:59 AM, Julian Reschke <julian.reschke@gmx.de> wrote:
> On 23.04.2010 00:03, Adam Barth wrote:
>> I haven't been paying that close attention to all the machinations
>> around URL parsing in this working group, but I've been looking into
>> URL parsing a bit recently.  In case it's useful to this working group
>> (or the IETF's URL working group), I've attached some raw data on how
>> various browsers parse URLs.  These tests are from this test suite:
>>
>> http://trac.webkit.org/browser/trunk/LayoutTests/fast/url
>>
>> which is adapted from these unit tests:
>>
>> http://code.google.com/p/google-url/source/browse/trunk/src/url_canon_unittest.cc
>>
>> I might send a summary of my findings after I analyze the data.
>
> very interesting.
>
> Here's a question; picking a random test case; scheme name normalization:
>
>  PASS canonicalize('HTTP://example.com/') is 'http://example.com/'
>
> Could you explain, based on the HTML5 spec (if in doubt, an earlier version
> which doesn't yet rely on the IRI spec), why it's expected that the scheme
> name gets lowercased?

Oh, as I said above, this is "raw data."  The "expected" results are
just what the author of url_canon_unittest.cc thought the results
should be.  This data is purely an empirical measurement of what
browsers actually do.

In the case you mention, my recollection is that 3 out of 4 browsers
agree that you should lowercase the scheme.  Based on that evidence,
I'd probably recommend that the wayward browser also lowercase the
scheme.  However, I haven't looked into these issues in enough
detail to know if there are other considerations that might cause us
to prefer that browsers not lowercase the scheme.
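
For illustration only, the behavior in that test case can be sketched
as below.  This is a minimal sketch, not what any browser or
url_canon actually implements: the helper name lowercase_scheme is
hypothetical, and the scheme pattern is assumed to follow the RFC 3986
grammar (a letter followed by letters, digits, "+", "-", or ".").

```python
import re

# Assumed scheme syntax, per RFC 3986: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")


def lowercase_scheme(url: str) -> str:
    """Hypothetical helper: lowercase only the scheme, leave the rest as-is."""
    match = _SCHEME_RE.match(url)
    if match is None:
        # No scheme found (e.g. a relative reference); return input unchanged.
        return url
    scheme = match.group(1)
    return scheme.lower() + url[len(scheme):]


print(lowercase_scheme("HTTP://example.com/"))  # prints http://example.com/
```

Note that the sketch deliberately touches nothing after the scheme, so
host and path handling remain separate questions.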

Adam

Received on Wednesday, 28 April 2010 15:42:07 UTC