Re: Marginal codepoints in IRIs/URLs from Anne van Kesteren on 2012-11-08 (public-iri@w3.org from November 2012)

From: Anne van Kesteren <annevk@annevk.nl>
Date: Thu, 8 Nov 2012 09:44:54 +0100
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: David Sheets <kosmo.zb@gmail.com>, Ian Hickson <ian@hixie.ch>, "Manger, James H" <James.H.Manger@team.telstra.com>, Christophe Lauret <clauret@weborganic.com>, Jan Algermissen <jan.algermissen@nordsc.com>, Ted Hardie <ted.ietf@gmail.com>, URI <uri@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <CADnb78gwuT80cdB8cxhJbO9JD-qozYhPysqqFtu+7Hqf1FZC9w@mail.gmail.com>

On Thu, Nov 8, 2012 at 5:16 AM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> Sorry to be late with my reply.

No worries!

> On 2012/11/06 0:20, Anne van Kesteren wrote:
>> On Mon, Nov 5, 2012 at 12:19 PM, "Martin J. Dürst"
>> <duerst@it.aoyama.ac.jp>  wrote:
>> Given the way strings in browsers are really 16-bit code units
>> (Mozilla's Rust might change that, I hear) with no restrictions I
>> doubt that's a problem. And given that the input to the URL parser can
>> certainly contain one of those code points you have to handle them
>> somehow.
>
> Yes. But that also applies to a space, very obviously (Web pages without
> spaces would be really bad, except potentially in Chinese, Japanese,
> Thai,...:-), but still these are not part of valid URLs.

My current view is that it mostly makes sense to restrict certain code
points in the ASCII range as those are used as delimiters throughout
the ecosystem. HTML/Python use quotation marks, HTTP uses the colon
and whitespace, etc. So by putting the restrictions there, you make it
easy to copy and paste a URL around.

> While we are at it, could you go through the list in the LEIRI section
> (http://tools.ietf.org/html/draft-ietf-iri-3987bis-13#section-6.3) as an
> easy way to cross-check whether there are any other differences?

So LEIRIs are an even larger superset of IRIs. "\" seems problematic,
as passing that to a URL parser results in it being handled as if it
were a "/". (I suppose we could make the parser handle that via a
flag, or you could replace "\" with "%5C" before handing it to the
parser.)
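
A minimal sketch of the second option, in Python: percent-encode "\"
before the string reaches a URL parser. The helper name
escape_backslashes is hypothetical and not part of any parser API.

```python
def escape_backslashes(reference: str) -> str:
    """Percent-encode every backslash so a parser cannot treat it as "/"."""
    return reference.replace("\\", "%5C")


print(escape_backslashes("http://example.com/a\\b"))
```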

U+0009, U+000A, and U+000D are pretty much always dropped on the floor
by a URL parser so those would be problematic too.
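
As a rough illustration of that behaviour (a sketch, not any particular
browser's code), dropping those three code points looks like this in
Python. The names DROPPED and strip_tab_and_newline are made up for the
example.

```python
# Code points named above: U+0009 (tab), U+000A (LF), U+000D (CR).
DROPPED = {0x09: None, 0x0A: None, 0x0D: None}


def strip_tab_and_newline(reference: str) -> str:
    """Remove tab, LF, and CR wherever they occur in the input."""
    return reference.translate(DROPPED)


print(strip_tab_and_newline("http://exa\tmple.com/\r\npath"))
```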

I am surprised [ and ] are not allowed. mailto:a@b?subject=[test]%20
is something I semi-frequently write, and I keep forgetting that I
need to escape [ and ] to make it valid (I have never had it fail
anything but the validator, though).
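
For reference, a small Python sketch of the escaped form, using the
standard library's urllib.parse.quote. Passing safe="" is a choice made
for this example so that nothing beyond the unreserved characters is
left unescaped.

```python
from urllib.parse import quote

subject = "[test] "
# quote() percent-encodes "[" as %5B, "]" as %5D, and the space as %20.
url = "mailto:a@b?subject=" + quote(subject, safe="")
print(url)  # mailto:a@b?subject=%5Btest%5D%20
```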

>> I don't really have an opinion on this. I can certainly assist filing
>> bugs on implementors, but I doubt they are interested in taking this
>> potential compatibility hit (if I understand correctly what you're
>> proposing).
>
> Only scammers should have any reason to use these. It's way more a security
> issue (in which browsers often show a very strong interest) than a
> compatibility issue. I'll try to follow up on this in a separate mail, but
> that may not be this week, sorry.

What would be interesting is a list of the affected code points and the
expected results. There are a few cases currently where the URL parser
has a hard fail, e.g. if you resolve "/test" against "about:blank". We
could expand that to include these code points, I suppose, but it seems
like a major risk.
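
To make the "hard fail" idea concrete, here is a hedged Python sketch.
It is not the URL parser's actual algorithm: resolve_or_fail is a
hypothetical helper, and the test for an unusable base (no authority
and no leading "/" in the path) is an assumption chosen so that
"about:blank" is rejected.

```python
from urllib.parse import urljoin, urlsplit


def resolve_or_fail(reference: str, base: str) -> str:
    """Resolve reference against base, raising instead of guessing."""
    if urlsplit(reference).scheme:
        return reference  # already absolute, the base is not needed
    parts = urlsplit(base)
    # Assumption: a base such as "about:blank" has no authority and a
    # path without a leading "/", so it cannot anchor a relative reference.
    if not parts.netloc and not parts.path.startswith("/"):
        raise ValueError(f"cannot resolve {reference!r} against {base!r}")
    return urljoin(base, reference)


print(resolve_or_fail("/test", "http://example.com/a/b"))
```

Calling resolve_or_fail("/test", "about:blank") raises ValueError in
this sketch. Extending the failure to further code points would mean
adding a check on the reference before resolving it.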

-- 
http://annevankesteren.nl/

Received on Thursday, 8 November 2012 08:45:27 UTC