Re: Marginal codepoints in IRIs/URLs from Anne van Kesteren on 2012-11-05 (uri@w3.org from November 2012)

From: Anne van Kesteren <annevk@annevk.nl>
Date: Mon, 5 Nov 2012 16:20:44 +0100
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: David Sheets <kosmo.zb@gmail.com>, Ian Hickson <ian@hixie.ch>, "Manger, James H" <James.H.Manger@team.telstra.com>, Christophe Lauret <clauret@weborganic.com>, Jan Algermissen <jan.algermissen@nordsc.com>, Ted Hardie <ted.ietf@gmail.com>, URI <uri@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <CADnb78irems7J-=6CKUN1u1QMEyfO6qrioW=iCO+EUTdJ=G1cw@mail.gmail.com>

On Mon, Nov 5, 2012 at 12:19 PM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> That's for U+FFF0 to U+FFFD. U+FFF0 to U+FFFC are characters that are
> strictly reserved for internal processing; I think MS Word, among others,
> uses these. A browser that wanted to use these to simplify internal
> implementation would have trouble accepting them from the outside.

Given that strings in browsers are really sequences of 16-bit code
units with no restrictions (Mozilla's Rust might change that, I hear),
I doubt that's a problem. And given that the input to the URL parser
can certainly contain any of those code points, you have to handle
them somehow.
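
To make the point above concrete, here is a minimal sketch in Python. It
assumes one possible way of handling these code points, namely
percent-encoding their UTF-8 bytes; the helper names are made up for
illustration and are not taken from any specification or browser.

```python
from urllib.parse import quote


def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units a UTF-16 string type would use for text."""
    return len(text.encode("utf-16-le")) // 2


def percent_encode_utf8(text: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 bytes (assumed handling)."""
    # "/" is kept as-is so that path separators survive; this choice is
    # illustrative only.
    return quote(text, safe="/")


# Every code point from U+FFF0 to U+FFFD fits in a single 16-bit code unit,
# so an unrestricted 16-bit string type can carry it without special casing.
for cp in range(0xFFF0, 0xFFFE):
    assert utf16_code_units(chr(cp)) == 1

print(percent_encode_utf8("/a\ufffcb"))
```

Nothing here rejects the input; the code point is simply carried through
in encoded form.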

> Consistency across formats is definitely a good thing. But there are some
> serious differences between text and identifiers. Something that's harmless
> in text (e.g. a zero-width space) may be hopeless in an IRI/URL (because it
> creates a different address, leading to confusion).

Unicode has lots of space for confusion. I'll note that HTML defines
an identifier too, and it accepts any code point except ASCII
whitespace: http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
Incidentally, for text/html, URL fragments can be used to refer to
it...
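
As a rough illustration of how permissive that rule is, here is a small
Python sketch. It assumes the rule is "at least one character, and no
space, tab, LF, FF, or CR"; the function names and the example.org URL are
hypothetical.

```python
from urllib.parse import unquote, urlsplit

# ASCII whitespace as assumed here: tab, LF, FF, CR, and space.
ASCII_WHITESPACE = frozenset("\t\n\f\r ")


def is_valid_html_id(value: str) -> bool:
    """Return True if value is non-empty and contains no ASCII whitespace."""
    return len(value) > 0 and not any(ch in ASCII_WHITESPACE for ch in value)


def fragment_target(url: str) -> str:
    """Return the percent-decoded fragment of url, which may name an id."""
    return unquote(urlsplit(url).fragment)


# A zero-width space (U+200B) is not ASCII whitespace, so it is accepted.
assert is_valid_html_id("a\u200bb")
assert not is_valid_html_id("a b")

target = fragment_target("http://example.org/page#caf%C3%A9")
assert is_valid_html_id(target)
```

So an id containing a zero-width space passes this check, and a fragment
can point at it.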

> Actually, the characters that I currently would like to exclude most (not
> just in a spec, but actually in the browser implementations) are bidi
> control characters. RFC 3987 disallows them, but not in the syntax. Moving
> the restrictions to the syntax would give them more prominence. Allowing
> them in IRIs/URLs is just a wide open door for scams and phishers.

I don't really have an opinion on this. I can certainly assist with
filing bugs against implementations, but I doubt implementors are
interested in taking this potential compatibility hit (if I understand
correctly what you're proposing).
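
For reference, the kind of check being discussed might look like the
Python sketch below. It assumes the set of bidirectional formatting
characters that RFC 3987 names (LRM, RLM, LRE, RLE, PDF, LRO, RLO); the
function name and the example.org URL are hypothetical, and this is not
how any browser currently behaves.

```python
from typing import List, Tuple

# Assumed set: LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_FORMATTING = frozenset("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def find_bidi_formatting(iri: str) -> List[Tuple[int, str]]:
    """Return (index, "U+XXXX") pairs for each bidi formatting character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(iri)
        if ch in BIDI_FORMATTING
    ]


# A right-to-left override hidden in the path is reported with its position.
hits = find_bidi_formatting("http://example.org/a\u202eb")
assert hits == [(20, "U+202E")]
```

Whether a hit should cause rejection, stripping, or just a warning is
exactly the compatibility question.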

-- 
http://annevankesteren.nl/

Received on Monday, 5 November 2012 15:21:17 UTC