Re: Marginal codepoints in IRIs/URLs

On Mon, Nov 5, 2012 at 12:19 PM, "Martin J. Dürst" wrote:
> That's for U+FFF0 to U+FFFD. U+FFF0 to U+FFFC are characters that are
> strictly reserved for internal processing, I think MS Word, among else, uses
> these. A browser that wanted to use these to simplify internal
> implementation would have trouble accepting them from the outside.

Given that strings in browsers are really sequences of 16-bit code
units (Mozilla's Rust might change that, I hear) with no restrictions,
I doubt that's a problem. And given that the input to the URL parser
can certainly contain one of those code points, you have to handle them
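
To illustrate the code-unit point, here is a minimal TypeScript sketch. The function name findSpecialsCodeUnits and the example.org URL are made up for illustration and are not taken from any spec or browser. It only shows that a string of 16-bit code units carries U+FFF0 to U+FFFD like any other value, so a parser will see them in its input and has to decide what to do with them.

```typescript
// Hypothetical helper (name is an assumption, not from any spec):
// returns every 16-bit code unit in the range U+FFF0..U+FFFD found in the input.
// The range lies in the BMP and does not overlap the surrogate range
// (U+D800..U+DFFF), so inspecting code units one by one is sufficient here.
function findSpecialsCodeUnits(input: string): number[] {
  const found: number[] = [];
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xfff0 && unit <= 0xfffd) {
      found.push(unit);
    }
  }
  return found;
}

// The string type itself places no restrictions on these values:
// U+FFF9 and U+FFFD are stored and returned like any other code unit.
const sample = "http://example.org/\uFFF9a\uFFFD";
const specials = findSpecialsCodeUnits(sample);

// Even a lone surrogate is a valid one-unit string.
const loneSurrogateLength = "\uD800".length;
```

What the parser does once it finds such a code unit (reject, percent-encode, pass through) is a separate policy question that this sketch does not try to answer.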

> Consistency across formats is definitely a good thing. But there are some
> serious differences between text and identifiers. Something that's harmless
> in text (e.g. a zero-width space) may be hopeless in an IRI/URL (because it
> creates a different address, leading to confusion).

Unicode has lots of space for confusion. I'll note that HTML defines
an identifier too and it takes any code point except for ASCII
Incidentally for text/html, URL fragments can be used to refer to

> Actually, the characters that I currently would like to exclude most (not
> just in a spec, but actually in the browser implementations) are bidi
> control characters. RFC 3987 disallows them, but not in the syntax. Moving
> the restrictions to the syntax would give them more prominence. Allowing
> them in IRIs/URLs is just a wide open door for scams and phishers.

I don't really have an opinion on this. I can certainly assist filing
bugs on implementors, but I doubt they are interested in taking this
potential compatibility hit (if I understand correctly what you're
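
For reference, a check for the bidi formatting characters could be sketched as below. This is only an illustration under stated assumptions: the character list is the seven that RFC 3987 names (LRM, RLM, LRE, RLE, PDF, LRO, RLO), the names BIDI_FORMATTING and containsBidiFormatting are hypothetical, and whether a hit should mean rejecting the URL or something softer is exactly the compatibility question.

```typescript
// The seven bidi formatting characters named in RFC 3987:
// U+200E LRM, U+200F RLM, U+202A LRE, U+202B RLE, U+202C PDF,
// U+202D LRO, U+202E RLO.
// No "g" flag, so RegExp.prototype.test keeps no state between calls.
const BIDI_FORMATTING = /[\u200E\u200F\u202A-\u202E]/;

// Hypothetical helper (name is an assumption): true if the input contains
// any of the characters listed above. Ordinary right-to-left letters,
// such as Hebrew or Arabic, are not affected.
function containsBidiFormatting(iri: string): boolean {
  return BIDI_FORMATTING.test(iri);
}

// An RLO (U+202E) in a path can make the displayed text differ from the
// logical order of the characters.
const suspicious = containsBidiFormatting("http://example.org/\u202Egnp.exe");
const plain = containsBidiFormatting("http://example.org/photo.png");
```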


Received on Monday, 5 November 2012 15:21:17 UTC