Re: Marginal codepoints in IRIs/URLs

On Mon, Nov 5, 2012 at 12:19 PM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> That's for U+FFF0 to U+FFFD. U+FFF0 to U+FFFC are characters that are
> strictly reserved for internal processing, I think MS Word, among else, uses
> these. A browser that wanted to use these to simplify internal
> implementation would have trouble accepting them from the outside.

Given that strings in browsers are really sequences of 16-bit code
units with no restrictions (Mozilla's Rust might change that, I hear),
I doubt that's a problem. And given that the input to the URL parser
can certainly contain one of those code points, you have to handle
them somehow.
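
To make that concrete, here is a rough sketch (Python, purely for
illustration) of the two points above: a code point in U+FFF0 to U+FFFD
is just one ordinary 16-bit code unit, and a parser can do something
well-defined with it. The helper names are made up, and percent-encoding
the UTF-8 bytes is only one assumed way of "handling them", not a
statement of what any browser or spec does.

```python
from typing import List


def utf16_code_units(s: str) -> List[int]:
    """Return the 16-bit code units a UTF-16 string implementation would store."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def percent_encode_specials(s: str) -> str:
    """Percent-encode U+FFF0..U+FFFD as UTF-8 bytes; pass everything else through.

    Assumption for illustration only: this is one possible handling strategy.
    """
    out = []
    for ch in s:
        if 0xFFF0 <= ord(ch) <= 0xFFFD:
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

Nothing in the string type itself stops U+FFFC from arriving as input;
it occupies a single code unit like any other BMP character.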


> Consistency across formats is definitely a good thing. But there are some
> serious differences between text and identifiers. Something that's harmless
> in text (e.g. a zero-width space) may be hopeless in an IRI/URL (because it
> creates a different address, leading to confusion).

Unicode leaves lots of room for confusion. I'll note that HTML defines
an identifier too, and it takes any code point except for ASCII
whitespace: http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
Incidentally, for text/html, URL fragments can be used to refer to
it...
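
As a sketch of how permissive that rule is (Python for illustration;
the function name is hypothetical, and the whitespace set is HTML's
space characters: space, tab, LF, FF, CR):

```python
# HTML's "space characters": U+0020, U+0009, U+000A, U+000C, U+000D.
ASCII_WHITESPACE = frozenset(" \t\n\f\r")


def is_valid_html_id(value: str) -> bool:
    """True if value is non-empty and contains no ASCII whitespace."""
    return len(value) > 0 and not any(ch in ASCII_WHITESPACE for ch in value)
```

Under this check an id containing a zero-width space (U+200B) or
U+FFFC passes, while one containing a plain space does not.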


> Actually, the characters that I currently would like to exclude most (not
> just in a spec, but actually in the browser implementations) are bidi
> control characters. RFC 3987 disallows them, but not in the syntax. Moving
> the restrictions to the syntax would give them more prominence. Allowing
> them in IRIs/URLs is just a wide open door for scams and phishers.

I don't really have an opinion on this. I can certainly assist with
filing bugs against implementations, but I doubt implementors are
interested in taking this potential compatibility hit (if I understand
correctly what you're proposing).
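
For reference, the kind of check being discussed might look roughly
like this (Python for illustration). The character set is my reading of
the bidi formatting characters RFC 3987 rules out (LRM, RLM, LRE, RLE,
PDF, LRO, RLO); the function name is made up, and whether an
implementation would reject, strip, or escape such input is left open.

```python
# Bidi formatting characters: LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_CONTROLS = frozenset("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def find_bidi_controls(iri: str):
    """Return (index, 'U+XXXX') for each bidi control character in the input."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(iri)
        if ch in BIDI_CONTROLS
    ]
```

An empty result means none of these characters were found; a parser
that wanted to be strict could treat a non-empty result as an error.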


-- 
http://annevankesteren.nl/

Received on Monday, 5 November 2012 15:21:16 UTC