On Mon, Nov 5, 2012 at 12:19 PM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
> That's for U+FFF0 to U+FFFD. U+FFF0 to U+FFFC are characters that are
> strictly reserved for internal processing, I think MS Word, among else, uses
> these. A browser that wanted to use these to simplify internal
> implementation would have trouble accepting them from the outside.

Given the way strings in browsers are really 16-bit code units (Mozilla's Rust might change that, I hear) with no restrictions I doubt that's a problem. And given that the input to the URL parser can certainly contain one of those code points you have to handle them somehow.

> Consistency across formats is definitely a good thing. But there are some
> serious differences between text and identifiers. Something that's harmless
> in text (e.g. a zero-width space) may be hopeless in an IRI/URL (because it
> creates a different address, leading to confusion).

Unicode has lots of space for confusion. I'll note that HTML defines an identifier too and it takes any code point except for ASCII whitespace:
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
Incidentally for text/html, URL fragments can be used to refer to it...

> Actually, the characters that I currently would like to exclude most (not
> just in a spec, but actually in the browser implementations) are bidi
> control characters. RFC 3987 disallows them, but not in the syntax. Moving
> the restrictions to the syntax would give them more prominence. Allowing
> them in IRIs/URLs is just a wide open door for scams and phishers.

I don't really have an opinion on this. I can certainly assist filing bugs on implementors, but I doubt they are interested in taking this potential compatibility hit (if I understand correctly what you're proposing).

-- 
http://annevankesteren.nl/

Received on Monday, 5 November 2012 15:21:17 UTC