Marginal codepoints in IRIs/URLs from Martin J. Dürst on 2012-11-05 (uri@w3.org from November 2012)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Mon, 05 Nov 2012 20:19:35 +0900
To: Anne van Kesteren <annevk@annevk.nl>
CC: David Sheets <kosmo.zb@gmail.com>, Ian Hickson <ian@hixie.ch>, "Manger, James H" <James.H.Manger@team.telstra.com>, Christophe Lauret <clauret@weborganic.com>, Jan Algermissen <jan.algermissen@nordsc.com>, Ted Hardie <ted.ietf@gmail.com>, URI <uri@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <5097A0C7.90402@it.aoyama.ac.jp>

Hello Anne,

On 2012/11/05 19:31, Anne van Kesteren wrote:
> On Mon, Nov 5, 2012 at 10:53 AM, "Martin J. Dürst"
> <duerst@it.aoyama.ac.jp>  wrote:
>> Private Unicode ranges were originally banned everywhere, because they are
>> not intended for public interchange. We allowed them in the query part,
>> because sometimes you may want to use them as a payload. That's how we got
>> to where we are. [If it interests you, this happened in August 2003, see
>> http://tools.ietf.org/html/draft-duerst-iri-03 and
>> http://tools.ietf.org/rfcdiff?url2=draft-duerst-iri-03.txt.]
>>
>> If you have a good reason to change that, please tell us.
>
> Alignment with HTML.

> There's actually another change required for
> that, see https://www.w3.org/Bugs/Public/show_bug.cgi?id=19743 for
> details.

That's for U+FFF0 to U+FFFD. U+FFF0 to U+FFFC are characters that are
strictly reserved for internal processing; I think MS Word, among
others, uses these. A browser that wanted to use these to simplify its
internal implementation would have trouble accepting them from the
outside. Of course, in IRIs/URLs, they make even less sense. I'd like
somebody with some MS software to try and see what happens if they open
an HTML document with some of these inside.

U+FFFD is the replacement character. It's difficult to disallow that in
text. In identifiers, it doesn't make much sense.

> I'm also happy for HTML to change, but it seems to me that
> for code points higher than U+007F we should have some kind of
> consistent set of rules across syntaxes, unless the code points
> are problematic for that particular format.

Yes, that makes sense. On the other hand, implementers who work
independently of HTML need a standalone definition.

>> Looking at the bigger picture, there are literally dozens of groups of
>> characters/codepoints like private use characters in Unicode that are almost
>> never used, and almost always a bad idea, in IRIs. We could spend lots of
>> hours discussing the merit of including or excluding them, but I think we
>> can use our time for better stuff.
>
> I'm not interested in a code-point-by-code-point discussion, just the
> bigger picture, and consistency in requirements across the formats we
> develop.

Consistency across formats is definitely a good thing. But there are
some serious differences between text and identifiers. Something that's
harmless in text (e.g. a zero-width space) may be harmful in an IRI/URL
(because it creates a different address that looks the same, leading to
confusion).
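
To make the zero-width space point concrete, here is a minimal sketch. The
example.org addresses and the helper name percent_encoded are made up for
illustration; the only library call is urllib.parse.quote from the Python
standard library.

```python
from urllib.parse import quote

# Two addresses that render identically in most fonts; the second one
# contains U+200B ZERO WIDTH SPACE between "foo" and "bar".
plain = "http://example.org/foobar"
lookalike = "http://example.org/foo\u200bbar"


def percent_encoded(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8, keeping ':' and '/'."""
    return quote(iri, safe=":/")


# The strings differ, so they identify different resources.
print(plain == lookalike)          # False
print(percent_encoded(plain))      # http://example.org/foobar
print(percent_encoded(lookalike))  # http://example.org/foo%E2%80%8Bbar
```

In running text the extra character would go unnoticed; as an identifier it
silently changes the address.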

Of course, I have to admit that in the IRI spec, we only excluded the 
most egregious of these (private use characters in most parts, 
U+FFF0-FFFD everywhere).
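
As a rough sketch of those two exclusions (not a full RFC 3987 validator):
the function name marginal_codepoints, the reason strings and the simple
'?'/'#' splitting are assumptions made for this illustration. The
private-use ranges are the ones listed for iprivate in RFC 3987.

```python
# Private-use ranges (iprivate in RFC 3987), allowed in the query part only.
PRIVATE_USE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]
# U+FFF0 to U+FFFD, excluded everywhere.
SPECIALS = (0xFFF0, 0xFFFD)


def marginal_codepoints(iri: str):
    """Return (index, code point, reason) for each excluded character."""
    # The query runs from the first '?' up to the first '#' (or the end).
    frag_start = iri.find("#")
    if frag_start == -1:
        frag_start = len(iri)
    q = iri.find("?", 0, frag_start)
    query_start = q if q != -1 else frag_start  # empty range if no query

    problems = []
    for i, ch in enumerate(iri):
        cp = ord(ch)
        in_query = query_start <= i < frag_start
        if SPECIALS[0] <= cp <= SPECIALS[1]:
            problems.append((i, cp, "U+FFF0-FFFD"))
        elif any(lo <= cp <= hi for lo, hi in PRIVATE_USE) and not in_query:
            problems.append((i, cp, "private use outside query"))
    return problems


print(marginal_codepoints("http://example.org/?x=\ue000"))  # []
print(marginal_codepoints("http://example.org/\ue000"))
```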

Actually, the characters that I currently would like to exclude most
(not just in a spec, but actually in the browser implementations) are
bidi control characters. RFC 3987 disallows them, but only in prose, not
in the syntax. Moving the restrictions to the syntax would give them
more prominence. Allowing them in IRIs/URLs is just a wide open door for
scammers and phishers.
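
A check an implementation could apply might look like the sketch below. The
set covers the seven bidi formatting characters that RFC 3987 names (LRM,
RLM, LRE, RLE, LRO, RLO, PDF); the function name find_bidi_controls and the
example address are hypothetical.

```python
# Bidi formatting characters named in RFC 3987.
BIDI_CONTROLS = {
    "\u200e": "LRM",
    "\u200f": "RLM",
    "\u202a": "LRE",
    "\u202b": "RLE",
    "\u202c": "PDF",
    "\u202d": "LRO",
    "\u202e": "RLO",
}


def find_bidi_controls(iri: str):
    """Return (index, abbreviation) for each bidi control character found."""
    return [(i, BIDI_CONTROLS[ch]) for i, ch in enumerate(iri)
            if ch in BIDI_CONTROLS]


# An RLO makes the following characters display in reverse order, so the
# visible file name no longer matches the address that is actually requested.
suspicious = "http://example.com/\u202egnp.exe"
print(find_bidi_controls(suspicious))                 # [(19, 'RLO')]
print(find_bidi_controls("http://example.com/a.png"))  # []
```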

Regards,   Martin.

Received on Monday, 5 November 2012 11:20:19 UTC