Re: Marginal codepoints in IRIs/URLs from Martin J. Dürst on 2012-11-08 (uri@w3.org from November 2012)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Thu, 08 Nov 2012 13:16:15 +0900
To: Anne van Kesteren <annevk@annevk.nl>
CC: David Sheets <kosmo.zb@gmail.com>, Ian Hickson <ian@hixie.ch>, "Manger, James H" <James.H.Manger@team.telstra.com>, Christophe Lauret <clauret@weborganic.com>, Jan Algermissen <jan.algermissen@nordsc.com>, Ted Hardie <ted.ietf@gmail.com>, URI <uri@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <509B320F.3090100@it.aoyama.ac.jp>

Hello Anne,

Sorry to be late with my reply.

On 2012/11/06 0:20, Anne van Kesteren wrote:
> On Mon, Nov 5, 2012 at 12:19 PM, "Martin J. Dürst"
> <duerst@it.aoyama.ac.jp>  wrote:
>> That's for U+FFF0 to U+FFFD. U+FFF0 to U+FFFC are characters that are
>> strictly reserved for internal processing; I think MS Word, among others, uses
>> these. A browser that wanted to use these to simplify internal
>> implementation would have trouble accepting them from the outside.
>
> Given the way strings in browsers are really 16-bit code units
> (Mozilla's Rust might change that, I hear) with no restrictions I
> doubt that's a problem. And given that the input to the URL parser can
> certainly contain one of those code points you have to handle them
> somehow.

Yes. But that also applies to a space, very obviously (Web pages without
spaces would be really bad, except potentially in Chinese, Japanese,
Thai, ... :-), and still spaces are not part of valid URLs.

Anyway, just for your information, here is what 
http://tools.ietf.org/html/draft-ietf-iri-3987bis currently says about 
the two classes of characters in question, in the LEIRI section 
(http://tools.ietf.org/html/draft-ietf-iri-3987bis-13#section-6.3, 
Characters Allowed in Legacy Extended IRIs but not in IRIs).

       Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
       10FFFD): Display and interpretation of these code points is by
       definition undefined without private agreement.  Therefore, these
       code points are not suited for use on the Internet.  They are not
       interoperable and may have unpredictable effects.

       Specials (U+FFF0-FFFD): These code points provide functionality
       beyond that useful in a Legacy Extended IRI, for example byte
       order identification, annotation, and replacements for unknown
       characters and objects.  Their use and interpretation in a Legacy
       Extended IRI serves no purpose and may lead to confusing display
       variations.

(actually "byte order identification" is wrong, because that's U+FFFE; I 
have fixed that in my internal copy).
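To make the two classes concrete, here is a minimal sketch in Python that flags code points falling in the ranges quoted above. The ranges are copied from the draft text; the names PRIVATE_USE_RANGES, SPECIALS_RANGE, classify and marginal_codepoints are made up for illustration and do not come from any spec or implementation.

```python
# Ranges as quoted from draft-ietf-iri-3987bis-13, section 6.3.
# All names below are hypothetical, chosen only for this sketch.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]
SPECIALS_RANGE = (0xFFF0, 0xFFFD)


def classify(ch: str) -> str:
    """Return 'private-use', 'special', or 'other' for a single character."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES):
        return "private-use"
    lo, hi = SPECIALS_RANGE
    if lo <= cp <= hi:
        return "special"
    return "other"


def marginal_codepoints(text: str) -> list:
    """List (index, 'U+XXXX', class) for every flagged character in text."""
    found = []
    for i, ch in enumerate(text):
        kind = classify(ch)
        if kind != "other":
            found.append((i, f"U+{ord(ch):04X}", kind))
    return found
```

Note that U+FFFE (the noncharacter that shows up when a byte order mark is read with the wrong byte order) is outside the Specials range as quoted, so this sketch reports it as "other".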

While we are at it, could you go through the list in the LEIRI section 
(http://tools.ietf.org/html/draft-ietf-iri-3987bis-13#section-6.3) as an 
easy way to cross-check whether there are any other differences?

Anyway, I have created two issues in our tracker:

http://trac.tools.ietf.org/wg/iri/trac/ticket/136
(Allow U+FFF0-FFFD to align with HTML)

http://trac.tools.ietf.org/wg/iri/trac/ticket/136
(Allow private-use characters outside query part to align with HTML)

Please feel free to add any additional information.

Personally, I'm fine either way. If somebody has implementations that 
have problems with adding these, they should speak up.

>> Consistency across formats is definitely a good thing. But there are some
>> serious differences between text and identifiers. Something that's harmless
>> in text (e.g. a zero-width space) may be hopeless in an IRI/URL (because it
>> creates a different address, leading to confusion).
>
> Unicode has lots of space for confusion. I'll note that HTML defines
> an identifier too and it takes any code point except for ASCII
> whitespace: http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
> Incidentally for text/html, URL fragments can be used to refer to
> it...

Noted.

>> Actually, the characters that I currently would like to exclude most (not
>> just in a spec, but actually in the browser implementations) are bidi
>> control characters. RFC 3987 disallows them, but not in the syntax. Moving
>> the restrictions to the syntax would give them more prominence. Allowing
>> them in IRIs/URLs is just a wide open door for scams and phishers.
>
> I don't really have an opinion on this. I can certainly assist filing
> bugs on implementors, but I doubt they are interested in taking this
> potential compatibility hit (if I understand correctly what you're
> proposing).

Only scammers should have any reason to use these. It's far more of a
security issue (in which browsers often show a very strong interest)
than a compatibility issue. I'll try to follow up on this in a separate
mail, but that may not be this week, sorry.
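As a rough illustration of the kind of check this would imply, here is a small Python sketch that reports bidi formatting characters in a string. It assumes the seven characters named in RFC 3987, section 4.1 (LRM, RLM, LRE, RLE, PDF, LRO, RLO); the names BIDI_FORMATTING and find_bidi_controls are hypothetical, and this is not how any browser actually does it.

```python
# Assumption: the set of bidi formatting characters is the one listed in
# RFC 3987, section 4.1. Names here are illustrative only.
BIDI_FORMATTING = {
    0x200E: "LRM",  # LEFT-TO-RIGHT MARK
    0x200F: "RLM",  # RIGHT-TO-LEFT MARK
    0x202A: "LRE",  # LEFT-TO-RIGHT EMBEDDING
    0x202B: "RLE",  # RIGHT-TO-LEFT EMBEDDING
    0x202C: "PDF",  # POP DIRECTIONAL FORMATTING
    0x202D: "LRO",  # LEFT-TO-RIGHT OVERRIDE
    0x202E: "RLO",  # RIGHT-TO-LEFT OVERRIDE
}


def find_bidi_controls(iri: str) -> list:
    """Return (index, abbreviation) for each bidi formatting character."""
    return [
        (i, BIDI_FORMATTING[ord(ch)])
        for i, ch in enumerate(iri)
        if ord(ch) in BIDI_FORMATTING
    ]
```

For example, find_bidi_controls("http://example.com/\u202Egnp.exe") reports an RLO at index 19, the classic trick for making a displayed file name read differently from the stored one.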

Regards,    Martin.

P.S.: Thanks for the pointer to Rust. Very interesting project.

Received on Thursday, 8 November 2012 04:16:52 UTC