RE: scheme-specific length limits (issue 48) from Larry Masinter on 2011-04-03 (public-iri@w3.org from April 2011)

From: Larry Masinter <masinter@adobe.com>
Date: Sun, 3 Apr 2011 15:03:56 -0700
To: Adam Barth <ietf@adambarth.com>, "julian.reschke@gmx.de" <julian.reschke@gmx.de>
CC: Larry Masinter <masinter@adobe.com>, Noah Mendelsohn <nrm@arcanedomain.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, Ted Hardie <ted.ietf@gmail.com>, Tony Hansen <tony@att.com>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <C68CB012D9182D408CED7B884F441D4D05A06574B9@nambxv01a.corp.adobe.com>

I should have been more precise: I only meant this to apply to characters not otherwise allowed in URIs.

This covers:

a) for characters outside the 7-bit ASCII range: scheme definitions MUST NOT distinguish between the %-hex-encoded UTF-8 form and the Unicode character
b) for (ASCII) characters disallowed in URIs: ... MUST NOT distinguish ...

For characters allowed in URIs:

c) for (ASCII) unreserved characters allowed in URIs: ... SHOULD NOT distinguish ...
d) for reserved characters not syntactically significant for the scheme: ... MAY distinguish ...
e) for reserved characters when syntactically significant as reserved characters: ... MUST distinguish ...
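A rough sketch of how a comparison following (a) through (e) might look. The helper name normalizeForComparison is hypothetical and not taken from any draft. It assumes the reserved set from RFC 3986, and it takes the conservative reading of (d) by always keeping %-encoded reserved characters distinct from their literal forms.

```javascript
// Reserved characters (gen-delims and sub-delims), assumed from RFC 3986.
const RESERVED = new Set(":/?#[]@!$&'()*+,;=");

// Hypothetical helper: rewrite an IRI so that forms which rules (a)-(c)
// say should not be distinguished compare equal, while %-encoded
// reserved characters stay encoded (rules d and e).
function normalizeForComparison(iri) {
  // Handle each run of %XX octets as one unit so that multi-byte UTF-8
  // sequences are decoded together.
  return iri.replace(/(?:%[0-9A-Fa-f]{2})+/g, (run) => {
    let decoded;
    try {
      decoded = decodeURIComponent(run);
    } catch (err) {
      // Not valid UTF-8: leave the run encoded, only normalizing hex case.
      return run.toUpperCase();
    }
    // Re-encode reserved characters and "%" itself; keep everything else
    // (non-ASCII, disallowed ASCII, unreserved ASCII) in decoded form.
    return Array.from(decoded, (ch) =>
      RESERVED.has(ch) || ch === "%"
        ? "%" + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, "0")
        : ch
    ).join("");
  });
}

const same = (x, y) => normalizeForComparison(x) === normalizeForComparison(y);

// (a) non-ASCII: encoded and literal forms compare equal.
console.log(same("http://example.com/caf%C3%A9", "http://example.com/caf\u00E9")); // true
// (c) unreserved ASCII: "%7E" and "~" compare equal.
console.log(same("http://example.com/%7Euser", "http://example.com/~user")); // true
// (e) reserved and significant: "%3F" and "?" stay distinct.
console.log(same("http://example.com/foo%3Fbar", "http://example.com/foo?bar")); // false
```

This is only one possible reading. A scheme that uses the MAY in (d) could also decode reserved characters that carry no syntactic meaning for that scheme.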



-----Original Message-----
From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf Of Adam Barth
Sent: Sunday, April 03, 2011 1:28 PM
To: Julian Reschke
Cc: Larry Masinter; Noah Mendelsohn; Martin J. Dürst; Ted Hardie; Tony Hansen; public-iri@w3.org
Subject: Re: scheme-specific length limits (issue 48)

On Sun, Apr 3, 2011 at 1:05 PM, Julian Reschke <julian.reschke@gmx.de> wrote:
> On 03.04.2011 20:06, Adam Barth wrote:
>> On Sun, Apr 3, 2011 at 5:48 AM, Larry Masinter<masinter@adobe.com>  wrote:
>>> A scheme registration defines the syntax for URIs (IRIs) that are valid
>>> for the scheme.  A syntax definition can include limits -- that some strings
>>> are valid for the scheme and other strings are not. Those limits can be
>>> complicated, limit the repertoire of characters, be expressed in BNF, and
>>> can include length limits.
>>>
>>> Syntactic restrictions should be justified, usually by the limits of the
>>> resolution mechanism or protocol associated with a string. And we should
>>> disallow any limits (or any other syntactic restrictions) that treat %-hex
>>> encoded UTF8 characters differently than their unicode character
>>> equivalents.
>>
>> That doesn't seem correct.  For example, the http scheme treats %-hex
>> encoded UTF8 characters differently than their unicode character
>> equivalents in some cases.  Consider:
>>
>> http://example.com/foo?bar
>> http://example.com/foo%3Fbar
>>
>>> document.body.innerHTML = "<a href='http://example.com/foo%3Fbar'>boo</a>"
>>> document.body.firstChild.pathname
>>
>> "/foo%3Fbar"
>>
>>> document.body.innerHTML = "<a href='http://example.com/foo?bar'>boo</a>"
>>> document.body.firstChild.pathname
>>
>> "/foo"
>> ...
>
> No news. "?" is special in URI parsing, thus it needs to be escaped when
> it's not meant to start a query component.

Yeah, I'm not saying that behavior is surprising.  I'm saying that
Larry's requirement is violated even for very commonly used schemes.

Adam
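The same check can be run outside a browser page. This is a small sketch that assumes a runtime with the WHATWG URL class (Node.js or a current browser); the variable names are illustrative only. It also shows the non-ASCII case from (a) in Larry's reply at the top of this message, where the parser does not distinguish the two forms.

```javascript
// Reserved character: "%3F" stays in the path, a literal "?" starts the query.
const escaped = new URL("http://example.com/foo%3Fbar");
const literal = new URL("http://example.com/foo?bar");

console.log(escaped.pathname); // "/foo%3Fbar"
console.log(escaped.search);   // ""
console.log(literal.pathname); // "/foo"
console.log(literal.search);   // "?bar"

// Non-ASCII character: the literal form and its %-hex-encoded UTF-8 form
// end up with the same path.
const rawNonAscii = new URL("http://example.com/caf\u00E9");
const encodedNonAscii = new URL("http://example.com/caf%C3%A9");

console.log(rawNonAscii.pathname);                             // "/caf%C3%A9"
console.log(rawNonAscii.pathname === encodedNonAscii.pathname); // true
```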

Received on Sunday, 3 April 2011 22:05:24 UTC