Re: When is percent-encoding required. from Joseph Anthony Pasquale Holsten on 2010-01-04 (uri@w3.org from January 2010)

From: Joseph Anthony Pasquale Holsten <joseph@josephholsten.com>
Date: Mon, 4 Jan 2010 15:12:01 -0600
To: uri@w3.org
Message-ID: <hhtlj0$jc8$1@ger.gmane.org>
"Charles Lindsey" <chl@clerew.man.ac.uk> said:
> Draft-ellermann-news-nntp-uri-11.txt is currently going through AUTH48  
> and, since Frank Ellermann seems not to have been heard from for more 
> than  a year, and cannot be contacted, I am getting the job of seeing 
> what needs  to be done (most notably changes necessitated by the AUTH48 
> changes in RFC  5536).

Sorry to hear Ellermann hasn't turned up. I'm glad you're pushing forward.

> I find the question of just what needs to be percent-encoded is hard to 
>  deduce from RFC 3986. Clearly, anything in <gen-delims> MUST be  
> percent-encoded except when used as delimiters, so that agents can 
> divide  a URI into scheme, authority, path, query, and fragment 
> components even  before they recognise that it is a news or nntp URI. 
> But is it REQUIRED  for the <sub-delims> if the particular scheme does 
> not use any of them as  delimiters? RFC 3986 seems to imply not, so I 
> would expect that in
>     news:foo@bar.!#$%&'*+/=?^`{|}.example
> (yes, "bar.!#$%&'*+/=?^`{|}.example" is a valid <dot-atom-text> and 
> hence  can occur in a Message-ID) I would have to percent-encode the 
> '#', '/' and '?', but not the others. Frank seems to have taken the 
> view that all <sub-delims> need to be encoded, though he does at one 
> point permit '*' to appear unencoded (and it was indeed explicitly 
> allowed in RFC 1738), which appears to be inconsistent with his stance 
> elsewhere.
> 
> And he also includes an example
>     news://news.gmane.org/p0624081dc30b8699bf9b@%5B10.20.30.108%5D
> where I would have thought he could have shown
>     news://news.gmane.org/p0624081dc30b8699bf9b@[10.20.30.108]
> 
> So exactly what latitude does RFC 3986 permit in these situations?

If you never need to express any of the reserved characters as data in 
your segments, you have no need to allow percent-encoding in the 
definitions of those segments. Sub-delims don't need to be 
percent-encoded unless you are using them as delimiters.
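
A minimal sketch of that rule in Python, assuming the article part is treated as an ordinary RFC 3986 path segment. The helper name encode_msgid and the constant PCHAR_EXTRAS are made up for illustration; only urllib.parse.quote is real. Sub-delims, ":" and "@" are left alone. Note that besides '#', '/' and '?', the '%' itself and the characters that are neither reserved nor unreserved ('^', '`', '{', '|', '}') also come out encoded, since RFC 3986 does not permit them literally.

```python
from urllib.parse import quote, unquote

# Characters RFC 3986 allows literally in a path segment besides the
# unreserved set: the sub-delims plus ":" and "@".
# quote() already leaves ALPHA, DIGIT, "-", ".", "_" and "~" untouched.
PCHAR_EXTRAS = "!$&'()*+,;=:@"

def encode_msgid(msgid: str) -> str:
    """Hypothetical helper: percent-encode a msg-id-core for a news URI."""
    return quote(msgid, safe=PCHAR_EXTRAS)

encoded = encode_msgid("foo@bar.!#$%&'*+/=?^`{|}.example")
print("news:" + encoded)
# news:foo@bar.!%23$%25&'*+%2F=%3F%5E%60%7B%7C%7D.example

# Decoding restores the original msg-id-core.
print(unquote(encoded))
```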

But practically, you need to write the definitions to allow 
percent-encoding in all your segments. Looking at your section 4, your 
news: syntax is quite busted. At the moment, it does not allow percent 
encoding for characters that don't have to be encoded. I can appreciate 
not wanting to allow "." to be percent-encoded in mid-left, but 
forbidding it in mid-atext is just asking for naive implementors to 
build invalid news URIs. RFC 3986 §2.4 explicitly mentions that, 'For example, the 
octet corresponding to the tilde ("~") character is often encoded as 
"%7E" by older URI processing implementations; the "%7E" can be 
replaced by "~" without changing its interpretation.'
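
To make the "%7E" point concrete, here is a rough sketch of the normalization a consumer can apply: percent-encoded triplets that decode to unreserved characters are replaced by the character itself, and everything else stays encoded. The function name normalize_unreserved is my own, not something from the draft or from RFC 3986.

```python
import re
import string

# RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def normalize_unreserved(uri: str) -> str:
    """Hypothetical helper: decode only triplets that map to unreserved chars."""
    def repl(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        # Leave reserved and other octets encoded, but uppercase the hex digits.
        return ch if ch in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, uri)

print(normalize_unreserved("news:foo%7Ebar@example.com"))  # news:foo~bar@example.com
print(normalize_unreserved("news:a%2fb@example.com"))      # news:a%2Fb@example.com
```

A grammar that rejects "%7E" in the article part would make the first URI invalid even though it is equivalent to the second form after this step.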

IMHO, it's often not worth defining these things quite so formally at 
the URI level. I'd rather you just say that an article must be a (not 
necessarily completely) percent-encoded representation of a usefor 
msg-id-core than jump through the hoops you are now. Few people 
actually write their parsers to the grammars in these specs, so they'll 
usually be catching this error later in processing. I figure you're 
dealing with existing implementations, so it's a better use of time to 
point them at usefor and mention any caveats.

If you're going to be rigorous, then you'll need to define segments 
like article and newsgroups with the exact same syntax as usefor, being 
liberal in which delimiters are allowed. They should also include the 
entire range of allowable percent-encoded triplets. Then list all the 
things that they SHOULD NOT put into URIs, like percent-encoding 
something in ALPHA.

Which brings us to a higher level critique of the operational semantics 
defined by this spec. Are these URIs just for identifying articles, 
like a urn:uuid: or urn:sha1:? Should a user agent be able to retrieve 
arbitrary articles? What happens when I try to access the mythical 
<news:foo@bar.!#$%&'*+/=?^`{|}.example>? Does this refer to NNTP 
messages being sent? What are the error conditions that may be caused 
by the impedance mismatch between URIs and NNTP? I see some mention of failure 
in the security considerations, so that's good. Thinking about how the 
user agents actually handle URIs is the best guidance for writing these 
specs.

As for your two examples, both open fine in my newsreader. I'd hate for 
the spec to disagree without just cause.
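
For what it's worth, a lenient consumer ends up with the same article either way. A small sketch, assuming the article is simply whatever follows the last "/" (article_id is a hypothetical helper, not how any particular newsreader does it):

```python
from urllib.parse import unquote

def article_id(news_uri: str) -> str:
    """Hypothetical helper: take the part after the last "/" and decode it."""
    return unquote(news_uri.rsplit("/", 1)[-1])

encoded = "news://news.gmane.org/p0624081dc30b8699bf9b@%5B10.20.30.108%5D"
literal = "news://news.gmane.org/p0624081dc30b8699bf9b@[10.20.30.108]"

print(article_id(encoded) == article_id(literal))  # True
```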
-- 
Joseph Holsten
http://josephholsten.com
mailto:joseph@josephholsten.com
tel:+1-918-948-6747
Received on Monday, 4 January 2010 21:25:36 UTC