Re: When is percent-encoding required.

From: Charles Lindsey <chl@clerew.man.ac.uk>
Date: Wed, 06 Jan 2010 17:33:48 -0000
To: URI <uri@w3.org>
Message-ID: <op.u54gemzr6hl8nm@clerew.man.ac.uk>

On Tue, 05 Jan 2010 07:12:37 -0000, Martin J. Dürst
<duerst@it.aoyama.ac.jp> wrote:

> Hello Charles,
> Bob and Joseph have said most that needs to be said. In summary, I don't  
> think there's anything wrong with respect to escaping in Frank's draft.  
> If you can point to anything specific, I'll have another look at it.  
> (see below for details)
> On 2010/01/05 3:12, Charles Lindsey wrote:

>> is it REQUIRED for the <sub-delims> if the particular scheme does not
>> use any of them as delimiters? RFC 3986 seems to imply not, so I would
>> expect that in
>> news:foo@bar.!#$%&'*+/=?^`{|}.example
>> (yes, "bar.!#$%&'*+/=?^`{|}.example" is a valid <dot-atom-text> and
>> hence can occur in a Message-ID) I would have to percent-encode the '#'.
>> '/' and '?', but not the others.
> Sorry, but you also have to encode characters that are not allowed in  
> URIs at all, i.e. '{', '}', '`', "'", "^", and "|". Bob mentioned these,  
> but wasn't very definitive and didn't give a reason. And of course, as  
> Bob mentioned, "%" has to be escaped.

Ah! I had not spotted that there existed characters that were neither
<reserved> nor <unreserved>. Does RFC 3986 have anything to say about
them, or does silence imply that encoding is needed?

Anyway, it seems the list of things needing percent encoding in that
example has now expanded to at least "#%'/?^`{|}", leaving just "!$&*+="
(which are all <sub-delims>) to discuss. Joseph Anthony seems to be saying
that these need NOT be encoded. Do you agree?
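As a rough illustration of the list above, here is a minimal Python sketch that percent-encodes the example message-id while leaving "!$&*+=" and "@" alone. The helper name news_uri and the choice of safe characters are assumptions made for this illustration, not anything taken from the draft. The apostrophe is encoded because this thread lists it, although the <sub-delims> production in RFC 3986 does include it, so that is merely the conservative choice.

```python
from urllib.parse import quote

# Characters left unencoded in this sketch: the sub-delims still under
# discussion in this message, plus "@". This set is an assumption.
# quote() itself never encodes ALPHA, DIGIT, "-", ".", "_" or "~".
SAFE = "!$&*+=@"

def news_uri(message_id: str) -> str:
    """Build a news: URI from a message-id given without angle brackets."""
    return "news:" + quote(message_id, safe=SAFE)

print(news_uri("foo@bar.!#$%&'*+/=?^`{|}.example"))
# news:foo@bar.!%23$%25&%27*+%2F=%3F%5E%60%7B%7C%7D.example
```

Passing an explicit safe set matters here, because the default for quote() treats "/" as safe, and this thread says "/" must be encoded.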
>> Frank seems to have taken the view that
>> all <sub-delims> need to be encoded, though he does at one point permit
>> '*' to appear unencoded (and it was indeed explicitly allowed in RFC
>> 1738), which appears to be inconsistent with his stance elsewhere

And indeed it is '*' which would be a real pain if it had to be encoded,
since it is so much used in wildmats.
> I have difficulties understanding:
>     Characters not directly allowed in this part of an
>     [RFC3986] URI have to be percent-encoded, minimally anything that is
>     not <unreserved>, no ":" (colon), and doesn't belong to the
>     <sub-delims>.

Yes, that is the paragraph I was concerned with, since the mention of
colon needs to be removed because it can no longer occur in a message-id.
So I have to rewrite it anyway, and there are just too many double
negatives in it at present for it to be comprehensible.
> I think this may be slightly better:
>     Characters not directly allowed in this part of an
>     [RFC3986] URI have to be percent-encoded. This at a minimum includes
>     anything that is not <unreserved>, is not a ":" (colon), and does
>     not belong to the <sub-delims>.

OK, that looks like a better basis to start from, but still has too many
'not's in it for my taste. But it is then clear that the <sub-delims> are
exempt, so that '*' is safe. Would it also be in order to say that one may
always percent-encode ANYTHING (even ALPHAs) if one feels like being
awkward, in which case the meaning is always the same as if the characters
were decoded before interpretation? I might even include an exhaustive
list of all the characters for which encoding is REQUIRED. Note that the totality
of allowed characters in a message-id is now just the <atext>s from RFC
5322, plus ".[]".
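On the question of encoding everything: RFC 3986 (sections 2.3 and 6.2.2.2) treats a percent-encoded unreserved character as equivalent to the character itself, though it says producers should not create such encodings. A small Python sketch of the round trip follows. The helper name encode_everything and the sample message-id are made up for this illustration.

```python
from urllib.parse import unquote

def encode_everything(text: str) -> str:
    """Percent-encode every octet, including ALPHA and DIGIT."""
    return "".join(f"%{octet:02X}" for octet in text.encode("utf-8"))

original = "foo@bar.example"   # hypothetical message-id
encoded = encode_everything(original)

print(encoded)
# %66%6F%6F%40%62%61%72%2E%65%78%61%6D%70%6C%65

# Decoding before interpretation recovers exactly the original text.
print(unquote(encoded) == original)
```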

> Also, looking e.g. at
>       mid-atext       = ALPHA / DIGIT /              ; RFC 2822 <atext>
>                         "!" / "$" / "&" / "'" /      ; allowed sub-delims
>                         "*" / "+" / "=" /            ; allowed sub-delims
>                         "-" / "_" / "~" /            ; allowed unreserved
>                         "%23" / "%25" / "%2F" /      ; "#" / "%" / "/"
>                         "%3F" / "%5E" / "%60" /      ; "?" / "^" / "`"
>                         "%7B" / "%7C" / "%7D"        ; "{" / "|" / "}"

Fortunately, all those <mid-*> rules are now gone, which makes life
considerably simpler. Message-ids in news are now identical to those in
RFC 5322, except that '>' is still forbidden.

>> And he also includes an example
>> news://news.gmane.org/p0624081dc30b8699bf9b@%5B10.20.30.108%5D
>> where I would have thought he could have shown
>> news://news.gmane.org/p0624081dc30b8699bf9b@[]
> According to RFC 3986, '[' and ']' are only allowed for IPv6 addresses,  
> i.e. inside <authority>.

Ah! I had not spotted that they were <gen-delims>.
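To make the point about '[' and ']' concrete, here is a small Python sketch that reports which characters of a path segment fall outside the <pchar> production of RFC 3986 (unreserved, pct-encoded, sub-delims, ":" and "@"). The function name disallowed_in_segment is an assumption for this illustration, and the first input is simply the decoded form of the encoded example quoted above.

```python
import re

# One <pchar> per RFC 3986:
# unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = re.compile(r"[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}")

def disallowed_in_segment(segment: str) -> list[str]:
    """Return the characters in segment that are not valid <pchar>s."""
    bad = []
    pos = 0
    while pos < len(segment):
        match = PCHAR.match(segment, pos)
        if match:
            pos = match.end()      # skip one character or one %XX triplet
        else:
            bad.append(segment[pos])
            pos += 1
    return bad

print(disallowed_in_segment("p0624081dc30b8699bf9b@[10.20.30.108]"))
# ['[', ']']
print(disallowed_in_segment("p0624081dc30b8699bf9b@%5B10.20.30.108%5D"))
# []
```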

So thanks for your help. I think I can probably rewrite that paragraph
now, and then the job is done.

Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131                       Web: http://www.cs.man.ac.uk/~chl
Email: chl@clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
Received on Wednesday, 6 January 2010 17:34:20 UTC
