Re: When is percent-encoding required.

From: Tom Petch <nwnetworks@dial.pipex.com>
Date: Fri, 29 Jan 2010 18:49:08 +0100
Message-ID: <002601caa10b$5eb1c020$0601a8c0@allison>
To: "URI" <uri@w3.org>, "Charles Lindsey" <chl@clerew.man.ac.uk>
---- Original Message -----
From: "Charles Lindsey" <chl@clerew.man.ac.uk>
Sent: Friday, January 15, 2010 6:16 PM

> On Wed, 13 Jan 2010 18:09:50 -0000, Julien ÉLIE <julien@trigofacile.com>
> wrote:
>
> > Hi Charles,
> >
> >> Here is the wording I now propose:
> >>
> >> According to [RFC 3986], characters that are in <gen-delims> (a subset
> >> of <reserved>) MUST be percent-encoded (though it is not wrong to
> >> encode others). Specifically, the characters allowed in <msg-id-core>
> >> that must be encoded are
> >>     "/"  "?"  "#"  "[" and "]"
> >> Note that an agent which seeks to interpret a 'news' URI needs to
> >> decode all these percent-encoded characters before passing it on to an
> >> NNTP server to be acted upon.
> >>
> >> Comments anyone?
> >
> > MUSTn't "%" also be encoded?
>
> Ah yes! That pesky '%' which, for some strange reason, is not included in
> <gen-delims>
> >
> > I see in to-be RFC 5538:
> >
> >      mid-left        = 1*( mid-atext / "." ) /      ; <dot-atom-text>
> >                        ( "%22" mid-quote "%22" )    ; <no-fold-quote>
> >      mid-right       = 1*( mid-atext / "." ) /      ; <dot-atom-text>
> >                        ( "%5B" mid-literal "%5D" )  ; <no-fold-literal>
> >      mid-atext       = ALPHA / DIGIT /              ; RFC 2822 <atext>
> >                        "!" / "$" / "&" / "'" /      ; allowed sub-delims
> >                        "*" / "+" / "=" /            ; allowed sub-delims
> >                        "-" / "_" / "~" /            ; allowed unreserved
> >                        "%23" / "%25" / "%2F" /      ; "#" / "%" / "/"
> >                        "%3F" / "%5E" / "%60" /      ; "?" / "^" / "`"
> >                        "%7B" / "%7C" / "%7D"        ; "{" / "|" / "}"
> >
> Well, the final form of RFC 5538 is reverting to the <msg-id-core> syntax
> of RFC 5537. So the cases we are actually interested in are the
> intersection of (<gen-delims> plus '%') with <atext>. But that indeed does
> include '%'.
>
> > but if I have a message-ID that contains "%23", isn't it mandatory to
> > convert it into "%2523" (URI)?
>
> But of course "%23" is not in <atext>, whatever nonsense we might have had
> in <mid-atext>.
>
> So here is another attempt at my wording:
>
> According to [RFC 3986], characters that are in <gen-delims> (a subset
> of <reserved>), together with the character "%", MUST be percent-encoded
> (though it is not wrong to encode others).

Apologies for coming to this so late but I do not think that this statement
should pass unchallenged. One known exception is the use of [ and ] in IPv6
addresses and there could be others.
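To make the IPv6 exception concrete, here is a minimal Python sketch using the
standard library's urllib.parse. The URI itself is only an illustrative
example (2001:db8::/32 is the documentation prefix); the point is that "[" and
"]" appear literally, not percent-encoded, around the address.

```python
from urllib.parse import urlsplit

# An IPv6 literal in the authority is delimited by unencoded "[" and "]".
parts = urlsplit("http://[2001:db8::1]:8080/index.html")

netloc = parts.netloc   # authority exactly as written, brackets included
host = parts.hostname   # address with the bracket delimiters stripped
port = parts.port       # port parsed from after the closing "]"
```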

I do not find RFC 3986 an easy read, but careful study suggests that what it
says is

a) No character may appear in a URI unless there is an ABNF rule saying that it
may and, at most, that character set is limited to reserved and unreserved.

b) URIs which differ in having a reserved character percent encoded or not are
not equivalent.

So a scheme can require reserved characters within a component to be
percent-encoded but that then becomes a MUST for that scheme, else you do not
have interoperability.
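
As a rough illustration of what such a scheme-level MUST amounts to, here is a
minimal Python sketch that encodes exactly the six characters named in the
proposed wording and decodes them again before the message-ID would be handed
to an NNTP server. The names MUST_ENCODE, encode_msg_id and decode_msg_id are
hypothetical, ASCII-only message-IDs are assumed, and this is not taken from
any RFC or implementation.

```python
from urllib.parse import unquote

# Characters the proposed wording says must be percent-encoded within
# <msg-id-core>: the relevant gen-delims plus "%" (assumption: ASCII input).
MUST_ENCODE = "/?#[]%"

def encode_msg_id(msg_id: str) -> str:
    """Percent-encode only the characters listed in MUST_ENCODE."""
    return "".join(
        f"%{ord(ch):02X}" if ch in MUST_ENCODE else ch
        for ch in msg_id
    )

def decode_msg_id(encoded: str) -> str:
    """Undo one level of percent-encoding, as an interpreting agent would."""
    return unquote(encoded)

# A message-ID that literally contains the three characters "%23".
original = "a%23b/c@example.com"
encoded = encode_msg_id(original)
decoded = decode_msg_id(encoded)
```

Because "%" is itself in the set, a literal "%23" in a message-ID becomes
"%2523" in the URI, and a single decoding pass restores the original rather
than turning it into "#".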

In describing the treatment of characters, the RFC makes no distinction between
the two subsets of reserved (gen-delims and sub-delims).  The only difference is
an administrative one: the subset known as sub-delims appears as a set in
several of the rules for data in components, making them explicitly allowed.  A
specific scheme can take a different view (e.g. requiring all sub-delims to be
percent-encoded within the data of a component).

Tom Petch

> Specifically, the characters allowed in <msg-id-core>
> that must be encoded are
>      "/"  "?"  "#"  "[" "]" and "%"
> Note that an agent which seeks to interpret a 'news' URI needs to
> decode all these percent-encoded characters before passing it on to an
> NNTP server to be acted upon.
>
> --
> Charles H. Lindsey
Received on Friday, 29 January 2010 18:50:08 UTC