RE: [QUESTION 5] Are ";" and "=" harmful characters before the "?" from Jonathan Marsh on 2007-02-23 (www-ws-desc@w3.org from February 2007)

From: Jonathan Marsh <jonathan@wso2.com>
Date: Fri, 23 Feb 2007 08:13:22 -0800
To: "'Youenn Fablet'" <youenn.fablet@crf.canon.fr>
Cc: "'keith chapman'" <keithgchapman@gmail.com>, "'www-ws-desc'" <www-ws-desc@w3.org>
Message-ID: <00fc01c75765$8ba07290$1501010a@DELLICIOUS>
below

Jonathan Marsh - http://www.wso2.com - http://auburnmarshes.spaces.live.com
 

> -----Original Message-----
> From: Youenn Fablet [mailto:youenn.fablet@crf.canon.fr]
> Sent: Friday, February 23, 2007 2:09 AM
> To: Jonathan Marsh
> Cc: 'keith chapman'; 'www-ws-desc'
> Subject: Re: [QUESTION 5] Are ";" and "=" harmful characters before the "?"
> 
> +1 for including '&' in the list.
> Concerning ';' and '=', leaving them in the list would let the client
> application decide whether to %-encode them or not.
> My question is then:  would the following uris be equivalent or not in
> the HTTP binding context?
>     1) http://example.org/name;v=1.1
>     2) http://example.org/name;v=1%2E1
>     3) http://example.org/name%3Bv=1.1
> 1 and 2 are clearly equivalent.
> What about 1 and 3?
> I would hope that they are also equivalent in the WSDL/HTTP binding
> context.
> According to section 2.2 of rfc3986, URIs that differ in the replacement of
> a reserved character with its corresponding percent-encoded octet are
> not equivalent. I would conclude that 1 and 3 are not equivalent.

Yes.  Though if they aren't equivalent, I don't know what "let the client
application decide" means - seems like an interop hit.  I infer that the
client SHOULD NOT encode them if doing so results in a URI that isn't
equivalent.
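
To make the equivalence question concrete, here is a minimal sketch of the
percent-encoding normalization described in RFC 3986 (decode escapes of
unreserved characters, leave escapes of reserved characters alone).  The
function names normalize_pct and equivalent are made up for illustration;
nothing here is taken from the WSDL spec or the test suite.

```python
import re
import string

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def normalize_pct(uri: str) -> str:
    """Decode %XX escapes of unreserved characters; keep all other escapes
    (uppercasing their hex digits) so reserved characters stay distinct."""
    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, uri)

def equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two URIs after normalization."""
    return normalize_pct(a) == normalize_pct(b)

uri1 = "http://example.org/name;v=1.1"
uri2 = "http://example.org/name;v=1%2E1"
uri3 = "http://example.org/name%3Bv=1.1"

print(equivalent(uri1, uri2))  # "." is unreserved, so %2E is decoded
print(equivalent(uri1, uri3))  # ";" is reserved, so %3B is kept as-is
```

Under this sketch 1 and 2 compare equal while 1 and 3 do not, which is the
interop concern: a client that encodes ";" produces a different URI.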

> I also note that ':' may be left unencoded as per the status quo.
> What would happen in the following case?
>     @whttp:location="{value}"
>     value parameter = 'urn:example.org'.
>     endpoint/@address="http://example.org/"
> The templating mechanism will produce an absolute uri "urn:example.org".
> The final request URI would then be "urn:example.org" while the
> intention might be to have something like
> "http://example.org/urn:example.org".

Yes, this is one of the many ways one could hang themselves.  I suspect
there are similarly unexpected cases that will arise if we escaped the :.
I'm not convinced we can really help here.
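
For reference, the case above can be reproduced with a small sketch.  It
assumes that Python's urllib.parse.urljoin is an acceptable stand-in for how
the expanded template is combined with the endpoint address; the WSDL HTTP
binding's own combination rules are not modelled here.

```python
from urllib.parse import quote, urljoin

address = "http://example.org/"  # endpoint/@address from the example
value = "urn:example.org"        # value substituted into "{value}"

# ":" left unencoded: the expanded template parses as an absolute URI
# with scheme "urn", so the base address is discarded.
unencoded = urljoin(address, value)

# ":" percent-encoded: the expanded template is a relative path segment.
encoded = urljoin(address, quote(value, safe=""))

print(unencoded)
print(encoded)
```

The first result is "urn:example.org" and the second is
"http://example.org/urn%3Aexample.org".  Note the second is still not the
literal "http://example.org/urn:example.org" form, since ":" is a reserved
character and its escaped form is not equivalent per section 2.2.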

> The following case is also interesting:
>     @whttp:location="{value}"
>     value parameter = ':8080'.
>     endpoint/@address="http://example.org"
> The final request URI would then be "http://example.org:8080" which may
> not be of practical use.

Is :8080 a valid relative URI?  Even if so, how does encoding the ":" make
it any more useful?
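
One way to check the first question against the grammar: a relative reference
that does not start with "/" must begin with a segment-nz-nc, which has no
":".  The sketch below tests only that one rule (not the full relative-ref
production), and first_segment_ok is a name invented for this example.

```python
import re

# Character sets from the RFC 3986 ABNF, written as regex class fragments.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
SEGMENT_NZ_NC = re.compile(rf"(?:[{UNRESERVED}{SUB_DELIMS}@]|{PCT_ENCODED})+")

def first_segment_ok(ref: str) -> bool:
    """Hypothetical check: does the first path segment of a rootless
    relative reference satisfy segment-nz-nc (that is, contain no colon)?"""
    first = ref.split("/", 1)[0]
    return SEGMENT_NZ_NC.fullmatch(first) is not None

for candidate in (":8080", "%3A8080", "./:8080"):
    print(candidate, first_segment_ok(candidate))
```

By this rule ":8080" fails, while "%3A8080" and "./:8080" pass.  Whether the
resulting path is any more useful to the service is a separate matter.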

> The bad thing is that client applications that escape ':' and the ones
> that do not escape ':' may come up with very different request URIs.  We
> may also run into edge cases with '@', see section 7.6 of rfc3986.
> 
> Reading section 2.2 of rfc3986 and with the above edge cases,
> I am currently thinking that it may be simpler and more interoperable to
> state that all URI reserved characters that appear in encoded parameters
> SHOULD be encoded.
> Users that do not want to encode them should have good reasons not to do
> that.

Do you mean they SHOULD be pre-encoded in the XML?  Or that implementations
SHOULD encode them but are conformant if they don't (which seems to be
contrary to your point that these are interop issues.)  Why not just say
MUST?

> I would also recommend that we promote this in our test-suite,
> especially in our message assertions.
> 
> Regards,
>     Youenn
> 
> 
> Jonathan Marsh wrote:
> > BTW, this includes an answer to QUESTION 4, and is recorded as issue
> > CR157 [1].
> >
> > [1] http://www.w3.org/2002/ws/desc/5/cr-issues/#CR157.
> >
> > Jonathan Marsh - http://www.wso2.com - http://auburnmarshes.spaces.live.com
> >
> >
> >
> >> -----Original Message-----
> >> From: Jonathan Marsh [mailto:jonathan@wso2.com]
> >> Sent: Thursday, February 22, 2007 4:33 PM
> >> To: 'Jonathan Marsh'; 'Youenn Fablet'; 'keith chapman'
> >> Cc: 'www-ws-desc'
> >> Subject: [QUESTION 5] Are ";" and "=" harmful characters before the "?"
> >> (was: RE: LocationTemplate-1G test)
> >>
> >> Summary:
> >>  - Add "&" to the pre-? encoding rule exclusion set.
> >>  - There are lots of esoteric ways to abuse templates to create
> >>    malformed URIs.  I think we should avoid that slippery slope.
> >>
> >> Analysis:
> >>
> >> Looking again at RFC 3986 [1], a path segment is defined as:
> >>
> >>    segment       = *pchar
> >>    segment-nz    = 1*pchar
> >>    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
> >>                  ; non-zero-length segment without any colon ":"
> >>
> >>    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
> >>
> >>
> >>    pct-encoded   = "%" HEXDIG HEXDIG
> >>
> >>    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
> >>
> >>    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
> >>                  / "*" / "+" / "," / ";" / "="
> >>
> >>
> >> That differs from the spec that we have only in that "&" is missing in
> >> the spec.  I think this is an omission, and that "&" should therefore be
> >> added into the pre-"?" encoding list.  That takes care of making sure any
> >> character disallowed in a path by the above BNF is properly escaped.
> >> Certain forms (path-noscheme) restrict a colon, but I don't believe that
> >> generates an error, just changes the form.
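
As a rough illustration of the rule proposed above, the exclusion set can be
written as the pchar characters from the BNF, with everything else
percent-encoded.  This is only a sketch: encode_path_value and PATH_SAFE are
invented names, and the authoritative list is whatever the spec text ends up
saying.

```python
from urllib.parse import quote

# sub-delims from RFC 3986, including the "&" proposed for addition.
SUB_DELIMS = "!$&'()*+,;="

# pchar minus pct-encoded: sub-delims plus ":" and "@".  Unreserved
# characters need not be listed; quote() never encodes them.
PATH_SAFE = SUB_DELIMS + ":@"

def encode_path_value(value: str) -> str:
    """Percent-encode (as UTF-8) every character not allowed in a segment."""
    return quote(value, safe=PATH_SAFE)

print(encode_path_value("name;v=1.1"))  # left untouched
print(encode_path_value("a&b"))         # "&" left untouched
print(encode_path_value("a/b?c#d"))     # "/", "?" and "#" are escaped
```

With "&" in the set, a value such as "a&b" passes through unchanged, while
characters outside pchar such as "/", "?" and "#" are always escaped.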
> >>
> >> There are other possibilities for templates than just path segments
> >> though:
> >>
> >>
> >> - If one were to use it for the scheme one would have to be careful not
> >> to have characters other than ALPHA / DIGIT / "+" / "-" / "." appear in
> >> the XML data, or the scheme could be malformed.  Note that any character
> >> that resulted in %-encoding would be problematic - as %-encoding doesn't
> >> seem to be allowed in the scheme production either!
> >>
> >> - If one were to use it for the authority one would be unable to specify
> >> userinfo, which disallows "@" in order to disambiguate the @ separator
> >> between the userinfo and the host.
> >>
> >> - If one were to use it for the port one would be restricted to digits
> >> only (again no %-escaping accommodation).
> >>
> >> - Edge cases all the way down here.
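
The scheme and port restrictions in the list above can be sketched directly
from the RFC 3986 productions (scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" /
"." ), port = *DIGIT).  The helper names valid_scheme and valid_port are
assumptions made for this example only.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")

# port = *DIGIT
PORT = re.compile(r"[0-9]*")

def valid_scheme(text: str) -> bool:
    """True if text matches the scheme production; "%" is never allowed."""
    return SCHEME.fullmatch(text) is not None

def valid_port(text: str) -> bool:
    """True if text matches the port production (digits only)."""
    return PORT.fullmatch(text) is not None

print(valid_scheme("http"), valid_scheme("ht%74p"))
print(valid_port("8080"), valid_port("80%380"))
```

Any template value that triggers %-encoding fails both checks, which is the
point made in the bullets: the encoding rules cannot rescue these positions.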
> >>
> >> I am inclined to ignore this - if you're doing fine-grained templating
> >> of parts prior to the path for some reason you just have to be careful.
> >> I don't think it's practical to try and flag every potential usage that
> >> can result in a malformed URI.
> >>
> >>
> >> The other half of the question then is whether any of the allowed
> >> characters should be escaped even though they don't interfere with the
> >> well-formedness of the path segment.
> >>
> >> The most relevant text is the last paragraph of section 3.3:
> >>
> >>    Aside from dot-segments in hierarchical paths, a path segment is
> >>    considered opaque by the generic syntax.  URI producing applications
> >>    often use the reserved characters allowed in a segment to delimit
> >>    scheme-specific or dereference-handler-specific subcomponents.  For
> >>    example, the semicolon (";") and equals ("=") reserved characters are
> >>    often used to delimit parameters and parameter values applicable to
> >>    that segment.  The comma (",") reserved character is often used for
> >>    similar purposes.  For example, one URI producer might use a segment
> >>    such as "name;v=1.1" to indicate a reference to version 1.1 of
> >>    "name", whereas another might use a segment such as "name,1.1" to
> >>    indicate the same.  Parameter types may be defined by scheme-specific
> >>    semantics, but in most cases the syntax of a parameter is specific to
> >>    the implementation of the URI's dereferencing algorithm.
> >>
> >> It is indeed true that a template like "name;v={version}" where version
> >> contained ";" or "=" could be difficult to work with.  But since a path
> >> segment is "considered opaque" by the generic syntax this level of
> >> checking seems overkill.  And to the extent we restrict it we'd simply
> >> force people to turn to raw mode to do things like "{segment}" where
> >> segment is "name;v=1.1".
> >>
> >> I therefore don't see a compelling advantage in restricting characters
> >> that don't break the URI syntax.
> >>
> >> [1] http://www.ietf.org/rfc/rfc3986.txt
> >>
> >>
> >> Jonathan Marsh - http://www.wso2.com - http://auburnmarshes.spaces.live.com
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: www-ws-desc-request@w3.org [mailto:www-ws-desc-request@w3.org]
> >>> On Behalf Of Jonathan Marsh
> >>> Sent: Thursday, February 22, 2007 2:01 PM
> >>> To: 'Youenn Fablet'; 'keith chapman'
> >>> Cc: 'www-ws-desc'
> >>> Subject: RE: LocationTemplate-1G test
> >>>
> >>>
> >>> Summarizing this thread, this morning's discussion, and the related
> >>> issues:
> >>>
> >>> - [FIXED] * was improperly encoded in the baseline.
> >>>
> >>> - [QUESTION 1] The spec says what characters MUST be encoded, but there
> >>> are also characters that MAY be encoded such as * (and pretty much any
> >>> other character except %).  Our test suite assumes that only the
> >>> characters that MUST be encoded are encoded.  Should we change this?
> >>> (I think we should do this opportunistically, that is, if a testcase is
> >>> proven to be correct, we simply add an alternative that matches that
> >>> implementation's encoding strategy.  I don't think we have any failures
> >>> because of this at present.)
> >>>
> >>> - [AGREED] Per the last paragraph of 6.8.1, referencing section 3.1 of
> >>> RFC 3987, some further encoding is performed after the http location
> >>> templates are resolved and combined with the {address} property.
> >>>
> >>> - [QUESTION 2] Is this sufficiently clear in the spec?  (I think so.)
> >>>
> >>> - [AGREED] Besides the extended characters encoded above, the spec says
> >>> implementations SHOULD also encode "<", ">", '"', space, "{", "}", "|",
> >>> "\", "^", and "`".  Our test suite will currently assume this SHOULD has
> >>> been followed.
> >>>
> >>> - [FIXED] There were other editorial improvements such as removing the
> >>> double negative, reordering bullets, and removing the query parameter
> >>> separator from consideration before the "?".
> >>>
> >>> - [QUESTION 3] Are there additional editorial improvements possible?
> >>> (I think so, as reported in
> >>> http://lists.w3.org/Archives/Public/www-ws-desc/2007Feb/0193.html).
> >>>
> >>> - [QUESTION 4] Is "&" a harmful character before the "?".  If not, we
> >>> should add it to the excluded list.
> >>>
> >>> - [QUESTION 5] Are ";" and "=" harmful characters before the "?".  If
> >>> so, we should remove them from the excluded list.
> >>>
> >>> I'll research proposals for 4 and 5 per my AI, but if there are any
> >>> other questions I didn't capture here, let us know!
> >>>
> >>> Jonathan Marsh - http://www.wso2.com - http://auburnmarshes.spaces.live.com
> >>>
> >>>
> >>>
> >
> >
> >
> >
Received on Friday, 23 February 2007 16:19:54 UTC