Re: [QUESTION 5] Are ";" and "=" harmful characters before the "?" from Youenn Fablet on 2007-02-23 (www-ws-desc@w3.org from February 2007)

From: Youenn Fablet <youenn.fablet@crf.canon.fr>
Date: Fri, 23 Feb 2007 11:08:58 +0100
To: Jonathan Marsh <jonathan@wso2.com>
Cc: "'keith chapman'" <keithgchapman@gmail.com>, "'www-ws-desc'" <www-ws-desc@w3.org>
Message-id: <45DEBD3A.6050601@crf.canon.fr>
+1 for including '&' in the list.
Concerning ';' and '=', leaving them in the list would let the client 
application decide whether to %-encode them or not.
My question is then:  would the following uris be equivalent or not in 
the HTTP binding context?
    1) http://example.org/name;v=1.1
    2) http://example.org/name;v=1%2E1
    3) http://example.org/name%3Bv=1.1
1 and 2 are clearly equivalent, since %2E is just the percent-encoding 
of the unreserved character '.'.
What about 1 and 3?
I would hope that they are also equivalent in the WSDL/HTTP binding context.
However, according to section 2.2 of rfc3986, URIs that differ in the 
replacement of a reserved character with its corresponding 
percent-encoded octet are not equivalent. I would conclude that 1 and 3 
are not equivalent.

I also note that ':' may be left unencoded as per the status quo.
What would happen in the following case?
    @whttp:location="{value}"
    value parameter = 'urn:example.org'.
    endpoint/@address="http://example.org/"
The templating mechanism will produce an absolute uri "urn:example.org".
The final request URI would then be "urn:example.org" while the 
intention might be to have something like 
"http://example.org/urn:example.org".
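
A minimal sketch of this case, assuming the templated location is 
resolved against the endpoint address as an rfc3986 relative reference. 
Python's urllib.parse.urljoin stands in for that resolution step here; 
it is not the WSDL binding algorithm itself.

```python
from urllib.parse import urljoin

address = "http://example.org/"

# ':' left unencoded: the location parses as an absolute URI with scheme
# "urn", so the endpoint address is ignored entirely.
unescaped = urljoin(address, "urn:example.org")

# ':' percent-encoded: the location is a plain relative path segment.
escaped = urljoin(address, "urn%3Aexample.org")
```

Here unescaped is "urn:example.org", whereas escaped is 
"http://example.org/urn%3Aexample.org".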

The following case is also interesting:
    @whttp:location="{value}"
    value parameter = ':8080'.
    endpoint/@address="http://example.org"
The final request URI would then be "http://example.org:8080" which may 
not be of practical use.
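
A small sketch of this case, assuming (as the example above implies) 
that the location is simply appended to the endpoint address. The 
variable names are illustrative only.

```python
from urllib.parse import urlsplit

address = "http://example.org"
location = ":8080"  # templated value with ':' left unencoded

# Assumption: plain string concatenation of address and location.
request_uri = address + location
parts = urlsplit(request_uri)

# The value meant as path content has been absorbed into the authority.
host, port, path = parts.hostname, parts.port, parts.path
```

The parameter value ends up as the port of the authority component, and 
the path is empty.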

The bad thing is that client applications that escape ':' and those 
that do not may come up with very different request URIs. We may also 
run into edge cases with '@'; see section 7.6 of rfc3986.

Reading section 2.2 of rfc3986 and with the above edge cases,
I am currently thinking that it may be simpler and more interoperable to 
state that all URI reserved characters that appear in encoded parameters 
SHOULD be encoded.
Users who choose not to encode them should have good reasons for doing 
so.
I would also recommend that we promote this in our test-suite, 
especially in our message assertions.
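
A minimal sketch of what encoding every reserved character in a 
parameter value could look like. The function fill_template is 
hypothetical and handles only simple "{name}" placeholders; it ignores 
the other template features of the HTTP binding (raw mode, escaped 
braces).

```python
from urllib.parse import quote

def fill_template(template: str, params: dict) -> str:
    """Substitute {name} placeholders, percent-encoding each value fully."""
    result = template
    for name, value in params.items():
        # safe="" leaves only unreserved characters unencoded, so every
        # rfc3986 reserved character in the value gets percent-encoded.
        result = result.replace("{" + name + "}", quote(value, safe=""))
    return result

urn_case = fill_template("{value}", {"value": "urn:example.org"})
port_case = fill_template("{value}", {"value": ":8080"})
version_case = fill_template("name;v={version}", {"version": "1;1=2"})
```

Reserved characters written in the template itself (the ';' and '=' in 
"name;v={version}") are left alone; only those coming from parameter 
values are encoded.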

Regards,
    Youenn


Jonathan Marsh wrote:
> BTW, this includes an answer to QUESTION 4, and is recorded as issue CR157
> [1].
>
> [1] http://www.w3.org/2002/ws/desc/5/cr-issues/#CR157.
>
> Jonathan Marsh - http://www.wso2.com - http://auburnmarshes.spaces.live.com
>  
>
>   
>> -----Original Message-----
>> From: Jonathan Marsh [mailto:jonathan@wso2.com]
>> Sent: Thursday, February 22, 2007 4:33 PM
>> To: 'Jonathan Marsh'; 'Youenn Fablet'; 'keith chapman'
>> Cc: 'www-ws-desc'
>> Subject: [QUESTION 5] Are ";" and "=" harmful characters before the "?"
>> (was: RE: LocationTemplate-1G test)
>>
>> Summary:
>>  - Add "&" to the pre-? encoding rule exclusion set.
>>  - There are lots of esoteric ways to abuse templates to create
>>    malformed URIs.  I think we should avoid that slippery slope.
>>
>> Analysis:
>>
>> Looking again at RFC 3986 [1], a path segment is defined as:
>>
>>    segment       = *pchar
>>    segment-nz    = 1*pchar
>>    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
>>                  ; non-zero-length segment without any colon ":"
>>
>>    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
>>
>>
>>    pct-encoded   = "%" HEXDIG HEXDIG
>>
>>    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
>>
>>    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>>                  / "*" / "+" / "," / ";" / "="
>>
>>
>> That differs from the spec that we have only in that "&" is missing in the
>> spec.  I think this is an omission, and that "&" should therefore be added
>> into the pre-"?" encoding list.  That takes care of making sure any
>> character disallowed in a path by the above BNF is properly escaped.
>> Certain forms (path-noscheme) restrict a colon, but I don't believe that
>> generates an error, just changes the form.
>>
>> There are other possibilities for templates than just path segments
>> though:
>>
>>
>> - If one were to use it for the scheme one would have to be careful not to
>> have characters other than ALPHA / DIGIT / "+" / "-" / "." appear in the
>> XML data, or the scheme could be malformed.  Note that any character that
>> resulted in %-encoding would be problematic - as %-encoding doesn't seem
>> to be allowed in the scheme production either!
>>
>> - If one were to use it for the authority one would be unable to specify
>> userinfo, which disallows "@" in order to disambiguate the @ separator
>> between the userinfo and the host.
>>
>> - If one were to use it for the port one would be restricted to digits
>> only (again no %-escaping accommodation).
>>
>> - Edge cases all the way down here.
>>
>> I am inclined to ignore this - if you're doing fine-grained templating of
>> parts prior to the path for some reason you just have to be careful.  I
>> don't think it's practical to try and flag every potential usage that can
>> result in a mal-formed URI.
>>
>>
>> The other half of the question then is whether any of the allowed
>> characters should be escaped even though they don't interfere with the
>> well-formedness of the path segment.
>>
>> The most relevant text is the last paragraph of section 3.3:
>>
>>    Aside from dot-segments in hierarchical paths, a path segment is
>>    considered opaque by the generic syntax.  URI producing applications
>>    often use the reserved characters allowed in a segment to delimit
>>    scheme-specific or dereference-handler-specific subcomponents.  For
>>    example, the semicolon (";") and equals ("=") reserved characters are
>>    often used to delimit parameters and parameter values applicable to
>>    that segment.  The comma (",") reserved character is often used for
>>    similar purposes.  For example, one URI producer might use a segment
>>    such as "name;v=1.1" to indicate a reference to version 1.1 of
>>    "name", whereas another might use a segment such as "name,1.1" to
>>    indicate the same.  Parameter types may be defined by scheme-specific
>>    semantics, but in most cases the syntax of a parameter is specific to
>>    the implementation of the URI's dereferencing algorithm.
>>
>> It is indeed true that a template like "name;v={version}" where version
>> contained ";" or "=" could be difficult to work with.  But since a path
>> segment is "considered opaque" by the generic syntax this level of
>> checking seems overkill.  And to the extent we restrict it we'd simply
>> force people to turn to raw mode to do things like "{segment}" where
>> segment is "name;v=1.1".
>>
>> I don't therefore see compelling advantage in restricting characters that
>> don't break the URI syntax.
>>
>> [1] http://www.ietf.org/rfc/rfc3986.txt
>>
>>
>> Jonathan Marsh - http://www.wso2.com -
>> http://auburnmarshes.spaces.live.com
>>
>>
>>     
>>> -----Original Message-----
>>> From: www-ws-desc-request@w3.org [mailto:www-ws-desc-request@w3.org] On
>>> Behalf Of Jonathan Marsh
>>> Sent: Thursday, February 22, 2007 2:01 PM
>>> To: 'Youenn Fablet'; 'keith chapman'
>>> Cc: 'www-ws-desc'
>>> Subject: RE: LocationTemplate-1G test
>>>
>>>
>>> Summarizing this thread, this morning's discussion, and the related
>>> issues:
>>>
>>> - [FIXED] * was improperly encoded in the baseline.
>>>
>>> - [QUESTION 1] The spec says what characters MUST be encoded, but there
>>> are also characters that MAY be encoded such as * (and pretty much any
>>> other character except %).  Our test suite assumes that only the
>>> characters that MUST be encoded are encoded.  Should we change this?
>>> (I think we should do this opportunistically, that is, if a testcase is
>>> proven to be correct, we simply add an alternative that matches that
>>> implementation's encoding strategy.  I don't think we have any failures
>>> because of this at present.)
>>>
>>> - [AGREED] Per the last paragraph of 6.8.1, referencing section 3.1 of
>>> RFC 3987, some further encoding is performed after the http location
>>> templates are resolved and combined with the {address} property.
>>>
>>> - [QUESTION 2] Is this sufficiently clear in the spec?  (I think so.)
>>>
>>> - [AGREED] Besides the extended characters encoded above, the spec says
>>> implementations SHOULD also encode "<", ">", '"', space, "{", "}", "|",
>>> "\", "^", and "`".  Our test suite will currently assume this SHOULD has
>>> been followed.
>>>
>>> - [FIXED] There are other editorial improvements such as removing the
>>> double negative, reordering bullets, removing query parameter separator
>>> from consideration before the "?".
>>>
>>> - [QUESTION 3] Are there additional editorial improvements possible?  (I
>>> think so, as reported in
>>> http://lists.w3.org/Archives/Public/www-ws-desc/2007Feb/0193.html).
>>>
>>> - [QUESTION 4] Is "&" a harmful character before the "?".  If not, we
>>> should add it to the excluded list.
>>>
>>> - [QUESTION 5] Are ";" and "=" harmful characters before the "?".  If
>>> so, we should remove them from the excluded list.
>>>
>>> I'll research proposals for 4 and 5 per my AI, but if there are any
>>> other questions I didn't capture here, let us know!
>>>
>>> Jonathan Marsh - http://www.wso2.com -
>>> http://auburnmarshes.spaces.live.com
>>>
>>>
>>>       
>
>
>
>
Received on Friday, 23 February 2007 10:10:52 UTC