RE: [QUESTION 5] Are ";" and "=" harmful characters before the "?" (was: RE: LocationTemplate-1G test) from Jonathan Marsh on 2007-02-23 (www-ws-desc@w3.org from February 2007)

From: Jonathan Marsh <jonathan@wso2.com>
Date: Thu, 22 Feb 2007 16:44:03 -0800
To: "'Jonathan Marsh'" <jonathan@wso2.com>, "'Youenn Fablet'" <youenn.fablet@crf.canon.fr>, "'keith chapman'" <keithgchapman@gmail.com>
Cc: "'www-ws-desc'" <www-ws-desc@w3.org>
Message-ID: <025101c756e3$b883d180$1501010a@DELLICIOUS>
BTW, this includes an answer to QUESTION 4, and is recorded as issue CR157
[1].

[1] http://www.w3.org/2002/ws/desc/5/cr-issues/#CR157.

Jonathan Marsh - http://www.wso2.com - http://auburnmarshes.spaces.live.com
 

> -----Original Message-----
> From: Jonathan Marsh [mailto:jonathan@wso2.com]
> Sent: Thursday, February 22, 2007 4:33 PM
> To: 'Jonathan Marsh'; 'Youenn Fablet'; 'keith chapman'
> Cc: 'www-ws-desc'
> Subject: [QUESTION 5] Are ";" and "=" harmful characters before the "?"
> (was: RE: LocationTemplate-1G test)
> 
> Summary:
>  - Add "&" to the pre-? encoding rule exclusion set.
>  - There are lots of esoteric ways to abuse templates to create
>    malformed URIs.  I think we should avoid that slippery slope.
> 
> Analysis:
> 
> Looking again at RFC 3986 [1], a path segment is defined as:
> 
>    segment       = *pchar
>    segment-nz    = 1*pchar
>    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
>                  ; non-zero-length segment without any colon ":"
> 
>    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
> 
> 
>    pct-encoded   = "%" HEXDIG HEXDIG
> 
>    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
> 
>    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>                  / "*" / "+" / "," / ";" / "="
> 
> 
> That differs from the spec that we have only in that "&" is missing in the
> spec.  I think this is an omission, and that "&" should therefore be added
> into the pre-"?" encoding list.  That takes care of making sure any
> character disallowed in a path by the above BNF is properly escaped.
> Certain forms (path-noscheme) restrict a colon, but I don't believe that
> generates an error, just changes the form.
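
As a rough illustration of the rule proposed above, here is a minimal Python sketch. It assumes the pre-"?" exclusion set is exactly the literal characters allowed by the pchar production (unreserved, sub-delims including "&", ":" and "@"), and that everything else in a substituted value is %-encoded as UTF-8. The name encode_path_value is hypothetical and is not taken from the spec or from any implementation.

```python
# Sketch only: assumes the pre-"?" exclusion set equals the literal
# characters of RFC 3986 pchar. The function name is hypothetical.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
SUB_DELIMS = set("!$&'()*+,;=")  # includes "&", the proposed addition
EXCLUDED = UNRESERVED | SUB_DELIMS | {":", "@"}


def encode_path_value(value: str) -> str:
    """Percent-encode every character that is not in the exclusion set."""
    out = []
    for ch in value:
        if ch in EXCLUDED:
            out.append(ch)
        else:
            # "%" itself is not excluded, so it is always escaped here.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)
```

With this set, a value such as "a&b" or "name;v=1.1" passes through unchanged, while "/", space and "%" are escaped.
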
> 
> There are other possibilities for templates than just path segments
> though:
> 
> 
> - If one were to use it for the scheme one would have to be careful not
> to have characters other than ALPHA / DIGIT / "+" / "-" / "." appear in
> the XML data, or the scheme could be malformed.  Note that any character
> that resulted in %-encoding would be problematic - as %-encoding doesn't
> seem to be allowed in the scheme production either!
> 
> - If one were to use it for the authority one would be unable to specify
> userinfo, which disallows "@" in order to disambiguate the @ separator
> between the userinfo and the host.
> 
> - If one were to use it for the port one would be restricted to digits
> only (again no %-escaping accommodation).
> 
> - Edge cases all the way down here.
> 
> I am inclined to ignore this - if you're doing fine-grained templating of
> parts prior to the path for some reason you just have to be careful.  I
> don't think it's practical to try and flag every potential usage that can
> result in a malformed URI.
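
For anyone who does template the parts before the path, the "being careful" can be as small as a check against the RFC 3986 productions. A minimal sketch in Python, assuming the scheme and port productions from RFC 3986 (scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), port = *DIGIT); the function names are hypothetical and nothing here is required by the spec.

```python
import re

# Sketch only: these patterns follow the RFC 3986 scheme and port
# productions. Neither production admits "%", so a value that would
# need %-encoding can never be valid in these positions.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")
PORT_RE = re.compile(r"[0-9]*")


def is_valid_scheme(value: str) -> bool:
    """True if the whole value matches the scheme production."""
    return SCHEME_RE.fullmatch(value) is not None


def is_valid_port(value: str) -> bool:
    """True if the whole value matches the port production (may be empty)."""
    return PORT_RE.fullmatch(value) is not None
```

A caller could run such a check on the substituted value and decide for itself what to do on failure.
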
> 
> 
> The other half of the question then is whether any of the allowed
> characters should be escaped even though they don't interfere with the
> well-formedness of the path segment.
> 
> The most relevant text is the last paragraph of section 3.3:
> 
>    Aside from dot-segments in hierarchical paths, a path segment is
>    considered opaque by the generic syntax.  URI producing applications
>    often use the reserved characters allowed in a segment to delimit
>    scheme-specific or dereference-handler-specific subcomponents.  For
>    example, the semicolon (";") and equals ("=") reserved characters are
>    often used to delimit parameters and parameter values applicable to
>    that segment.  The comma (",") reserved character is often used for
>    similar purposes.  For example, one URI producer might use a segment
>    such as "name;v=1.1" to indicate a reference to version 1.1 of
>    "name", whereas another might use a segment such as "name,1.1" to
>    indicate the same.  Parameter types may be defined by scheme-specific
>    semantics, but in most cases the syntax of a parameter is specific to
>    the implementation of the URI's dereferencing algorithm.
> 
> It is indeed true that a template like "name;v={version}" where version
> contained ";" or "=" could be difficult to work with.  But since a path
> segment is "considered opaque" by the generic syntax this level of
> checking seems overkill.  And to the extent we restrict it we'd simply
> force people to turn to raw mode to do things like "{segment}" where
> segment is "name;v=1.1".
> 
> I don't therefore see compelling advantage in restricting characters that
> don't break the URI syntax.
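
To show what "difficult to work with" means for the "name;v={version}" case, here is a small Python sketch. It assumes a deliberately naive template expander that leaves ";" and "=" unescaped (as the exclusion set allows) and a receiver that splits a segment on ";" and "=" in the way RFC 3986 section 3.3 describes as common practice. Both function names are hypothetical.

```python
# Sketch only: a naive expander plus one possible receiver-side parser.
# Values are assumed to contain only characters from the exclusion set,
# so no %-encoding is performed here.
def expand(template: str, values: dict) -> str:
    """Substitute each {name} in the template with its value as-is."""
    out = template
    for name, value in values.items():
        out = out.replace("{" + name + "}", value)
    return out


def parse_segment_params(segment: str):
    """Split "name;k=v;..." into the name and a list of parameter tuples."""
    name, *params = segment.split(";")
    return name, [tuple(p.split("=", 1)) for p in params]


plain = expand("name;v={version}", {"version": "1.1"})
tricky = expand("name;v={version}", {"version": "1.1;x=2"})
```

Here parse_segment_params(plain) yields one parameter, v = 1.1, while parse_segment_params(tricky) yields two, v = 1.1 and x = 2. The result is still a well-formed URI; the extra parameter only matters to a receiver that assigns meaning to ";" and "=".
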
> 
> [1] http://www.ietf.org/rfc/rfc3986.txt
> 
> 
> Jonathan Marsh - http://www.wso2.com -
> http://auburnmarshes.spaces.live.com
> 
> 
> > -----Original Message-----
> > From: www-ws-desc-request@w3.org [mailto:www-ws-desc-request@w3.org] On
> > Behalf Of Jonathan Marsh
> > Sent: Thursday, February 22, 2007 2:01 PM
> > To: 'Youenn Fablet'; 'keith chapman'
> > Cc: 'www-ws-desc'
> > Subject: RE: LocationTemplate-1G test
> >
> >
> > Summarizing this thread, this morning's discussion, and the related
> > issues:
> >
> > - [FIXED] * was improperly encoded in the baseline.
> >
> > - [QUESTION 1] The spec says what characters MUST be encoded, but there
> > are also characters that MAY be encoded such as * (and pretty much any
> > other character except %).  Our test suite assumes that only the
> > characters that MUST be encoded are.  Should we change this?  (I think
> > we should do this opportunistically, that is, if a testcase is proven
> > to be correct, we simply add an alternative that matches that
> > implementation's encoding strategy.  I don't think we have any failures
> > because of this at present.)
> >
> > - [AGREED] Per the last paragraph of 6.8.1, referencing section 3.1 of
> > RFC 3987, some further encoding is performed after the http location
> > templates are resolved and combined with the {address} property.
> >
> > - [QUESTION 2] Is this sufficiently clear in the spec?  (I think so.)
> >
> > - [AGREED] Besides the extended characters encoded above, the spec says
> > implementations SHOULD also encode "<", ">", '"', space, "{", "}", "|",
> > "\", "^", and "`".  Our test suite will currently assume this SHOULD has
> > been followed.
> >
> > - [FIXED] There were other editorial improvements, such as removing the
> > double negative, reordering bullets, and removing the query parameter
> > separator from consideration before the "?".
> >
> > - [QUESTION 3] Are there additional editorial improvements possible?  (I
> > think so, as reported in
> > http://lists.w3.org/Archives/Public/www-ws-desc/2007Feb/0193.html).
> >
> > - [QUESTION 4] Is "&" a harmful character before the "?"?  If not, we
> > should add it to the excluded list.
> >
> > - [QUESTION 5] Are ";" and "=" harmful characters before the "?"?  If
> > so, we should remove them from the excluded list.
> >
> > I'll research proposals for 4 and 5 per my AI, but if there are any
> > other questions I didn't capture here, let us know!
> >
> > Jonathan Marsh - http://www.wso2.com -
> > http://auburnmarshes.spaces.live.com
> >
> >
Received on Friday, 23 February 2007 00:44:01 UTC