[QUESTION 5] Are ";" and "=" harmful characters before the "?" (was: RE: LocationTemplate-1G test) from Jonathan Marsh on 2007-02-23 (www-ws-desc@w3.org from February 2007)

From: Jonathan Marsh <jonathan@wso2.com>
Date: Thu, 22 Feb 2007 16:33:21 -0800
To: "'Jonathan Marsh'" <jonathan@wso2.com>, "'Youenn Fablet'" <youenn.fablet@crf.canon.fr>, "'keith chapman'" <keithgchapman@gmail.com>
Cc: "'www-ws-desc'" <www-ws-desc@w3.org>
Message-ID: <024601c756e2$3a4f1f50$1501010a@DELLICIOUS>

Summary:
 - Add "&" to the pre-? encoding rule exclusion set.
 - There are lots of esoteric ways to abuse templates to create 
   malformed URIs.  I think we should avoid that slippery slope.

Analysis:

Looking again at RFC 3986 [1], a path segment is defined as:

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"


   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="


That differs from the list in our spec only in that "&" is missing from
ours.  I think this is an omission, and that "&" should therefore be added
to the pre-"?" encoding exclusion list.  That ensures that any character
disallowed in a path by the above BNF is properly escaped.  Certain forms
(path-noscheme) disallow a colon in the first segment, but I don't believe
that generates an error; it just changes the form.
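
As a rough illustration of the rule above (not text from our spec), here is
a minimal Python sketch that escapes everything outside the pchar
production.  The name encode_for_path is hypothetical, and the sketch
assumes "&" has been added to the exclusion set as proposed.

```python
import string

# Characters RFC 3986 allows literally in a path segment (pchar).
# "%" is deliberately absent: it is only legal as part of pct-encoded,
# so a literal "%" in the data must itself be escaped.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
SUB_DELIMS = "!$&'()*+,;="
PCHAR_LITERALS = set(UNRESERVED + SUB_DELIMS + ":@")


def encode_for_path(value: str) -> str:
    """Percent-encode every UTF-8 byte that is not a literal pchar.

    Hypothetical helper: an assumption about how the pre-"?" rule
    could be implemented, not the normative algorithm.
    """
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x80 and ch in PCHAR_LITERALS:
            out.append(ch)  # allowed as-is in a path segment
        else:
            out.append(f"%{byte:02X}")  # everything else is escaped
    return "".join(out)


print(encode_for_path("a b/c&d"))     # a%20b%2Fc&d
print(encode_for_path("name;v=1.1"))  # name;v=1.1
```

Note that "/" and "?" are escaped because neither is a pchar, while "&",
";" and "=" pass through untouched.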

There are other possibilities for templates than just path segments though:


- If one were to use it for the scheme one would have to be careful not to
have characters other than ALPHA / DIGIT / "+" / "-" / "." appear in the XML
data, or the scheme could be malformed.  Note that any character that
resulted in %-encoding would be problematic - as %-encoding doesn't seem to
be allowed in the scheme production either!

- If one were to use it for the authority one would be unable to specify
userinfo, which disallows "@" in order to disambiguate the @ separator
between the userinfo and the host.

- If one were to use it for the port one would be restricted to digits only
(again no %-escaping accommodation).

- Edge cases all the way down here.

I am inclined to ignore this - if you're doing fine-grained templating of
parts prior to the path for some reason you just have to be careful.  I
don't think it's practical to try to flag every potential usage that can
result in a malformed URI.
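
To make the pre-path restrictions listed above concrete, here is a small
Python sketch that checks a candidate value against the RFC 3986 scheme,
userinfo and port productions.  The pattern names and the fits() helper
are hypothetical; nothing like this is required by our spec.

```python
import re

# scheme   = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# port     = *DIGIT
PORT_RE = re.compile(r"[0-9]*")

# userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
USERINFO_RE = re.compile(
    r"(?:[A-Za-z0-9._~!$&'()*+,;=:-]|%[0-9A-Fa-f]{2})*"
)


def fits(pattern: re.Pattern, value: str) -> bool:
    """True if the whole value matches the given production."""
    return pattern.fullmatch(value) is not None


print(fits(SCHEME_RE, "http"))         # True
print(fits(SCHEME_RE, "my%20scheme"))  # False: no %-encoding in a scheme
print(fits(PORT_RE, "80a"))            # False: digits only
print(fits(USERINFO_RE, "us@er"))      # False: "@" must be escaped
print(fits(USERINFO_RE, "us%40er"))    # True
```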


The other half of the question then is whether any of the allowed characters
should be escaped even though they don't interfere with the well-formedness
of the path segment.

The most relevant text is the last paragraph of section 3.3:

   Aside from dot-segments in hierarchical paths, a path segment is
   considered opaque by the generic syntax.  URI producing applications
   often use the reserved characters allowed in a segment to delimit
   scheme-specific or dereference-handler-specific subcomponents.  For
   example, the semicolon (";") and equals ("=") reserved characters are
   often used to delimit parameters and parameter values applicable to
   that segment.  The comma (",") reserved character is often used for
   similar purposes.  For example, one URI producer might use a segment
   such as "name;v=1.1" to indicate a reference to version 1.1 of
   "name", whereas another might use a segment such as "name,1.1" to
   indicate the same.  Parameter types may be defined by scheme-specific
   semantics, but in most cases the syntax of a parameter is specific to
   the implementation of the URI's dereferencing algorithm.

It is indeed true that a template like "name;v={version}" where version
contained ";" or "=" could be difficult to work with.  But since a path
segment is "considered opaque" by the generic syntax this level of checking
seems overkill.  And to the extent we restrict it we'd simply force people
to turn to raw mode to do things like "{segment}" where segment is
"name;v=1.1".
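
A short Python sketch of the situation just described, for illustration
only.  The expand() and split_params() helpers are hypothetical, and the
";"/"=" parameter convention is just the one from the RFC's example, not
something our spec defines.

```python
def expand(template: str, values: dict) -> str:
    """Naive template expansion: substitute each {name} with its value.

    No escaping is applied to ";" or "=", since both are legal pchars.
    """
    for name, value in values.items():
        template = template.replace("{" + name + "}", value)
    return template


def split_params(segment: str):
    """Split a segment the way a "name;key=value" consumer might."""
    name, *params = segment.split(";")
    return name, [tuple(p.split("=", 1)) for p in params]


# The intended case: one parameter.
ok = expand("name;v={version}", {"version": "1.1"})
print(split_params(ok))   # ('name', [('v', '1.1')])

# Data containing ";" and "=" yields a well-formed segment, but the
# consumer now sees a second parameter the template author never wrote.
odd = expand("name;v={version}", {"version": "1.1;draft=yes"})
print(split_params(odd))  # ('name', [('v', '1.1'), ('draft', 'yes')])

# The whole segment supplied as data, which escaping would prevent.
print(expand("{segment}", {"segment": "name;v=1.1"}))  # name;v=1.1
```

Both results are syntactically valid path segments; the difference only
matters to whatever interprets the parameters.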

I therefore don't see a compelling advantage in restricting characters
that don't break the URI syntax.

[1] http://www.ietf.org/rfc/rfc3986.txt


Jonathan Marsh - http://www.wso2.com - http://auburnmarshes.spaces.live.com
 

> -----Original Message-----
> From: www-ws-desc-request@w3.org [mailto:www-ws-desc-request@w3.org] On
> Behalf Of Jonathan Marsh
> Sent: Thursday, February 22, 2007 2:01 PM
> To: 'Youenn Fablet'; 'keith chapman'
> Cc: 'www-ws-desc'
> Subject: RE: LocationTemplate-1G test
> 
> 
> Summarizing this thread, this morning's discussion, and the related
> issues:
> 
> - [FIXED] * was improperly encoded in the baseline.
> 
> - [QUESTION 1] The spec says what characters MUST be encoded, but there
> are also characters that MAY be encoded such as * (and pretty much any
> other character except %).  Our test suite assumes only the characters
> that MUST be are.  Should we change this?  (I think we should do this
> opportunistically, that is, if a testcase is proven to be correct, we
> simply add an alternative that matches that implementation's encoding
> strategy.  I don't think we have any failures because of this at
> present.)
> 
> - [AGREED] Per the last paragraph of 6.8.1, referencing section 3.1 of RFC
> 3987, some further encoding is performed after the http location templates
> are resolved and combined with the {address} property.
> 
> - [QUESTION 2] Is this sufficiently clear in the spec?  (I think so.)
> 
> - [AGREED] Besides the extended characters encoded above, the spec says
> implementations SHOULD also encode "<", ">", '"', space, "{", "}", "|",
> "\", "^", and "`".  Our test suite will currently assume this SHOULD has
> been followed.
> 
> - [FIXED] There were other editorial improvements such as removing the
> double negative, reordering bullets, and removing the query parameter
> separator from consideration before the "?".
> 
> - [QUESTION 3] Are there additional editorial improvements possible?  (I
> think so, as reported in
> http://lists.w3.org/Archives/Public/www-ws-desc/2007Feb/0193.html).
> 
> - [QUESTION 4] Is "&" a harmful character before the "?".  If not, we
> should add it to the excluded list.
> 
> - [QUESTION 5] Are ";" and "=" harmful characters before the "?".  If so,
> we should remove them from the excluded list.
> 
> I'll research proposals for 4 and 5 per my AI, but if there are any other
> questions I didn't capture here, let us know!
> 
> Jonathan Marsh - http://www.wso2.com -
> http://auburnmarshes.spaces.live.com
> 
>
Received on Friday, 23 February 2007 00:33:25 UTC