- From: Jeremy Carroll <jjc@hpl.hp.com>
- Date: Fri, 27 Jan 2006 11:39:18 +0000
- To: michele vivoda <michelevivoda@hotmail.com>
- CC: uri@w3.org
I have been thinking about this too in the last week or two, and cannot
work out a decent API that captures the escaping/unescaping semantics.
My analysis is as follows:
1) The issue revolves around the reserved characters:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
2) Some of these reserved characters (usually the gen-delims) have
syntactic significance in the generic syntax. For these it is possible
to then give the unescaped form of terms from that generic syntax.
3) Under (2) we are mainly talking about components; however, in some
cases we are talking about subcomponents. For example, if a path
contains a segment which contains a "/" in unescaped form, then that "/"
must be %-encoded in the URL, and it is not possible to provide an API
that treats the path as an atomic component that can be presented in
both escaped and unescaped form, because the unescaped forms of
http://example.org/a/b/c/d
http://example.org/a%2Fb/c/d
http://example.org/a/b%2Fc/d
http://example.org/a%2Fb%2Fc/d
are all the same, yet the segments are different in each case.
It is possible to conceive of an API that talks about a path as an array
of strings, each being segments, in which each segment is presented in
either escaped or unescaped form.
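A minimal Python sketch of the segment-array idea in point 3. The helper
names path_segments and build_path are hypothetical and not part of any
existing URI API. It assumes a hierarchical URI whose path starts with "/",
and it escapes every reserved character inside a segment, which is more
than strictly necessary.

```python
from urllib.parse import quote, unquote, urlsplit

def path_segments(uri):
    """Split the still-escaped path on literal '/', then unescape each segment."""
    raw_path = urlsplit(uri).path            # urlsplit leaves %-escapes untouched
    return [unquote(seg) for seg in raw_path.split("/")[1:]]

def build_path(segments):
    """Escape each segment (including any '/' inside it) and join with literal '/'."""
    return "".join("/" + quote(seg, safe="") for seg in segments)

uris = [
    "http://example.org/a/b/c/d",
    "http://example.org/a%2Fb/c/d",
    "http://example.org/a/b%2Fc/d",
    "http://example.org/a%2Fb%2Fc/d",
]

# Unescaping the path as one atomic string collapses all four URIs together.
whole_path_unescaped = {unquote(urlsplit(u).path) for u in uris}

# Unescaping per segment keeps them distinct.
per_segment = [path_segments(u) for u in uris]
```

Here whole_path_unescaped holds a single string, while per_segment holds
four different lists, and build_path(["a/b", "c", "d"]) gives back
"/a%2Fb/c/d".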
4) The sub-delims are used for both scheme-specific and
application-specific semantics. So for instance, the ftp scheme reserves ';' in a
path. So in this case we would be best served by an API that explicitly
supported that, and splits the path on ';' and (re)uses a generic path
API for the part before a syntactically significant ';' and then perhaps
has a name=value API for the part after the ';'.
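As a rough Python sketch of point 4, assuming an ftp URL with at most one
syntactically significant ';' followed by a single name=value parameter
(such as ";type=a"). The function name split_ftp_path is hypothetical, and
the example URL is made up for illustration.

```python
from urllib.parse import unquote, urlsplit

def split_ftp_path(uri):
    """Return (unescaped path segments, parameters after the first literal ';')."""
    raw_path = urlsplit(uri).path                  # still %-escaped, ';' kept in the path
    path_part, sep, param = raw_path.partition(";")
    # Generic path handling for the part before the ';'.
    segments = [unquote(seg) for seg in path_part.split("/")[1:]]
    # name=value handling for the part after the ';', if there is one.
    name, _, value = param.partition("=")
    params = {unquote(name): unquote(value)} if sep else {}
    return segments, params

# A ';' that is data is written %3B; the ';' before "type" is syntax.
segments, params = split_ftp_path("ftp://example.org/pub/a%3Bb.txt;type=a")
```

The split on ';' happens before any unescaping, so the escaped %3B inside
the file name stays part of the segment "a;b.txt".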
5) The query string is left as totally generic in the HTTP spec, but is
often used, as in your example, with a value that follows the HTML form
behaviour of a sequence of name=value pairs.
6) Perhaps the starting point is to split a URL into a sequence of
pairs, each pair consisting of a string of syntactically significant
reserved characters, and a string of (unescaped) characters.
e.g.
http://example.org/a/b%2Fc/d
==> "" "http"
"://" "example.org"
"/" "a"
"/" "b/c"
"/" "d"
If we %-escape any reserved character from the second column then we
should be able to construct a correct URI.
However, this representation is not very useful, because it does not
reflect the semantic grouping into components. Also we will
unnecessarily %-escape many reserved characters that are not
syntactically significant in that context.
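The pair representation in point 6 could be sketched in Python as follows.
The names to_pairs and from_pairs are hypothetical, and the sketch only
handles URIs of the form scheme://authority/segments, with no port,
userinfo, query or fragment.

```python
from urllib.parse import quote, unquote

def to_pairs(uri):
    """Split a URI into (significant delimiter, unescaped text) pairs."""
    scheme, rest = uri.split("://", 1)
    authority, *segments = rest.split("/")   # split on literal '/' before unescaping
    pairs = [("", scheme), ("://", authority)]
    pairs += [("/", unquote(seg)) for seg in segments]
    return pairs

def from_pairs(pairs):
    """Rebuild the URI, %-escaping every reserved character in the text column."""
    return "".join(delim + quote(text, safe="") for delim, text in pairs)

pairs = to_pairs("http://example.org/a/b%2Fc/d")
rebuilt = from_pairs(pairs)

# Over-escaping: ';' and '=' have no significance here but are escaped anyway.
over_escaped = from_pairs([("", "http"), ("://", "example.org"), ("/", "a;b=c")])
```

The round trip gives back the original URI, while over_escaped comes out
as "http://example.org/a%3Bb%3Dc", which illustrates the unnecessary
escaping mentioned above.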
Another issue here is that within any of these components there may be
an embedded URL, which may itself have some %-escapes, which should in
turn be %-escaped!
e.g. modifying your example:
http://a/b?p1=R%26D&p2=q
If the query values:
p1 R&D
p2 q
have a third value
p3 http://a/b?p1=R%26D&p2=q
then the correct URL may be
http://a/b?p1=R%26D&p2=q&p3=http://a/b?p1%3DR%2526D%26p2%3Dq
Where the %2526 represents an & doubly encoded.
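The nesting can be checked with a short Python sketch. The helpers
build_query and parse_query are hypothetical. The sketch assumes that
':', '/' and '?' are left unescaped inside the query (which the generic
syntax permits), and that '=', '&' and '%' in names and values are always
escaped.

```python
from urllib.parse import quote, unquote

def build_query(params):
    """Escape each name and value once, then join with the syntactic '=' and '&'."""
    return "&".join(
        quote(name, safe=":/?") + "=" + quote(value, safe=":/?")
        for name, value in params
    )

def parse_query(query):
    """Split on the syntactic '&' and '=' first, then unescape each part once."""
    return [
        tuple(unquote(part) for part in pair.split("=", 1))
        for pair in query.split("&")
    ]

inner = "http://a/b?" + build_query([("p1", "R&D"), ("p2", "q")])
outer = "http://a/b?" + build_query([("p1", "R&D"), ("p2", "q"), ("p3", inner)])

# One level of parsing recovers the inner URL with its own escapes intact.
recovered = dict(parse_query(outer.split("?", 1)[1]))
```

With these assumptions outer is exactly the URL shown above, and
recovered["p3"] is the inner URL http://a/b?p1=R%26D&p2=q, still carrying
its single level of escaping.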
Perhaps the API design should have methods such as
String[][] URI.getQuery(String regex)
returning an array of pairs of Strings as above, where the regex may be
something like "([^=]*=[^&]*&)*([^=]*=[^&]*)" and is used to know which
terms should be escaped/unescaped. At least in this case, the same regex
and an array of just the names and values could be used to construct the
query part correctly, with the regex being used to insert the syntactic
& and =.
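A hedged Python approximation of such a method, not the proposed API
itself. Python's re module keeps only the last capture of a repeated
group, so this sketch uses the whole-query regex only as a shape check and
a simpler per-pair pattern to extract names and values. The names
get_query, set_query, WHOLE_QUERY and PAIR are hypothetical, and set_query
hardcodes '&' and '=' rather than deriving them from the regex.

```python
import re
from urllib.parse import quote, unquote

# Shape of the whole query: name=value pairs separated by '&'.
WHOLE_QUERY = re.compile(r"([^=]*=[^&]*&)*([^=]*=[^&]*)")
# One name=value pair; the groups mark the terms to be unescaped.
PAIR = re.compile(r"([^=&]*)=([^&]*)")

def get_query(query, pair_regex=PAIR):
    """Return [[name, value], ...] with each term unescaped exactly once."""
    if not WHOLE_QUERY.fullmatch(query):
        raise ValueError("query is not a sequence of name=value pairs")
    return [[unquote(m.group(1)), unquote(m.group(2))]
            for m in pair_regex.finditer(query)]

def set_query(pairs):
    """Escape each term and re-insert the syntactic '=' and '&'."""
    return "&".join(quote(n, safe="") + "=" + quote(v, safe="") for n, v in pairs)

pairs = get_query("p1=R%26D&p2=q")
query = set_query(pairs)
```

Because the split on '&' and '=' happens on the escaped form, the value
"R&D" survives as a single parameter and the query is rebuilt unchanged.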
Jeremy
michele vivoda wrote:
>
> Hi all,
> I have a question about URIs.
>
> I was wondering whether it is correct to think that a URI can be
> decomposed into components that can be stored in unescaped form while
> maintaining the URI semantics, so that the (same or an equivalent) URI
> can be reconstructed from those components.
>
> Perhaps better said, the question is: can we always build a URI from
> unescaped components?
>
> Many programming APIs offer the possibility to build a URI from
> unescaped components.
> In 99% of cases this has worked well for me. But considering the
> following URI:
>
> http://a/b?p1=R%26D&p2=q
>
> the unescaped query component, originally containing 2 parameters, becomes:
>
> p1=R&D&p2=q
>
> losing its meaning, since now we have 3 parameters.
> My conclusion is that (at least) the query component cannot be unescaped.
> Is this right? Does it apply only to the query, or should unescaped
> components not exist at all?
>
> Regards
> Michele Vivoda
>
Received on Friday, 27 January 2006 11:42:21 UTC