Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them?

<janssen@parc.xerox.com> wrote:

> we're looking at the behavior of urllib.unquote(), which is
> a function to take a string containing percent-escaped octets,
> unescape it, and return the result.

Nice rathole.  Assuming every % introduces two hex digits, what
you end up with is clearly a string of octets.
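
For instance (a minimal sketch using Python 2's urllib as in the
question; the input string is made up):

    import urllib
    # Percent-decoding is purely byte-level: each %XX becomes one octet,
    # with no statement about which character encoding produced it.
    octets = urllib.unquote('%C3%A9%FF')
    print(repr(octets))   # '\xc3\xa9\xff' -- just octets, nothing more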

If the input ever was a URI, the output is likely not a URI.  So far
no surprise: URIs use a proper subset of ASCII with a strict
syntax, and non-ASCII octets immediately kill that.
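
A sketch of that effect (the URI here is hypothetical):

    import urllib
    uri = 'http://example.org/caf%C3%A9'     # a perfectly valid URI
    print(repr(urllib.unquote(uri)))
    # 'http://example.org/caf\xc3\xa9' -- contains non-ASCII octets,
    # so whatever this is now, it is no longer a URI.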

The output might be an IRI.  Also no surprise: all valid IRIs
can be transformed into URIs, and where that happened the
transformation can be inverted.
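
Roughly like this (a sketch only; the IRI is made up, and the
IRI-to-URI step is approximated with urllib.quote over UTF-8):

    # -*- coding: utf-8 -*-
    import urllib
    iri = u'http://example.org/caf\u00e9'               # an IRI, non-ASCII allowed
    uri = urllib.quote(iri.encode('utf-8'), safe=':/')  # IRI -> URI (RFC 3987 style)
    print(uri)                                          # http://example.org/caf%C3%A9
    back = urllib.unquote(uri).decode('utf-8')          # ... and back again
    print(back == iri)                                  # True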

For a UTF-8 IRI you'd expect that the "random octets" boil
down to valid UTF-8.  You could verify that.  Of course UTF-8
IRIs are always UTF-8 strings (the opposite is not true).
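
Such a check could look like this (a sketch; the helper name and
the two test strings are mine):

    import urllib

    def looks_like_utf8(octets):
        # Do the decoded octets form valid UTF-8?
        try:
            octets.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8(urllib.unquote('caf%C3%A9')))  # True  -- plausibly UTF-8
    print(looks_like_utf8(urllib.unquote('caf%E9')))     # False -- e.g. Latin-1 escaping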

But it starts to get very bad when you have a perfectly valid
URI and, after percent-decoding it, end up with a garbage octet
string that is anything but a valid IRI.
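
For example (again a made-up URI):

    import urllib
    uri = 'http://example.org/%FF%FE'        # syntactically a fine URI
    print(repr(urllib.unquote(uri)))         # 'http://example.org/\xff\xfe'
    # \xff\xfe is not valid UTF-8, and raw non-ASCII octets are not
    # legal in an IRI either, so the decoded form is neither URI nor IRI.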

It can get worse: a perfectly valid URI, decoded, resulting
in a *syntactically* valid IRI or URI, where the decoded form
simply does not work (a 404 is a harmless case).

In other words, if you have no compelling reason to "decode"
a URI, leave it alone.  Only the server - assuming it is an
http URI - knows what, if anything, should be decoded, and how
often.  There can be %25xx double encodings.  I've not yet
seen any "treble encoding", but double encodings exist.  And
double encodings where you MUST NOT decode twice also exist,
same idea as single encodings where you MUST NOT decode once.
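
The %25xx case, sketched (the payload string is invented):

    import urllib
    once  = urllib.quote('a b', safe='')          # 'a%20b'
    twice = urllib.quote(once, safe='')           # 'a%2520b' -- the %25xx pattern
    print(urllib.unquote(twice))                  # 'a%20b'   -- decode once: fine
    print(urllib.unquote(urllib.unquote(twice)))  # 'a b'     -- decoded once too often
    # If the resource really is named 'a%20b', the second decode has
    # silently changed the meaning -- hence MUST NOT decode twice.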

 Frank

Received on Wednesday, 13 August 2008 01:16:06 UTC