- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- Date: Wed, 13 Aug 2008 02:56:20 +0200
- To: www-international@w3.org
<janssen@parc.xerox.com> wrote:

> we're looking at the behavior of urllib.unquote(), which is
> a function to take a string containing percent-escaped octets,
> unescape it, and return the result.

Nice rathole. Assuming every % introduces two hex digits, what you end up with is clearly a string of octets. If it ever was a URI, the output is likely not a URI. So far no surprise: URIs are a proper subset of ASCII with a strict syntax, and non-ASCII octets immediately break that.

The output might be an IRI. Also no surprise: all valid IRIs can be transformed into URIs, and where that happened it can be inverted. For a UTF-8 IRI you'd expect the "random octets" to boil down to valid UTF-8. You could verify that. Of course UTF-8 IRIs are always UTF-8 strings (the opposite is not true).

But it starts to get very bad when you have a perfect URI and end up with a garbage octet string that is anything but a valid IRI after percent-decoding the input URI. It can get worse: a perfectly valid URI, decoded, resulting in a *syntactically* valid IRI or URI, where the decoded form simply does not work (404 is a harmless case).

In other words, if you have no compelling reason to "decode" a URI, leave it alone. Only the server - assuming it is an http URI - knows what, if anything, should be decoded, and how often. There can be %25xx double encodings. I've not yet seen any "treble encoding", but double encodings exist. And double encodings where you MUST NOT decode twice also exist, same idea as single encodings where you MUST NOT decode once.

Frank
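The points above can be sketched in Python 3, where urllib.parse.unquote_to_bytes plays the role of the old urllib.unquote and returns the raw octets; decode_and_check is a hypothetical helper for this illustration, not part of any library:

```python
from urllib.parse import unquote_to_bytes

def decode_and_check(uri):
    """Percent-decode a URI and return the result as text only if the
    raw octets are valid UTF-8 (i.e. it could be a UTF-8 IRI)."""
    octets = unquote_to_bytes(uri)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None  # garbage octets: not a UTF-8 IRI

# A URI produced from a UTF-8 IRI decodes back to valid UTF-8:
print(decode_and_check("http://example.org/%C3%A9"))  # http://example.org/é

# A perfectly valid URI can still decode to octets that are no IRI at all:
print(decode_and_check("http://example.org/%FF"))     # None

# %25xx double encoding: decoding once yields "%20"; decoding a second
# time yields a space, which may well not be what the server meant.
once = unquote_to_bytes("%2520")          # b'%20'
twice = unquote_to_bytes(once.decode())   # b' '
print(once, twice)
```

This only tests whether the octets happen to be well-formed UTF-8; as noted above, even a syntactically valid decoded form may not work, so the check is no substitute for leaving the URI alone.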
Received on Wednesday, 13 August 2008 01:16:06 UTC