are there lots of URLs that have non-UTF-8 percent-encoded octets in them?

Hi!

What proportion of URLs that actually travel over the Web contain
non-UTF-8 octets, percent-encoded?  Anyone have stats on that?

The Python community is re-working the Python standard library API for
the new major release of Python 3.  One of the things that is changing
is that there will no longer be automatic coercion between sequences
of bytes and Unicode strings.

With this, we're looking at the behavior of urllib.unquote(), a
function that takes a string containing percent-escaped octets,
unescapes them, and returns the result.  The question is, what should the
type of the result be?  One faction claims that non-UTF-8
percent-escaped octets are now relatively rare in URLs, so unquote
should, by default, (1) decode the escapes into a sequence of bytes,
(2) interpret those bytes as a UTF-8-encoded string, and (3) return
that string.  This would require little change to old naive code that
assumed the Python 2 byte sequence was a string, but would break in
the (presumably unlikely) event that the URL contained non-UTF-8
octets.  The other faction claims that there's no way to assume UTF-8:
a sizable proportion of the URLs that have to be handled are, or will
be, formed with non-UTF-8 octets in them (urlencoded form submissions
from an ISO-8859-1 page, for instance), so the default behavior for
unquote should be to return a sequence of bytes, deliberately breaking
existing naive code that assumes a string so that it can be fixed.
We'd like to know what the data says.
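To make the two proposals concrete, here is a minimal sketch of what
each faction's default would do (the function names are hypothetical
illustrations, not the actual urllib API; input is assumed to be a
URL-legal ASCII string):

```python
def unquote_to_bytes(s):
    # Faction 2's default: decode %XX escapes into raw bytes,
    # making no assumption about the character encoding.
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == "%" and i + 3 <= len(s):
            try:
                out.append(int(s[i + 1:i + 3], 16))
                i += 3
                continue
            except ValueError:
                pass  # not a valid escape; fall through
        out.extend(s[i].encode("ascii"))
        i += 1
    return bytes(out)

def unquote_to_str(s):
    # Faction 1's default: assume the decoded octets are UTF-8
    # and return a string.
    return unquote_to_bytes(s).decode("utf-8")

# "caf%C3%A9" carries UTF-8 octets, so both behaviors work:
unquote_to_bytes("caf%C3%A9")   # b'caf\xc3\xa9'
unquote_to_str("caf%C3%A9")     # 'café'

# "%E9" is ISO-8859-1 for 'é'; the bytes version returns b'\xe9',
# but the UTF-8 assumption raises UnicodeDecodeError.
unquote_to_bytes("%E9")         # b'\xe9'
```

The dispute, in other words, is over which of these two functions the
name unquote should point at by default, given how often the last case
actually occurs in the wild.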

Bill

Received on Tuesday, 12 August 2008 16:20:34 UTC