Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them?

Hi Bill,

URLs "travel over the Web" in a number of different directions and
contexts, and the proportion of URLs that contain %-escaped non-UTF-8
octets depends on the context. Around May 2007, in the context of HTML
attribute values that normally carry URLs (e.g. "href" in the "a"
element), we found the following proportions in a sample of Google's
index ("raw" means not %-escaped):

1.2% non-ascii query
0.74% escaped non-ascii query
0.44% escaped non-utf-8 query
0.48% raw non-ascii query
0.44% raw non-utf-8 query

1.1% non-ascii path
0.9% escaped non-ascii path
0.18% escaped non-utf-8 path
0.2% raw non-ascii path
0.099% raw non-utf-8 path

0.0075% non-ascii host (including punycode)
0.000064% escaped non-ascii host
0.000032% escaped non-utf-8 host
0.0026% raw non-ascii host
0.0023% raw non-utf-8 host
0.002% still non-ascii after Nameprep (RFC 3491)
0.0054% punycode (xn--...)
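
To make the category names above concrete, here is a rough sketch of
how one URL component could be labelled. This is only an illustration
of the definitions, not the code used to produce the numbers; the
function name classify and the label strings are made up for this
example, and it assumes Python 3's urllib.parse.unquote_to_bytes.

```python
import re
from urllib.parse import unquote_to_bytes

# One or more consecutive %XX escapes.
_ESCAPED_RUN = re.compile(rb"(?:%[0-9A-Fa-f]{2})+")
# One or more consecutive unescaped octets outside ASCII.
_RAW_RUN = re.compile(rb"[\x80-\xff]+")


def _is_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def classify(component: bytes) -> set:
    """Label one URL component (hypothetical helper, for illustration)."""
    labels = set()

    # "Raw" octets appear in the component without %-escaping.
    for run in _RAW_RUN.findall(component):
        labels.add("raw non-ascii")
        if not _is_utf8(run):
            labels.add("raw non-utf-8")

    # "Escaped" octets only show up after undoing the %XX escapes.
    for run in _ESCAPED_RUN.findall(component):
        octets = unquote_to_bytes(run)
        if any(b >= 0x80 for b in octets):
            labels.add("escaped non-ascii")
            if not _is_utf8(octets):
                labels.add("escaped non-utf-8")

    return labels
```

For example, classify(b"q=caf%E9") is labelled both "escaped
non-ascii" and "escaped non-utf-8", while classify(b"q=caf%C3%A9") is
only "escaped non-ascii".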

It is important to note that some HTML implementations (e.g. Firefox)
escape a raw query part, while others (e.g. MSIE) leave it raw when
sending the HTTP request. So if your Python library is intended to
work on the HTTP server side, it must be prepared to accept both raw
and escaped query parts. Also, the query part is sent in the original
encoding of the HTML page, which is not necessarily UTF-8.
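
On the server side, one way to cope with both forms is to reduce
everything to octets first and only then guess the encoding. A minimal
sketch, assuming Python 3's urllib.parse.unquote_to_bytes; the helper
name decode_query_value and the default list of candidate encodings are
my own assumptions, not an existing API:

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(value: bytes,
                       page_encodings=("utf-8", "iso-8859-1")) -> str:
    """Decode a query value that may be raw, %-escaped, or a mix."""
    # unquote_to_bytes undoes %XX escapes and passes raw octets through,
    # so the raw and escaped forms end up as the same byte string.
    octets = unquote_to_bytes(value)

    # Try each candidate page encoding in order. This is a heuristic:
    # iso-8859-1 accepts any octet sequence, so it belongs last.
    for encoding in page_encodings:
        try:
            return octets.decode(encoding)
        except UnicodeDecodeError:
            continue

    # No candidate fit; keep the ASCII part and mark the rest.
    return octets.decode("ascii", errors="replace")
```

With the defaults, b"caf%E9" (escaped), b"caf\xe9" (raw) and
b"caf%C3%A9" (escaped utf-8) all decode to the same string. A server
that knows the encoding of the page containing the form or link should
of course use that instead of guessing.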

Even worse, Firefox 2 converts raw non-utf-8 paths to escaped
non-utf-8, while MSIE converts those to escaped utf-8. Thankfully,
Firefox 3 is now aligned with MSIE.
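
The two path behaviours can be mimicked with urllib.parse.quote. This
is just a sketch: the example path and the iso-8859-1 page encoding are
assumptions chosen for illustration, and it only imitates the outcome
described above, not the browsers' actual code.

```python
from urllib.parse import quote

path = "/caf\u00e9"            # "/caf" + e-acute, as written in the page
page_encoding = "iso-8859-1"   # assumed encoding of the HTML page

# Firefox 2 style: escape the octets of the page's own encoding.
firefox2_style = quote(path.encode(page_encoding))

# MSIE / Firefox 3 style: convert to utf-8 first, then escape.
msie_style = quote(path)

print(firefox2_style)  # /caf%E9
print(msie_style)      # /caf%C3%A9
```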

More recently, the percentages of non-ascii query parts and path parts
have increased (to over 2%), but I don't have the non-utf-8 breakdown
and it wasn't a very large sample. I could run it again if you're
interested, but the bottom line is that escaped non-utf-8 is still
quite prevalent, enough (in my opinion) to require an implementation
in Python, possibly even allowing for different encodings in the path
and query parts (e.g. utf-8 path and gb2312 query).
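
As a rough sketch of what "different encodings in the path and query
parts" could look like from the caller's side, assuming Python 3's
urllib.parse (urlsplit, quote, unquote); the helper
decode_path_and_query and its parameter names are hypothetical:

```python
from urllib.parse import quote, unquote, urlsplit


def decode_path_and_query(url, path_encoding="utf-8",
                          query_encoding="utf-8"):
    """Unescape path and query, each with its own encoding."""
    parts = urlsplit(url)
    # errors="strict" makes a wrong guess raise UnicodeDecodeError
    # instead of silently producing replacement characters.
    path = unquote(parts.path, encoding=path_encoding, errors="strict")
    query = unquote(parts.query, encoding=query_encoding, errors="strict")
    return path, query


# Build a test URL with a utf-8 path and a gb2312 query.
text = "\u4e2d\u6587"
url = ("http://example.com/" + quote(text)
       + "?q=" + quote(text, encoding="gb2312"))

path, query = decode_path_and_query(url, "utf-8", "gb2312")
```

Here path comes back as "/" + text and query as "q=" + text, whereas
decoding the query as utf-8 raises UnicodeDecodeError. Real code would
split the query on "&" and "=" before unescaping, since an escaped
delimiter must not be confused with a real one.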

Erik

On Tue, Aug 12, 2008 at 6:05 AM,  <janssen@parc.xerox.com> wrote:
>
> Hi!
>
> What proportion of URLs that actually travel over the Web contain
> non-UTF-8 octets, percent-encoded?  Anyone have stats on that?
>
> The Python community is re-working the Python standard library API for
> the new major release of Python 3.  One of the things that is changing
> is that there will no longer be automatic coercion between sequences
> of bytes and Unicode strings.
>
> With this, we're looking at the behavior of urllib.unquote(), which is
> a function to take a string containing percent-escaped octets,
> unescape it, and return the result.  The question is, what should the
> type of the result be?  One faction is claiming that there are
> relatively few, almost no, uses for non-UTF-8 percent-escaped octets
> in URLs, so unquote should, by default, (1) create a sequence of
> bytes, and (2) create a string from them by assuming they are a
> UTF-8-encoded string, and (3) return a string value.  This would
> require little change to old naive code that assumed the Python 2 byte
> sequence was a string, but might break in the unlikely event that
> there were non-UTF-8 octets in the URL.  The other faction is claiming
> that there's no way to assume UTF-8, that a sizable proportion of URLs
> that have to be handled are, or will be, formed with non-UTF-8 octets
> in them (perhaps they are urlencoded form submissions from a 8859-1
> page, for instance), and that the default behavior for unquote should
> be to return a sequence of bytes, causing existing naive code that
> assumes a string to break, so that it can be fixed.  We'd like to know
> what the data says.
>
> Bill
>
>
>

Received on Tuesday, 12 August 2008 20:27:27 UTC