Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them?

Hello Bill,

I think whether there are 'relatively few' non-UTF-8
percent-escaped octets in URIs depends on what 'relatively' means.

But I would like to note that the IRI spec (RFC 3987,
http://www.ietf.org/rfc/rfc3987.txt)
explicitly allows non-UTF-8-based %-encodings, because
this has always been allowed, and because pre-existing
resources should be addressable for "eternity".

I would therefore recommend that you look at the definition
of URI=>IRI conversion (Section 3.2 in RFC 3987). It is in
some sense similar to your (1)..(3) below, with the difference
that, rather than breaking, it leaves escaped any bytes that
cannot be converted to characters via UTF-8.

This is particularly recommended if the goal is to produce
something as readable and reusable as possible for the end user
without being sure which encoding(s) were used.
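
To illustrate the idea, here is a minimal sketch in Python 3. It is
not the full Section 3.2 procedure: it only handles escapes of octets
0x80 and above, leaves all ASCII escapes (such as %2F) as they are,
and skips the check for characters that are inappropriate in IRIs.
The names unquote_for_display and 'reescape' are made up for this
example.

```python
import codecs
import re

# Runs of percent-escapes whose octets are >= 0x80. Escapes of ASCII
# octets (such as %2F or %25) are deliberately left untouched here.
_HIGH_ESCAPES = re.compile(r'(?:%[89A-Fa-f][0-9A-Fa-f])+')


def _reescape_invalid(exc):
    """Error handler: turn undecodable octets back into %XX escapes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return ''.join('%%%02X' % octet for octet in bad), exc.end


# 'reescape' is a handler name invented for this sketch.
codecs.register_error('reescape', _reescape_invalid)


def unquote_for_display(uri):
    """Decode UTF-8 escapes to characters; keep everything else escaped."""
    def decode_run(match):
        octets = bytes.fromhex(match.group(0).replace('%', ''))
        # Valid UTF-8 becomes characters; invalid octets are re-escaped
        # (with upper-case hex digits) instead of raising an error.
        return octets.decode('utf-8', errors='reescape')
    return _HIGH_ESCAPES.sub(decode_run, uri)
```

With this, unquote_for_display('/caf%C3%A9') gives the readable
'/café', while the ISO-8859-1 form '/caf%E9' comes back unchanged
instead of causing an exception.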

For other purposes (e.g. server-side, where readability may
be less important), other kinds of conversions may be appropriate.

Also, in my experience with escaping and unescaping, it is very
clear to me that no single way of escaping or unescaping is
sufficient. As an example, when escaping a particular text
string for inclusion in a URI or IRI, the exact range of characters
to escape depends on which part of the URI it will be inserted in.
If it is to be escaped as a single path component, you want
to make sure you escape potential '/', but if it's to be
included as part of a path, you want to make sure you don't
escape '/', and so on. Some of this may be combined in the
same function, using additional parameters, but some other
variations may work better with additional functions.
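
As a rough sketch of the "additional parameters" variant, using
urllib.parse.quote and its 'safe' argument as found in Python 3
(the two wrapper names are made up for this example, and the
text is assumed to be encoded as UTF-8, which is quote's default):

```python
from urllib.parse import quote


def quote_path_segment(text):
    """Escape text used as ONE path component: '/' must be escaped."""
    return quote(text, safe='')


def quote_path(text):
    """Escape text used as a whole path: '/' stays a separator."""
    return quote(text, safe='/')


print(quote_path_segment('a/b c'))  # a%2Fb%20c
print(quote_path('a/b c'))          # a/b%20c
```

Other parts of a URI (query, fragment, and so on) would need yet
other sets of characters left unescaped.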

Regards,    Martin.

At 13:05 08/08/12, janssen@parc.xerox.com wrote:
>
>Hi!
>
>What proportion of URLs that actually travel over the Web contain
>non-UTF-8 octets, percent-encoded?  Anyone have stats on that?
>
>The Python community is re-working the Python standard library API for
>the new major release of Python 3.  One of the things that is changing
>is that there will no longer be automatic coercion between sequences
>of bytes and Unicode strings.
>
>With this, we're looking at the behavior of urllib.unquote(), which is
>a function to take a string containing percent-escaped octets,
>unescape it, and return the result.  The question is, what should the
>type of the result be?  One faction is claiming that there are
>relatively few, almost no, uses for non-UTF-8 percent-escaped octets
>in URLs, so unquote should, by default, (1) create a sequence of
>bytes, and (2) create a string from them by assuming they are a
>UTF-8-encoded string, and (3) return a string value.  This would
>require little change to old naive code that assumed the Python 2 byte
>sequence was a string, but might break in the unlikely event that
>there were non-UTF-8 octets in the URL.  The other faction is claiming
>that there's no way to assume UTF-8, that a sizable proportion of URLs
>that have to be handled are, or will be, formed with non-UTF-8 octets
>in them (perhaps they are urlencoded form submissions from a 8859-1
>page, for instance), and that the default behavior for unquote should
>be to return a sequence of bytes, causing existing naive code that
>assumes a string to break, so that it can be fixed.  We'd like to know
>what the data says.
>
>Bill


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Friday, 22 August 2008 08:00:17 UTC