- From: <janssen@parc.xerox.com>
- Date: Mon, 11 Aug 2008 21:05:37 PDT
- To: www-international@w3.org
Hi!

What proportion of URLs that actually travel over the Web contain non-UTF-8 octets, percent-encoded? Anyone have stats on that?

The Python community is re-working the Python standard library API for the new major release of Python 3. One of the things that is changing is that there will no longer be automatic coercion between sequences of bytes and Unicode strings. With this, we're looking at the behavior of urllib.unquote(), which is a function to take a string containing percent-escaped octets, unescape it, and return the result. The question is, what should the type of the result be?

One faction is claiming that there are relatively few, almost no, uses for non-UTF-8 percent-escaped octets in URLs, so unquote should, by default, (1) create a sequence of bytes, (2) create a string from them by assuming they are a UTF-8-encoded string, and (3) return a string value. This would require little change to old naive code that assumed the Python 2 byte sequence was a string, but might break in the unlikely event that there were non-UTF-8 octets in the URL.

The other faction is claiming that there's no way to assume UTF-8, that a sizable proportion of URLs that have to be handled are, or will be, formed with non-UTF-8 octets in them (perhaps they are urlencoded form submissions from an 8859-1 page, for instance), and that the default behavior for unquote should be to return a sequence of bytes, causing existing naive code that assumes a string to break, so that it can be fixed.

We'd like to know what the data says.

Bill
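
The two proposed defaults can be sketched roughly as follows. This is only an illustration, assuming the `urllib.parse.unquote_to_bytes` function found in current Python 3 releases; the helper names `unquote_as_text` and `unquote_as_bytes` and the sample URLs are hypothetical, chosen to show where each default breaks.

```python
from urllib.parse import unquote_to_bytes


def unquote_as_text(url):
    """First proposal (hypothetical helper): unescape to bytes,
    assume the octets are UTF-8, and return a str."""
    raw = unquote_to_bytes(url)
    # Raises UnicodeDecodeError if the octets are not valid UTF-8.
    return raw.decode("utf-8")


def unquote_as_bytes(url):
    """Second proposal (hypothetical helper): return the raw octets
    and leave the choice of encoding to the caller."""
    return unquote_to_bytes(url)


utf8_url = "/caf%C3%A9"   # "é" percent-encoded as UTF-8
latin1_url = "/caf%E9"    # "é" percent-encoded as ISO 8859-1

# The UTF-8 URL decodes cleanly under the first proposal.
text = unquote_as_text(utf8_url)

# The 8859-1 URL breaks the UTF-8 assumption.
try:
    unquote_as_text(latin1_url)
    latin1_failed = False
except UnicodeDecodeError:
    latin1_failed = True

# Under the second proposal the caller decodes explicitly.
octets = unquote_as_bytes(latin1_url)
recovered = octets.decode("iso-8859-1")
```

With the bytes-returning default, naive code that expects a string fails immediately on every URL; with the string-returning default, it fails only on the URLs that carry non-UTF-8 octets, which is why the real-world proportion of such URLs matters.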
Received on Tuesday, 12 August 2008 16:20:34 UTC