- From: Phillips, Addison <addison@amazon.com>
- Date: Tue, 12 Aug 2008 15:26:31 -0700
- To: Erik van der Poel <erikv@google.com>, "janssen@parc.xerox.com" <janssen@parc.xerox.com>
- CC: "www-international@w3.org" <www-international@w3.org>
(personal response)

This is excellent data to have: thanks Erik.

I would hasten to point out a couple of things:

1. MSIE 6 could be set to escape the path portion using the legacy
   (typically non-UTF-8) page encoding. This was the default setting
   for some locale versions of Windows (East Asia, I believe). This
   option still exists (look in Internet Options, under the "advanced"
   tab. In MSIE 7 this is called "Send UTF-8 URLs").

2. A lot of URIs are generated at runtime---and not just the query
   portion. The whole idea behind REST, for example, is to have path
   components that represent resource items. These items can quite
   easily have non-ASCII names. While static HREF-style references may
   only be 2% non-ASCII, your "Web 2.0" application has to be ready
   for non-ASCII objects all the time.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org] On Behalf Of Erik van der Poel
> Sent: Tuesday, August 12, 2008 1:27 PM
> To: janssen@parc.xerox.com
> Cc: www-international@w3.org
> Subject: Re: are there lots of URLs that have non-UTF-8
> percent-encoded octets in them?
>
> Hi Bill,
>
> URLs "travel over the Web" in a number of different directions and
> contexts, and the proportion of URLs that contain %-escaped non-UTF-8
> depends on the context. Around May 2007, in the context of HTML
> attribute values that normally carry URLs (e.g. "href" in the "a"
> tag), we found the following proportions in a sample of Google's
> index ("raw" means not %-escaped):
>
> 1.2% non-ascii query
> 0.74% escaped non-ascii query
> 0.44% escaped non-utf-8 query
> 0.48% raw non-ascii query
> 0.44% raw non-utf-8 query
>
> 1.1% non-ascii path
> 0.9% escaped non-ascii path
> 0.18% escaped non-utf-8 path
> 0.2% raw non-ascii path
> 0.099% raw non-utf-8 path
>
> 0.0075% non-ascii host (including punycode)
> 0.000064% escaped non-ascii host
> 0.000032% escaped non-utf-8 host
> 0.0026% raw non-ascii host
> 0.0023% raw non-utf-8 host
> 0.002% still non-ascii after Nameprep (RFC 3491)
> 0.0054% punycode (xn--...)
>
> It is important to note that some HTML implementations escape a raw
> query part (e.g. Firefox), while others leave them raw when sending
> the HTTP request (e.g. MSIE). So if your Python library is intended
> to work on the HTTP server side, it must be prepared to accept both
> raw and escaped query parts. Also, the query part is sent in the
> original encoding (of the HTML page).
>
> Even worse, Firefox 2 converts raw non-utf-8 paths to escaped
> non-utf-8, while MSIE converts those to escaped utf-8. Thankfully,
> Firefox 3 is now aligned with MSIE.
>
> More recently, the percentages of non-ascii query parts and path
> parts have increased (over 2%), but I don't have the non-utf-8
> breakdown and it wasn't a very large sample. I could run it again if
> you're interested, but the bottom line is that escaped non-utf-8 is
> still quite prevalent, enough (in my opinion) to require an
> implementation in Python, possibly even allowing for different
> encodings in the path and query parts (e.g. utf-8 path and gb2312
> query).
>
> Erik
>
> On Tue, Aug 12, 2008 at 6:05 AM, <janssen@parc.xerox.com> wrote:
> >
> > Hi!
> >
> > What proportion of URLs that actually travel over the Web contain
> > non-UTF-8 octets, percent-encoded? Anyone have stats on that?
> >
> > The Python community is re-working the Python standard library API
> > for the new major release of Python 3. One of the things that is
> > changing is that there will no longer be automatic coercion between
> > sequences of bytes and Unicode strings.
> >
> > With this, we're looking at the behavior of urllib.unquote(), which
> > is a function to take a string containing percent-escaped octets,
> > unescape it, and return the result. The question is, what should
> > the type of the result be? One faction is claiming that there are
> > relatively few, almost no, uses for non-UTF-8 percent-escaped
> > octets in URLs, so unquote should, by default, (1) create a
> > sequence of bytes, and (2) create a string from them by assuming
> > they are a UTF-8-encoded string, and (3) return a string value.
> > This would require little change to old naive code that assumed the
> > Python 2 byte sequence was a string, but might break in the
> > unlikely event that there were non-UTF-8 octets in the URL. The
> > other faction is claiming that there's no way to assume UTF-8, that
> > a sizable proportion of URLs that have to be handled are, or will
> > be, formed with non-UTF-8 octets in them (perhaps they are
> > urlencoded form submissions from a 8859-1 page, for instance), and
> > that the default behavior for unquote should be to return a
> > sequence of bytes, causing existing naive code that assumes a
> > string to break, so that it can be fixed. We'd like to know what
> > the data says.
> >
> > Bill
Received on Tuesday, 12 August 2008 22:44:24 UTC
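
Editorial illustration (not part of the original message): the thread asks whether unquoting should yield bytes or a UTF-8-decoded string, and Erik suggests allowing different encodings for the path and the query. The sketch below shows one way that could look using the Python 3 standard library functions `urllib.parse.unquote_to_bytes` and `urllib.parse.urlsplit`. The helper names `decode_component` and `decode_url_parts`, and the choice of candidate encodings, are assumptions made for this example; they are not an API proposed in the thread.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def decode_component(component, encodings=("utf-8",)):
    """Unescape a URL component to bytes, then try each candidate encoding.

    Hypothetical helper. Returns (text, encoding) on success, or
    (raw_bytes, None) when no candidate encoding decodes the octets.
    """
    raw = unquote_to_bytes(component)  # accepts str or bytes
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing matched: hand the caller the bytes instead of guessing.
    return raw, None


def decode_url_parts(url,
                     path_encodings=("utf-8",),
                     query_encodings=("utf-8", "gb2312")):
    """Decode path and query separately, each with its own candidates.

    Hypothetical helper; the default encodings are assumptions that
    mirror the "utf-8 path and gb2312 query" case from the message.
    """
    parts = urlsplit(url)
    return (decode_component(parts.path, path_encodings),
            decode_component(parts.query, query_encodings))


if __name__ == "__main__":
    # Path escaped as UTF-8, query escaped as GB2312.
    print(decode_url_parts("http://example.com/caf%C3%A9?q=%D6%D0%CE%C4"))
```

Trying UTF-8 first works reasonably well because most legacy-encoded byte sequences are not valid UTF-8, but the fallback order is still a guess; a server that knows the encoding of the page that produced the request should use that instead. Production code would also split the query on `&` and `=` before unescaping, so that escaped delimiters are not confused with real ones.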