- From: Erik van der Poel <erikv@google.com>
- Date: Wed, 13 Aug 2008 11:39:09 +0200
- To: "Phillips, Addison" <addison@amazon.com>
- Cc: "janssen@parc.xerox.com" <janssen@parc.xerox.com>, "www-international@w3.org" <www-international@w3.org>
Good points, Addison. Yes, the MSIE UTF-8 option is well-known, but I wonder how many users actually have it set to non-UTF-8. Does anyone have numbers?

Erik

On Wed, Aug 13, 2008 at 12:26 AM, Phillips, Addison <addison@amazon.com> wrote:
> (personal response)
>
> This is excellent data to have: thanks Erik.
>
> I would hasten to point out a couple of things:
>
> 1. MSIE 6 could be set to escape the path portion using the legacy (typically non-UTF-8) page encoding. This was the default setting for some locale versions of Windows (East Asia, I believe). This option still exists (look in Internet Options, under the "advanced" tab. In MSIE 7 this is called "Send UTF-8 URLs").
>
> 2. A lot of URIs are generated at runtime---and not just the query portion. The whole idea behind REST, for example, is to have path components that represent resource items. These items can quite easily have non-ASCII names. While static HREF-style references may only be 2% non-ASCII, your "Web 2.0" application has to be ready for non-ASCII objects all the time.
>
> Addison
>
> Addison Phillips
> Globalization Architect -- Lab126
>
> Internationalization is not a feature.
> It is an architecture.
>
>> -----Original Message-----
>> From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Erik van der Poel
>> Sent: Tuesday, August 12, 2008 1:27 PM
>> To: janssen@parc.xerox.com
>> Cc: www-international@w3.org
>> Subject: Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them?
>>
>> Hi Bill,
>>
>> URLs "travel over the Web" in a number of different directions and contexts, and the proportion of URLs that contain %-escaped non-UTF-8 depends on the context. Around May 2007, in the context of HTML attribute values that normally carry URLs (e.g. "href" in the "a" tag), we found the following proportions in a sample of Google's index ("raw" means not %-escaped):
>>
>> 1.2% non-ascii query
>> 0.74% escaped non-ascii query
>> 0.44% escaped non-utf-8 query
>> 0.48% raw non-ascii query
>> 0.44% raw non-utf-8 query
>>
>> 1.1% non-ascii path
>> 0.9% escaped non-ascii path
>> 0.18% escaped non-utf-8 path
>> 0.2% raw non-ascii path
>> 0.099% raw non-utf-8 path
>>
>> 0.0075% non-ascii host (including punycode)
>> 0.000064% escaped non-ascii host
>> 0.000032% escaped non-utf-8 host
>> 0.0026% raw non-ascii host
>> 0.0023% raw non-utf-8 host
>> 0.002% still non-ascii after Nameprep (RFC 3491)
>> 0.0054% punycode (xn--...)
>>
>> It is important to note that some HTML implementations escape a raw query part (e.g. Firefox), while others leave them raw when sending the HTTP request (e.g. MSIE). So if your Python library is intended to work on the HTTP server side, it must be prepared to accept both raw and escaped query parts. Also, the query part is sent in the original encoding (of the HTML page).
>>
>> Even worse, Firefox 2 converts raw non-utf-8 paths to escaped non-utf-8, while MSIE converts those to escaped utf-8. Thankfully, Firefox 3 is now aligned with MSIE.
>>
>> More recently, the percentages of non-ascii query parts and path parts have increased (over 2%), but I don't have the non-utf-8 breakdown and it wasn't a very large sample. I could run it again if you're interested, but the bottom line is that escaped non-utf-8 is still quite prevalent, enough (in my opinion) to require an implementation in Python, possibly even allowing for different encodings in the path and query parts (e.g. utf-8 path and gb2312 query).
>>
>> Erik
>>
>> On Tue, Aug 12, 2008 at 6:05 AM, <janssen@parc.xerox.com> wrote:
>> >
>> > Hi!
>> >
>> > What proportion of URLs that actually travel over the Web contain non-UTF-8 octets, percent-encoded? Anyone have stats on that?
>> >
>> > The Python community is re-working the Python standard library API for the new major release of Python 3. One of the things that is changing is that there will no longer be automatic coercion between sequences of bytes and Unicode strings.
>> >
>> > With this, we're looking at the behavior of urllib.unquote(), which is a function to take a string containing percent-escaped octets, unescape it, and return the result. The question is, what should the type of the result be? One faction is claiming that there are relatively few, almost no, uses for non-UTF-8 percent-escaped octets in URLs, so unquote should, by default, (1) create a sequence of bytes, and (2) create a string from them by assuming they are a UTF-8-encoded string, and (3) return a string value. This would require little change to old naive code that assumed the Python 2 byte sequence was a string, but might break in the unlikely event that there were non-UTF-8 octets in the URL. The other faction is claiming that there's no way to assume UTF-8, that a sizable proportion of URLs that have to be handled are, or will be, formed with non-UTF-8 octets in them (perhaps they are urlencoded form submissions from a 8859-1 page, for instance), and that the default behavior for unquote should be to return a sequence of bytes, causing existing naive code that assumes a string to break, so that it can be fixed. We'd like to know what the data says.
>> >
>> > Bill
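
For readers following the thread, here is a minimal sketch of the two behaviors under discussion, written against the functions found in current Python 3 releases (urllib.parse.unquote, which decodes as UTF-8 by default and replaces undecodable bytes, and urllib.parse.unquote_to_bytes, which returns the raw octets). The helper decode_component, its fallback list of encodings, and the example.com URL are illustrative assumptions, not anything proposed in the messages above. Trying encodings in order is only a heuristic; a real server would normally know the page encoding from context.

```python
from urllib.parse import unquote, unquote_to_bytes, urlsplit


def decode_component(component, encodings=("utf-8", "gb2312")):
    """Hypothetical helper: percent-decode to bytes, then try each
    candidate encoding in order. Returns (text, encoding) on success,
    or (raw_bytes, None) if no candidate decodes cleanly."""
    raw = unquote_to_bytes(component)
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw, None


# Illustrative URL: UTF-8 escaped path, GB2312 escaped query
# (the mixed case Erik mentions).
url = "http://example.com/%E4%B8%AD%E6%96%87?q=%D6%D0%CE%C4"
parts = urlsplit(url)

# "Return bytes" behavior: no encoding is assumed.
query_bytes = unquote_to_bytes(parts.query)

# "Assume UTF-8" behavior: non-UTF-8 octets become U+FFFD by default.
query_assumed_utf8 = unquote(parts.query)

# The caller supplies the encoding explicitly.
query_gb2312 = unquote(parts.query, encoding="gb2312")

# Path and query decoded independently, each with its own encoding.
path_text, path_enc = decode_component(parts.path)
query_text, query_enc = decode_component(parts.query)
```

Decoding the path and query separately is what lets the two parts carry different encodings; a single decode call over the whole URL would have to pick one.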
Received on Wednesday, 13 August 2008 09:39:56 UTC