RE: are there lots of URLs that have non-UTF-8 percent-encoded octets in them? from Phillips, Addison on 2008-08-12 (www-international@w3.org from July to September 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 12 Aug 2008 15:26:31 -0700
To: Erik van der Poel <erikv@google.com>, "janssen@parc.xerox.com" <janssen@parc.xerox.com>
CC: "www-international@w3.org" <www-international@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA014AAD5707@EX-SEA5-D.ant.amazon.com>
(personal response)

This is excellent data to have: thanks Erik.

I would hasten to point out a couple of things:

1. MSIE 6 could be set to escape the path portion using the legacy (typically non-UTF-8) page encoding. This was the default setting for some locale versions of Windows (East Asia, I believe). This option still exists: look in Internet Options, under the "Advanced" tab (in MSIE 7 it is called "Send UTF-8 URLs").

2. A lot of URIs are generated at runtime---and not just the query portion. The whole idea behind REST, for example, is to have path components that represent resource items. These items can quite easily have non-ASCII names. While static HREF-style references may only be 2% non-ASCII, your "Web 2.0" application has to be ready for non-ASCII objects all the time.
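
A minimal Python 3 sketch of both points, under stated assumptions: the same non-ASCII resource name yields different percent-escaped path segments depending on whether the client escapes in UTF-8 or in a legacy page encoding, so a server has to cope with both. The helper names (escape_segment, unescape_segment), the GB2312 fallback, and the sample name are hypothetical and chosen only for illustration. Trying UTF-8 first is a heuristic, not a guarantee, since some legacy byte sequences also happen to be valid UTF-8.

```python
from urllib.parse import quote, unquote_to_bytes


def escape_segment(name, encoding="utf-8"):
    """Percent-escape one path segment using the given character encoding."""
    # safe="" so that a "/" inside a resource name is escaped as well.
    return quote(name, safe="", encoding=encoding)


def unescape_segment(segment, fallback="gb2312"):
    """Unescape to bytes first, then guess the encoding.

    Assumption: UTF-8 is tried first, and a single legacy encoding
    (hypothetically GB2312 here) is used when the bytes are not valid UTF-8.
    """
    raw = unquote_to_bytes(segment)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


name = "中文"  # hypothetical non-ASCII resource name
utf8_form = escape_segment(name)               # escaped as UTF-8 octets
legacy_form = escape_segment(name, "gb2312")   # escaped as legacy octets

print(utf8_form)
print(legacy_form)
print(unescape_segment(utf8_form), unescape_segment(legacy_form))
```

Both escaped forms round-trip to the same name here, but only because the fallback encoding was guessed correctly; in practice the server needs some out-of-band knowledge of which legacy encoding its clients use.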

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Erik van der Poel
> Sent: Tuesday, August 12, 2008 1:27 PM
> To: janssen@parc.xerox.com
> Cc: www-international@w3.org
> Subject: Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them?
> 
> 
> Hi Bill,
> 
> URLs "travel over the Web" in a number of different directions and
> contexts, and the proportion of URLs that contain %-escaped non-UTF-8
> depends on the context. Around May 2007, in the context of HTML
> attribute values that normally carry URLs (e.g. "href" in the "a"
> tag), we found the following proportions in a sample of Google's index
> ("raw" means not %-escaped):
> 
> 1.2% non-ascii query
> 0.74% escaped non-ascii query
> 0.44% escaped non-utf-8 query
> 0.48% raw non-ascii query
> 0.44% raw non-utf-8 query
> 
> 1.1% non-ascii path
> 0.9% escaped non-ascii path
> 0.18% escaped non-utf-8 path
> 0.2% raw non-ascii path
> 0.099% raw non-utf-8 path
> 
> 0.0075% non-ascii host (including punycode)
> 0.000064% escaped non-ascii host
> 0.000032% escaped non-utf-8 host
> 0.0026% raw non-ascii host
> 0.0023% raw non-utf-8 host
> 0.002% still non-ascii after Nameprep (RFC 3491)
> 0.0054% punycode (xn--...)
> 
> It is important to note that some HTML implementations escape a raw
> query part (e.g. Firefox), while others leave them raw when sending
> the HTTP request (e.g. MSIE). So if your Python library is intended to
> work on the HTTP server side, it must be prepared to accept both raw
> and escaped query parts. Also, the query part is sent in the original
> encoding (of the HTML page).
> 
> Even worse, Firefox 2 converts raw non-utf-8 paths to escaped
> non-utf-8, while MSIE converts those to escaped utf-8. Thankfully,
> Firefox 3 is now aligned with MSIE.
> 
> More recently, the percentages of non-ascii query parts and path parts
> have increased (over 2%), but I don't have the non-utf-8 breakdown and
> it wasn't a very large sample. I could run it again if you're
> interested, but the bottom line is that escaped non-utf-8 is still
> quite prevalent, enough (in my opinion) to require an implementation
> in Python, possibly even allowing for different encodings in the path
> and query parts (e.g. utf-8 path and gb2312 query).
> 
> Erik
> 
> On Tue, Aug 12, 2008 at 6:05 AM,  <janssen@parc.xerox.com> wrote:
> >
> > Hi!
> >
> > What proportion of URLs that actually travel over the Web contain
> > non-UTF-8 octets, percent-encoded?  Anyone have stats on that?
> >
> > The Python community is re-working the Python standard library API for
> > the new major release of Python 3.  One of the things that is changing
> > is that there will no longer be automatic coercion between sequences
> > of bytes and Unicode strings.
> >
> > With this, we're looking at the behavior of urllib.unquote(), which is
> > a function to take a string containing percent-escaped octets,
> > unescape it, and return the result.  The question is, what should the
> > type of the result be?  One faction is claiming that there are
> > relatively few, almost no, uses for non-UTF-8 percent-escaped octets
> > in URLs, so unquote should, by default, (1) create a sequence of
> > bytes, and (2) create a string from them by assuming they are a
> > UTF-8-encoded string, and (3) return a string value.  This would
> > require little change to old naive code that assumed the Python 2 byte
> > sequence was a string, but might break in the unlikely event that
> > there were non-UTF-8 octets in the URL.  The other faction is claiming
> > that there's no way to assume UTF-8, that a sizable proportion of URLs
> > that have to be handled are, or will be, formed with non-UTF-8 octets
> > in them (perhaps they are urlencoded form submissions from a 8859-1
> > page, for instance), and that the default behavior for unquote should
> > be to return a sequence of bytes, causing existing naive code that
> > assumes a string to break, so that it can be fixed.  We'd like to know
> > what the data says.
> >
> > Bill
> >
> >
> >
Received on Tuesday, 12 August 2008 22:44:24 UTC