Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them? from Erik van der Poel on 2008-08-13 (www-international@w3.org from July to September 2008)

From: Erik van der Poel <erikv@google.com>
Date: Wed, 13 Aug 2008 11:39:09 +0200
To: "Phillips, Addison" <addison@amazon.com>
Cc: "janssen@parc.xerox.com" <janssen@parc.xerox.com>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <c07a32650808130239t7e654e4clf0cb8bfd1a56e212@mail.gmail.com>
Good points, Addison. Yes, the MSIE UTF-8 option is well-known, but I
wonder how many users actually have it set to non-UTF-8. Does anyone
have numbers?

Erik

On Wed, Aug 13, 2008 at 12:26 AM, Phillips, Addison <addison@amazon.com> wrote:
> (personal response)
>
> This is excellent data to have: thanks Erik.
>
> I would hasten to point out a couple of things:
>
> 1. MSIE 6 could be set to escape the path portion using the legacy (typically non-UTF-8) page encoding. This was the default setting for some locale versions of Windows (East Asia, I believe). This option still exists: look in Internet Options, under the "Advanced" tab. In MSIE 7 this is called "Send UTF-8 URLs".
>
> 2. A lot of URIs are generated at runtime---and not just the query portion. The whole idea behind REST, for example, is to have path components that represent resource items. These items can quite easily have non-ASCII names. While static HREF-style references may only be 2% non-ASCII, your "Web 2.0" application has to be ready for non-ASCII objects all the time.
>
> Addison
>
> Addison Phillips
> Globalization Architect -- Lab126
>
> Internationalization is not a feature.
> It is an architecture.
>
>
>> -----Original Message-----
>> From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Erik van der Poel
>> Sent: Tuesday, August 12, 2008 1:27 PM
>> To: janssen@parc.xerox.com
>> Cc: www-international@w3.org
>> Subject: Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them?
>>
>>
>> Hi Bill,
>>
>> URLs "travel over the Web" in a number of different directions and
>> contexts, and the proportion of URLs that contain %-escaped non-UTF-8
>> depends on the context. Around May 2007, in the context of HTML
>> attribute values that normally carry URLs (e.g. "href" in the "a"
>> tag), we found the following proportions in a sample of Google's index
>> ("raw" means not %-escaped):
>>
>> 1.2% non-ascii query
>> 0.74% escaped non-ascii query
>> 0.44% escaped non-utf-8 query
>> 0.48% raw non-ascii query
>> 0.44% raw non-utf-8 query
>>
>> 1.1% non-ascii path
>> 0.9% escaped non-ascii path
>> 0.18% escaped non-utf-8 path
>> 0.2% raw non-ascii path
>> 0.099% raw non-utf-8 path
>>
>> 0.0075% non-ascii host (including punycode)
>> 0.000064% escaped non-ascii host
>> 0.000032% escaped non-utf-8 host
>> 0.0026% raw non-ascii host
>> 0.0023% raw non-utf-8 host
>> 0.002% still non-ascii after Nameprep (RFC 3491)
>> 0.0054% punycode (xn--...)
>>
>> It is important to note that some HTML implementations escape a raw
>> query part (e.g. Firefox), while others leave it raw when sending
>> the HTTP request (e.g. MSIE). So if your Python library is intended to
>> work on the HTTP server side, it must be prepared to accept both raw
>> and escaped query parts. Also, the query part is sent in the original
>> encoding (of the HTML page).
>>
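A minimal server-side sketch of the point above, in present-day Python 3, which accepts a query part whether it arrives raw or %-escaped. The names `decode_query` and `page_encoding` are hypothetical, and the sketch assumes the server already knows (or guesses) the encoding of the page that produced the request. `urllib.parse.unquote_to_bytes` is the standard-library call; it leaves raw octets untouched and only expands the escapes.

```python
from urllib.parse import unquote_to_bytes


def decode_query(raw_query: bytes, page_encoding: str = "utf-8") -> str:
    """Decode a query part that may mix raw and %-escaped octets.

    raw_query is the query exactly as received on the wire.
    page_encoding is an assumption: the encoding of the originating page.
    """
    # %XX escapes become the octets they stand for; raw octets pass through.
    octets = unquote_to_bytes(raw_query)
    # errors="replace" avoids an exception when the encoding guess is wrong.
    return octets.decode(page_encoding, errors="replace")


# The same ISO-8859-1 query, once escaped and once left raw.
escaped = decode_query(b"q=caf%E9", "iso-8859-1")
raw = decode_query(b"q=caf\xe9", "iso-8859-1")
```

Both calls should yield the same text. Form-style details such as "+" for spaces are deliberately left out of this sketch.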
>> Even worse, Firefox 2 converts raw non-utf-8 paths to escaped
>> non-utf-8, while MSIE converts those to escaped utf-8. Thankfully,
>> Firefox 3 is now aligned with MSIE.
>>
>> More recently, the percentages of non-ascii query parts and path parts
>> have increased (over 2%), but I don't have the non-utf-8 breakdown and
>> it wasn't a very large sample. I could run it again if you're
>> interested, but the bottom line is that escaped non-utf-8 is still
>> quite prevalent, enough (in my opinion) to require an implementation
>> in Python, possibly even allowing for different encodings in the path
>> and query parts (e.g. utf-8 path and gb2312 query).
>>
>> Erik
>>
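The "utf-8 path and gb2312 query" case mentioned above might look like this with present-day Python 3. This is a sketch only: `decode_path_and_query` and its parameters are hypothetical names, example.com is a placeholder host, and the caller is assumed to know which encoding applies to which part. `urlsplit` and the `encoding`/`errors` parameters of `unquote` come from the standard `urllib.parse` module.

```python
from urllib.parse import unquote, urlsplit


def decode_path_and_query(url, path_encoding="utf-8", query_encoding="utf-8"):
    """Unescape the path and query of a URL with separate encodings."""
    parts = urlsplit(url)
    # Each component is unescaped on its own, so the encodings can differ.
    path = unquote(parts.path, encoding=path_encoding, errors="replace")
    query = unquote(parts.query, encoding=query_encoding, errors="replace")
    return path, query


# %E4%B8%AD is U+4E2D in UTF-8; %D6%D0 is the same character in GB2312.
path, query = decode_path_and_query(
    "http://example.com/%E4%B8%AD?q=%D6%D0",
    path_encoding="utf-8",
    query_encoding="gb2312",
)
```

Decoding the whole URL with a single encoding would turn one of the two parts into replacement characters, which is why the two components are handled separately here.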
>> On Tue, Aug 12, 2008 at 6:05 AM,  <janssen@parc.xerox.com> wrote:
>> >
>> > Hi!
>> >
>> > What proportion of URLs that actually travel over the Web contain
>> > non-UTF-8 octets, percent-encoded?  Anyone have stats on that?
>> >
>> > The Python community is re-working the Python standard library API for
>> > the new major release of Python 3.  One of the things that is changing
>> > is that there will no longer be automatic coercion between sequences
>> > of bytes and Unicode strings.
>> >
>> > With this, we're looking at the behavior of urllib.unquote(), which is
>> > a function to take a string containing percent-escaped octets,
>> > unescape it, and return the result.  The question is, what should the
>> > type of the result be?  One faction is claiming that there are
>> > relatively few, almost no, uses for non-UTF-8 percent-escaped octets
>> > in URLs, so unquote should, by default, (1) create a sequence of
>> > bytes, and (2) create a string from them by assuming they are a
>> > UTF-8-encoded string, and (3) return a string value.  This would
>> > require little change to old naive code that assumed the Python 2 byte
>> > sequence was a string, but might break in the unlikely event that
>> > there were non-UTF-8 octets in the URL.  The other faction is claiming
>> > that there's no way to assume UTF-8, that a sizable proportion of URLs
>> > that have to be handled are, or will be, formed with non-UTF-8 octets
>> > in them (perhaps they are urlencoded form submissions from an 8859-1
>> > page, for instance), and that the default behavior for unquote should
>> > be to return a sequence of bytes, causing existing naive code that
>> > assumes a string to break, so that it can be fixed.  We'd like to know
>> > what the data says.
>> >
>> > Bill
>> >
>> >
>> >
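The two defaults debated above can be compared side by side in present-day Python 3, where `urllib.parse.unquote` returns a string (UTF-8 by default, with undecodable octets replaced) and `urllib.parse.unquote_to_bytes` returns the octets. The helper `unquote_prefer_utf8` and its `fallback_encoding` parameter are hypothetical, shown only as one possible middle ground, not as what either faction proposed.

```python
from urllib.parse import unquote, unquote_to_bytes


def unquote_prefer_utf8(escaped: str, fallback_encoding: str = "iso-8859-1") -> str:
    """Try strict UTF-8 first, then fall back to an assumed legacy encoding."""
    octets = unquote_to_bytes(escaped)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy encoding supplied by the caller.
        return octets.decode(fallback_encoding)


as_bytes = unquote_to_bytes("caf%E9")       # bytes result: nothing is assumed
as_text = unquote("caf%C3%A9")              # str result: UTF-8 escapes decode cleanly
lossy = unquote("caf%E9")                   # str result: the 8859-1 octet is replaced
recovered = unquote_prefer_utf8("caf%E9")   # falls back to ISO-8859-1
```

The string-returning default keeps naive code working but silently loses the 8859-1 octet, while the bytes-returning form forces the caller to pick an encoding.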
>
>
Received on Wednesday, 13 August 2008 09:39:56 UTC