Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them? from Frank Tang on 2008-08-13 (www-international@w3.org from July to September 2008)

From: Frank Tang <franktang@gmail.com>
Date: Wed, 13 Aug 2008 20:19:12 +0800
To: "Erik van der Poel" <erikv@google.com>
Cc: "Phillips, Addison" <addison@amazon.com>, "janssen@parc.xerox.com" <janssen@parc.xerox.com>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <2e4dfd690808130519i8d9db95sbe16c54f6b3b5787@mail.gmail.com>

I also remember that the default value of this option used to differ
depending on the language version of IE. I am not sure whether that
language-dependent default still holds.
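
To make the effect of that option concrete, here is a minimal sketch in
Python 3 of how the same path could be escaped in two different ways,
depending on whether the browser sends UTF-8 or the legacy page encoding.
The helper name escape_path, the send_utf8 flag, and the choice of big5 as
the legacy encoding are assumptions made for illustration only; they are not
taken from MSIE itself.

```python
from urllib.parse import quote


def escape_path(path: str, send_utf8: bool, legacy_encoding: str = "big5") -> str:
    """Percent-escape a path roughly as a browser might.

    Hypothetical helper: send_utf8 stands in for the "Send UTF-8 URLs"
    option, and legacy_encoding is an assumed page encoding.
    """
    encoding = "utf-8" if send_utf8 else legacy_encoding
    # Keep "/" unescaped so the path structure survives.
    return quote(path, safe="/", encoding=encoding)


utf8_form = escape_path("/譚", send_utf8=True)
legacy_form = escape_path("/譚", send_utf8=False)
print(utf8_form)    # escaped UTF-8 octets
print(legacy_form)  # escaped legacy (here: Big5) octets
```

A server that receives both forms sees two different octet sequences for
the same resource name, which is why the default matters.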

On Wed, Aug 13, 2008 at 5:39 PM, Erik van der Poel <erikv@google.com> wrote:

>
> Good points, Addison. Yes, the MSIE UTF-8 option is well-known, but I
> wonder how many users actually have it set to non-UTF-8. Does anyone
> have numbers?
>
> Erik
>
> On Wed, Aug 13, 2008 at 12:26 AM, Phillips, Addison <addison@amazon.com>
> wrote:
> > (personal response)
> >
> > This is excellent data to have: thanks Erik.
> >
> > I would hasten to point out a couple of things:
> >
> > 1. MSIE 6 could be set to escape the path portion using the legacy
> > (typically non-UTF-8) page encoding. This was the default setting for some
> > locale versions of Windows (East Asia, I believe). This option still exists
> > (look in Internet Options, under the "advanced" tab. In MSIE 7 this is
> > called "Send UTF-8 URLs").
> >
> > 2. A lot of URIs are generated at runtime---and not just the query
> > portion. The whole idea behind REST, for example, is to have path components
> > that represent resource items. These items can quite easily have non-ASCII
> > names. While static HREF-style references may only be 2% non-ASCII, your
> > "Web 2.0" application has to be ready for non-ASCII objects all the time.
> >
> > Addison
> >
> > Addison Phillips
> > Globalization Architect -- Lab126
> >
> > Internationalization is not a feature.
> > It is an architecture.
> >
> >
> >> -----Original Message-----
> >> From: www-international-request@w3.org
> >> [mailto:www-international-request@w3.org] On Behalf Of Erik van der Poel
> >> Sent: Tuesday, August 12, 2008 1:27 PM
> >> To: janssen@parc.xerox.com
> >> Cc: www-international@w3.org
> >> Subject: Re: are there lots of URLs that have non-UTF-8 percent-encoded
> >> octets in them?
> >>
> >>
> >> Hi Bill,
> >>
> >> URLs "travel over the Web" in a number of different directions and
> >> contexts, and the proportion of URLs that contain %-escaped non-UTF-8
> >> depends on the context. Around May 2007, in the context of HTML
> >> attribute values that normally carry URLs (e.g. "href" in the "a"
> >> tag), we found the following proportions in a sample of Google's index
> >> ("raw" means not %-escaped):
> >>
> >> 1.2% non-ascii query
> >> 0.74% escaped non-ascii query
> >> 0.44% escaped non-utf-8 query
> >> 0.48% raw non-ascii query
> >> 0.44% raw non-utf-8 query
> >>
> >> 1.1% non-ascii path
> >> 0.9% escaped non-ascii path
> >> 0.18% escaped non-utf-8 path
> >> 0.2% raw non-ascii path
> >> 0.099% raw non-utf-8 path
> >>
> >> 0.0075% non-ascii host (including punycode)
> >> 0.000064% escaped non-ascii host
> >> 0.000032% escaped non-utf-8 host
> >> 0.0026% raw non-ascii host
> >> 0.0023% raw non-utf-8 host
> >> 0.002% still non-ascii after Nameprep (RFC 3491)
> >> 0.0054% punycode (xn--...)
> >>
> >> It is important to note that some HTML implementations escape a raw
> >> query part (e.g. Firefox), while others leave them raw when sending
> >> the HTTP request (e.g. MSIE). So if your Python library is intended to
> >> work on the HTTP server side, it must be prepared to accept both raw
> >> and escaped query parts. Also, the query part is sent in the original
> >> encoding (of the HTML page).
> >>
> >> Even worse, Firefox 2 converts raw non-utf-8 paths to escaped
> >> non-utf-8, while MSIE converts those to escaped utf-8. Thankfully,
> >> Firefox 3 is now aligned with MSIE.
> >>
> >> More recently, the percentages of non-ascii query parts and path parts
> >> have increased (over 2%), but I don't have the non-utf-8 breakdown and
> >> it wasn't a very large sample. I could run it again if you're
> >> interested, but the bottom line is that escaped non-utf-8 is still
> >> quite prevalent, enough (in my opinion) to require an implementation
> >> in Python, possibly even allowing for different encodings in the path
> >> and query parts (e.g. utf-8 path and gb2312 query).
> >>
> >> Erik
> >>
> >> On Tue, Aug 12, 2008 at 6:05 AM,  <janssen@parc.xerox.com> wrote:
> >> >
> >> > Hi!
> >> >
> >> > What proportion of URLs that actually travel over the Web contain
> >> > non-UTF-8 octets, percent-encoded?  Anyone have stats on that?
> >> >
> >> > The Python community is re-working the Python standard library API for
> >> > the new major release of Python 3.  One of the things that is changing
> >> > is that there will no longer be automatic coercion between sequences
> >> > of bytes and Unicode strings.
> >> >
> >> > With this, we're looking at the behavior of urllib.unquote(), which is
> >> > a function to take a string containing percent-escaped octets,
> >> > unescape it, and return the result.  The question is, what should the
> >> > type of the result be?  One faction is claiming that there are
> >> > relatively few, almost no, uses for non-UTF-8 percent-escaped octets
> >> > in URLs, so unquote should, by default, (1) create a sequence of
> >> > bytes, and (2) create a string from them by assuming they are a
> >> > UTF-8-encoded string, and (3) return a string value.  This would
> >> > require little change to old naive code that assumed the Python 2 byte
> >> > sequence was a string, but might break in the unlikely event that
> >> > there were non-UTF-8 octets in the URL.  The other faction is claiming
> >> > that there's no way to assume UTF-8, that a sizable proportion of URLs
> >> > that have to be handled are, or will be, formed with non-UTF-8 octets
> >> > in them (perhaps they are urlencoded form submissions from a 8859-1
> >> > page, for instance), and that the default behavior for unquote should
> >> > be to return a sequence of bytes, causing existing naive code that
> >> > assumes a string to break, so that it can be fixed.  We'd like to know
> >> > what the data says.
> >> >
> >> > Bill
> >> >
> >> >
> >> >
> >
> >
>
>
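
On the server side, one way to cope with the mix of escaped UTF-8 and
escaped non-UTF-8 discussed in the thread above is to unescape to bytes
first and only then guess the encoding. The following is a rough sketch in
Python 3, not a recommendation for the standard library API. The function
name decode_component and the gb2312 fallback are assumptions chosen for
illustration; urllib.parse.unquote_to_bytes is the Python 3 standard library
call that returns the raw octets.

```python
from urllib.parse import unquote_to_bytes


def decode_component(escaped: str, fallback: str = "gb2312") -> str:
    """Unescape a URL component, trying UTF-8 before a legacy encoding.

    Hypothetical helper: the fallback encoding is an assumption that a
    real server would derive from its own pages or configuration.
    """
    raw = unquote_to_bytes(escaped)
    try:
        # Strict decoding rejects most octet sequences that are not UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back, replacing anything undecodable.
        return raw.decode(fallback, errors="replace")


print(decode_component("%E4%B8%AD"))  # UTF-8 octets
print(decode_component("%D6%D0"))     # GB2312 octets for the same character
```

Because the fallback is a parameter, the path and the query can be decoded
with different assumptions (for example a utf-8 path and a gb2312 query).
The guess is a heuristic: some legacy octet sequences also happen to be
valid UTF-8 and will be decoded as such.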


-- 
Frank Yung-Fong Tang 譚永鋒
Îñţérñåţîöñåļîžåţîöñ

FrankTang@gmail.com
Skype: FrankYungFongTang
Yahoo IM: FrankYungFongTan
MSN IM: FrankYungFongTang@hotmail.com
Received on Wednesday, 13 August 2008 12:19:50 UTC