RE: are there lots of URLs that have non-UTF-8 percent-encoded octets in them?

(personal response)

For the authority (server name) portion of a URI, RFC 3986 is pretty clear that UTF-8 must be used for non-ASCII values (assuming, for a moment, that IDNA addresses are not Punycode encoded already). For the path portion of URIs, a large-ish proportion of them are, indeed, UTF-8 encoded because that has been the de facto standard in Web browsers for a number of years now. For the query and fragment parts, however, the encoding is determined by context and often depends on the encoding of some page that contains the form from which the data is taken. Thus, a large number of URIs contain non-UTF-8 percent-encoded octets.

I should point out that UTF-8 is quite detectable. For sequences of more than two non-ASCII characters (not bytes, please note), if the bit pattern matches that of UTF-8, the text is highly likely to actually be UTF-8. If it does not match the UTF-8 bit pattern, that too is usually detectable.
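
As a rough illustration (my own sketch, not part of any standard API): in Python, the detection described above amounts to attempting a strict UTF-8 decode of the percent-decoded octets and seeing whether it fails:

    # Sketch: test whether percent-decoded octets are plausibly UTF-8.
    # Any malformed sequence raises UnicodeDecodeError, so a successful
    # strict decode is strong evidence of UTF-8 for non-trivial input.
    def looks_like_utf8(octets):
        try:
            octets.decode('utf-8', 'strict')
            return True
        except UnicodeDecodeError:
            return False

    looks_like_utf8(b'caf\xc3\xa9')   # True  -- "cafe" with e-acute, UTF-8 encoded
    looks_like_utf8(b'caf\xe9')       # False -- the same text in ISO 8859-1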

Detecting the character encoding used in a non-UTF-8 URI is almost impossible with any degree of accuracy. Encoding detection is normally done heuristically, and its accuracy depends on the length of the content available. Since URIs are almost always far shorter than such heuristics need, it is unlikely that one will contain enough non-ASCII data to achieve any accuracy. Usually, the encoding of a non-UTF-8 URI is set via "prior agreement": the submitter of the URI and the receiver have "agreed" (whether they know it or not) on a specific encoding.
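
For what it's worth, here is a small sketch (mine, using the third-party chardet package purely for illustration) of why heuristic detection gets nowhere on something as short as a typical URI component:

    # Sketch: heuristic detection on a few percent-decoded octets.
    # chardet is a third-party detector; on input this short the guess
    # and its reported confidence are essentially meaningless.
    import chardet

    octets = b'caf\xe9'              # hypothetical ISO 8859-1 query value
    print(chardet.detect(octets))    # some guess, typically with low confidence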

Personally, I would make unquote() take an encoding argument with a default of UTF-8. Catching a decoding error and trying a different encoding is a better code pattern than just handing off a stream of "decoded" bytes and letting/forcing the application to guess. If one wants the original bytes back, use ISO 8859-1 as the encoding to obtain the byte values (as characters).
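
To make that concrete, here is a sketch of the pattern I have in mind (the names follow what the Python 3 proposal is heading toward -- urllib.parse.unquote with encoding and errors arguments -- but read it as an illustration of the pattern rather than a statement of the final API):

    # Sketch: unquote() with an encoding parameter defaulting to UTF-8,
    # falling back to an encoding known by "prior agreement" on failure.
    from urllib.parse import unquote

    raw = 'caf%E9'                   # hypothetical input, Latin-1 percent-encoded

    try:
        text = unquote(raw, encoding='utf-8', errors='strict')
    except UnicodeDecodeError:
        text = unquote(raw, encoding='iso-8859-1')

    # To recover the original octet values as characters, decode with
    # ISO 8859-1, which maps each byte 0x00-0xFF to the same code point.
    octets_as_chars = unquote(raw, encoding='iso-8859-1')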

I should note: some encodings (UTF-8 is one) are patterned such that improper byte sequences cause a decoding error, and catching an error is better than getting undifferentiated bytes. However, the preponderance of encodings will map nearly any sequence of bytes to a sequence of characters. Because nearly all URIs that contain non-UTF-8 non-ASCII sequences rely on "prior agreement", the developer should be able to supply an appropriate encoding in most cases. Where the developer cannot provide an encoding, guessing is about all one can do.
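
A tiny sketch of that difference (the example bytes are arbitrary):

    # Sketch: a "patterned" encoding (UTF-8) rejects malformed input,
    # while most legacy single-byte encodings accept any byte sequence.
    data = b'\xe9A\xff'              # arbitrary, hypothetical octets

    try:
        data.decode('utf-8')         # raises UnicodeDecodeError
    except UnicodeDecodeError:
        print('not valid UTF-8')

    print(data.decode('iso-8859-1')) # always succeeds
    print(data.decode('cp1251'))     # also "succeeds", just different characters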

Best Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: www-international-request@w3.org [mailto:www-international-
> request@w3.org] On Behalf Of janssen@parc.xerox.com
> Sent: Monday, August 11, 2008 9:06 PM
> To: www-international@w3.org
> Subject: are there lots of URLs that have non-UTF-8 percent-encoded
> octets in them?
> 
> 
> Hi!
> 
> What proportion of URLs that actually travel over the Web contain
> non-UTF-8 octets, percent-encoded?  Anyone have stats on that?
> 
> The Python community is re-working the Python standard library API for
> the new major release of Python 3.  One of the things that is changing
> is that there will no longer be automatic coercion between sequences
> of bytes and Unicode strings.
> 
> With this, we're looking at the behavior of urllib.unquote(), which is
> a function to take a string containing percent-escaped octets,
> unescape it, and return the result.  The question is, what should the
> type of the result be?  One faction is claiming that there are
> relatively few, almost no, uses for non-UTF-8 percent-escaped octets
> in URLs, so unquote should, by default, (1) create a sequence of
> bytes, and (2) create a string from them by assuming they are a
> UTF-8-encoded string, and (3) return a string value.  This would
> require little change to old naive code that assumed the Python 2 byte
> sequence was a string, but might break in the unlikely event that
> there were non-UTF-8 octets in the URL.  The other faction is claiming
> that there's no way to assume UTF-8, that a sizable proportion of URLs
> that have to be handled are, or will be, formed with non-UTF-8 octets
> in them (perhaps they are urlencoded form submissions from an 8859-1
> page, for instance), and that the default behavior for unquote should
> be to return a sequence of bytes, causing existing naive code that
> assumes a string to break, so that it can be fixed.  We'd like to know
> what the data says.
> 
> Bill
> 

Received on Tuesday, 12 August 2008 17:13:08 UTC