Re: non ascii character in headers? from Joseph Holsten on 2010-03-03 (www-talk@w3.org from March to April 2010)

From: Joseph Holsten <joseph@josephholsten.com>
Date: Wed, 3 Mar 2010 11:43:45 -0600
To: Reinier Post <rp@win.tue.nl>
Cc: www-talk@w3.org
Message-Id: <989CD725-7DA3-4A53-8049-308462CF5D55@josephholsten.com>

On Mar 3, 2010, at 3:41 AM, Reinier Post wrote:

> On Tue, Mar 02, 2010 at 02:17:13PM +0100, Julian Reschke wrote:
>> On 02.03.2010 00:49, Brendan Miller wrote:
>>> I'm looking at a possible bug in my company's http handling library.
>>> The code seems to assume that there are no bytes with the higher order
>>> bit set in the http Location header. I'm thinking this will break if
>>> the Location header's URI contains non-ascii characters.
>> 
>> In which case it wouldn't be a valid URI.
>> 
>>> Is my thinking correct, or is there some rule that prohibits non-ascii
>>> chars in an http header?
>> 
>> Valid URIs never contain non-ASCII characters.
> 
> This is not true, see section 2.1 of the spec:
> 
>  http://www.ietf.org/rfc/rfc2396.txt

IRIs that contain non-ASCII characters have to be mapped to a pure-ASCII URI. Components containing domain names are typically Punycode-encoded; other components are typically encoded as UTF-8 and then percent-encoded. Admittedly this stuff is confusing and spread across a number of specs.

But you shouldn't be getting URIs with non-ASCII characters from across the wire. If you are getting bytes with the high bit set, you've got a problem, because you don't know how the characters are encoded. Maybe it's UTF-8, but who really knows? It's a non-standard response. You might have luck treating it as a UTF-8 encoded IRI and mapping the IRI to a URI as per RFC 3987 §3.1.
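
For illustration, here is a rough Python sketch of that fallback, not a complete RFC 3987 implementation. It assumes the raw Location value really is UTF-8 (it raises if not), uses Python's built-in "idna" codec for the host (which implements the older IDNA 2003 rules), and percent-encodes only the non-ASCII characters elsewhere. The function names are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def escape_non_ascii(text: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character.

    ASCII characters, including existing %XX escapes, are left untouched.
    """
    return "".join(
        ch if ord(ch) < 128
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in text
    )


def netloc_to_ascii(netloc: str) -> str:
    """Punycode a non-ASCII host; keep userinfo and port in place."""
    userinfo, at, hostport = netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    if not host.isascii():
        # Assumption: IDNA 2003 via the stdlib codec is good enough here.
        host = host.encode("idna").decode("ascii")
    return escape_non_ascii(userinfo) + at + host + colon + port


def iri_to_uri(iri: str) -> str:
    """Map an IRI to an ASCII-only URI, roughly following RFC 3987 section 3.1."""
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        netloc_to_ascii(parts.netloc),
        escape_non_ascii(parts.path),
        escape_non_ascii(parts.query),
        escape_non_ascii(parts.fragment),
    ))


def location_to_uri(raw: bytes) -> str:
    """Treat a raw Location header value as a UTF-8 encoded IRI.

    Raises UnicodeDecodeError when the bytes are not valid UTF-8, in which
    case the encoding is simply unknown.
    """
    return iri_to_uri(raw.decode("utf-8"))


print(location_to_uri(b"http://example.com/caf\xc3\xa9"))
# prints: http://example.com/caf%C3%A9
print(iri_to_uri("http://b\u00fccher.example/"))
# prints: http://xn--bcher-kva.example/
```

A value that is already a plain ASCII URI passes through unchanged, so this is safe to apply to well-behaved responses too.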

If you are hitting this bug in real use, we might be able to help more by knowing what you're dealing with. Otherwise, it's perfectly fine that your code doesn't handle non-ASCII characters in a Location header.
--
Joseph Holsten
http://josephholsten.com

Received on Wednesday, 3 March 2010 17:44:23 UTC