[whatwg] [URL] Starting work on a URL spec from Boris Zbarsky on 2010-07-24 (public-whatwg-archive@w3.org from July 2010)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Sat, 24 Jul 2010 02:02:39 -0400
Message-ID: <4C4A81FF.4060204@mit.edu>

On 7/24/10 1:50 AM, Brett Zamir wrote:
>> I would be particularly interested in data on this last, across
>> different browsers, operating systems, and locales... There seem to be
>> servers out there expecting their URIs in UTF-8 and others expecting
>> them in ISO-8859-1, and it's not clear to me how to make things work
>> with them all.
>
> Seems to me that if they are not in UTF-8, they should be treated as
> bugs, even if that is not a de jure standard.

Treated as bugs by whom?

The scenario is that a user types some non-ASCII text in the url bar. 
This needs to be url-encoded to actually go on the wire, which raises 
the question of what encoding.  If the user is using IRIs, the answer is 
UTF-8.  A number of servers barf if you do this, especially because some 
server-side scripting languages (PHP, e.g., last I checked) default to 
URI-unescaping via something other than UTF-8.
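
To make the mismatch concrete, here is a minimal sketch using Python's standard urllib.parse. The sample word is a hypothetical input, not something from this thread, and the "server" is just a second decode step standing in for whatever the server-side language does by default.

```python
from urllib.parse import quote, unquote

text = "café"  # hypothetical non-ASCII text typed by the user

# The same text yields different bytes on the wire depending on the encoding
utf8_form = quote(text, encoding="utf-8")         # caf%C3%A9
latin1_form = quote(text, encoding="iso-8859-1")  # caf%E9

# A server that unescapes UTF-8 bytes as ISO-8859-1 recovers the wrong text
misread = unquote(utf8_form, encoding="iso-8859-1")

print(utf8_form, latin1_form, misread)
```

The last value comes out as "cafÃ©" rather than "café", which is roughly what a server expecting ISO-8859-1 would hand to its application when the browser sent UTF-8.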

So some browsers encode the non-query part of the URI as UTF-8 and the 
query part as ... something (user's default filesystem encoding, say, 
for lack of a better guess).  Others always use UTF-8 (and end up with 
some servers not usable).  Others... I have no idea.  That's why I want 
data.  ;)  In particular, while the "just use UTF-8, and if the user 
can't access the site, sucks to be the user" approach has a certain 
theoretical-purity appeal, it doesn't seem like something I want to do 
to my friends and family (always a good criterion for things you'd like 
to do to users).
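
The split behavior described above (UTF-8 for the path, some other encoding for the query) might look roughly like the following sketch. The function name encode_url_parts and the ISO-8859-1 default are assumptions made for illustration; this is not a description of what any particular browser actually does.

```python
from urllib.parse import quote

def encode_url_parts(path, query, query_encoding="iso-8859-1"):
    """Hypothetical helper: escape the path as UTF-8, the query as query_encoding."""
    # The path is always escaped as UTF-8, as it would be for an IRI
    encoded_path = quote(path, safe="/", encoding="utf-8")
    # The query uses a guessed legacy encoding; characters that encoding
    # cannot represent raise UnicodeEncodeError (quote's default is strict)
    encoded_query = quote(query, safe="=&", encoding=query_encoding)
    return encoded_path + "?" + encoded_query

mixed = encode_url_parts("/søk", "q=blåbær")
all_utf8 = encode_url_parts("/søk", "q=blåbær", query_encoding="utf-8")
print(mixed)     # /s%C3%B8k?q=bl%E5b%E6r
print(all_utf8)  # /s%C3%B8k?q=bl%C3%A5b%C3%A6r
```

The two results differ only in the query part, so whether a given server can read that part depends on which encoding it assumes when unescaping.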

-Boris

Received on Friday, 23 July 2010 23:02:39 UTC