Re: Dangers of non-UTF-8 Re: Details on internal encoding declarations from Ian Hickson on 2008-06-28 (public-html@w3.org from June 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Sat, 28 Jun 2008 09:07:34 +0000 (UTC)
To: Alexey Proskuryakov <ap@webkit.org>
Cc: Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
Message-ID: <Pine.LNX.4.62.0806280852120.17498@hixie.dreamhostps.com>

On Fri, 23 May 2008, Alexey Proskuryakov wrote:
> On May 23, 2008, at 1:15 PM, Henri Sivonen wrote:
> > 
> > Note: When the document is not encoded as UTF-8, IRIs are not 
> > converted to URIs properly and data loss happens in form 
> > submissions when the user enters characters that cannot be mapped to 
> > bytes using the encoding of the document.
> 
> FWIW, Firefox and Safari (not sure about IE) encode form data using 
> numeric entities in this case, so data loss doesn't happen. Not all 
> servers handle this correctly, but some do (e.g. [Google]).

Actually, while this applies to forms (and WF2 mentions it), it doesn't 
seem to apply to regular links, where unencodable characters just get 
turned into question marks by IE and Opera. Safari and Mozilla each do 
their own thing (&-escape and use UTF-8 respectively) so I've gone with 
the more interoperable IE/Opera behaviour in the spec.
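
To make the three behaviours concrete, here is a rough Python sketch of how 
an unencodable character in a link's query string could come out under each 
approach. The function name, the strategy labels and the choice of 
ISO-8859-1 as the document encoding are mine, for illustration only; this 
is not what any browser literally runs.

```python
from urllib.parse import quote


def encode_query_value(value: str, doc_encoding: str, strategy: str) -> str:
    """Percent-encode `value` for a URL query (hypothetical helper).

    strategy (labels are illustrative assumptions):
      "question_mark": unencodable characters become "?" (IE/Opera, per the mail)
      "ncr":           unencodable characters become &#NNNN; (the &-escape approach)
      "utf8":          ignore the document encoding and use UTF-8
    """
    if strategy == "question_mark":
        raw = value.encode(doc_encoding, errors="replace")
    elif strategy == "ncr":
        raw = value.encode(doc_encoding, errors="xmlcharrefreplace")
    elif strategy == "utf8":
        raw = value.encode("utf-8")
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # safe="" so that "&", "#", ";" and "?" are percent-encoded too
    return quote(raw, safe="")


if __name__ == "__main__":
    for strategy in ("question_mark", "ncr", "utf8"):
        # U+263A cannot be represented in ISO-8859-1
        print(strategy, encode_query_value("a\u263a", "iso-8859-1", strategy))
```

Only the first variant loses the character, but it is also the only one 
where the server can still decode every byte using the document's encoding 
without guessing.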

This causes minor data loss (the author has to go out of his way to include 
these characters in the first place, and it's obvious in testing), but 
it's not as bad as data corruption (there's no way for the server to know 
on a byte-by-byte basis what encoding Mozilla's using) or data ambiguity 
(there's no way to know if the original in "?%26%239786%3B" was a smiley 
or the string "&#9786;", something which has affected me as a real user 
before when I've been typing in comments and searches for strings of that 
form, and had the server turn them into non-ASCII Unicode characters).
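
The ambiguity is easy to reproduce. A minimal Python sketch, assuming an 
ISO-8859-1 document and a sender that falls back to numeric character 
references (the helper names are hypothetical, not taken from any browser 
or server):

```python
import html
from urllib.parse import quote, unquote


def submit_with_ncr_fallback(text: str, doc_encoding: str = "iso-8859-1") -> str:
    """Encode text as a sender using the &-escape fallback might (assumption)."""
    raw = text.encode(doc_encoding, errors="xmlcharrefreplace")
    return quote(raw, safe="")


def naive_server_decode(value: str, doc_encoding: str = "iso-8859-1") -> str:
    """A server that tries to undo the fallback by expanding references."""
    return html.unescape(unquote(value, encoding=doc_encoding))


smiley = submit_with_ncr_fallback("\u263a")   # the user typed U+263A
literal = submit_with_ncr_fallback("&#9786;")  # the user typed the seven ASCII characters

# Both inputs produce the same bytes on the wire
assert smiley == literal == "%26%239786%3B"

# so a server that expands references turns the literal string into a smiley
assert naive_server_decode(literal) == "\u263a"
```

Once both inputs have collapsed to the same bytes, nothing the server does 
can tell them apart again.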

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Saturday, 28 June 2008 09:08:11 UTC