Re: Dangers of non-UTF-8 Re: Details on internal encoding declarations from Julian Reschke on 2008-06-28 (public-html@w3.org from June 2008)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sat, 28 Jun 2008 12:16:42 +0200
To: Ian Hickson <ian@hixie.ch>
CC: Alexey Proskuryakov <ap@webkit.org>, Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
Message-ID: <48660F8A.3080300@gmx.de>

Ian Hickson wrote:
> Actually, while this applies to forms (and WF2 mentions it), it doesn't 
> seem to apply to regular links, where unencodable characters just get 
> turned into question marks by IE and Opera. Safari and Mozilla each do 
> their own thing (&-escape and use UTF-8 respectively) so I've gone with 
> the more interoperable IE/Opera behaviour in the spec.

According to 
<http://lists.w3.org/Archives/Public/public-html/2008Jun/0358.html>, 
Safari 3 uses question marks.

> This causes minor dataloss (the author has to go out of his way to include 
> these characters in the first place, and it's obvious in testing), but 
> it's not as bad as data corruption (there's no way for the server to know 
> on a byte-by-byte basis what encoding Mozilla's using) or data ambiguation 
> (there's no way to know if the original in "?%26%239786%3B" was a smiley 
> or the string "&#9786;", something which has affected me as a real user 
> before when I've been typing in comments and searches for strings of that 
> form, and had the server turn them into non-ASCII Unicode characters).
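
To make the three behaviours concrete, here is a rough sketch, not a model of any browser's actual code. The function name, the strategy labels and the ISO-8859-1 document encoding are assumptions made up for illustration, and percent-escaping the "?" is a simplification.

```python
from urllib.parse import quote

def encode_query_text(text, document_encoding, strategy):
    """Percent-encode query text the way each described behaviour would.

    The strategy labels are hypothetical names for this sketch only.
    """
    if strategy == "question-mark":      # behaviour described for IE/Opera
        data = text.encode(document_encoding, errors="replace")
    elif strategy == "ncr":              # the "&-escape" behaviour
        data = text.encode(document_encoding, errors="xmlcharrefreplace")
    elif strategy == "utf-8":            # behaviour described for Mozilla
        data = text.encode("utf-8")
    else:
        raise ValueError("unknown strategy: " + strategy)
    return quote(data, safe="")

smiley = "\u263a"  # WHITE SMILING FACE, not representable in ISO-8859-1
lossy = encode_query_text(smiley, "iso-8859-1", "question-mark")
escaped = encode_query_text(smiley, "iso-8859-1", "ncr")
as_utf8 = encode_query_text(smiley, "iso-8859-1", "utf-8")

# The literal seven-character string "&#9786;" encodes to the same bytes
# as the escaped smiley, which is the ambiguity mentioned above.
literal = encode_query_text("&#9786;", "iso-8859-1", "ncr")
```

With these inputs, `escaped` and `literal` come out identical, so a server cannot tell which one the user typed.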

I would think that both data loss (IE/Safari/Opera) and what you call 
"data corruption" (FF) are bad. As a matter of fact, the latter may be 
less harmful, as servers can first try UTF-8 and then fall back to the 
document encoding (and I know some servers already do that).
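
A minimal sketch of that server-side fallback, assuming the server knows which encoding the linking document used. The function name and the windows-1252 default are made up for illustration and are not taken from any particular server.

```python
from urllib.parse import unquote_to_bytes

def decode_query_value(raw, document_encoding="windows-1252"):
    """Decode a percent-encoded query value: UTF-8 first, then fallback.

    Heuristic only: a legacy-encoded byte sequence that happens to be
    valid UTF-8 will be decoded as UTF-8.
    """
    data = unquote_to_bytes(raw)
    try:
        # Strict decoding, so invalid UTF-8 raises instead of guessing.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the document encoding was used.
        return data.decode(document_encoding, errors="replace")

from_utf8 = decode_query_value("%E2%98%BA")   # UTF-8 bytes for a smiley
from_legacy = decode_query_value("caf%E9")    # windows-1252 byte for e-acute
```

This works in practice because non-ASCII text in a legacy single-byte encoding is rarely also well-formed UTF-8, but it remains a guess and not a guarantee.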

On the other hand, documenting something that is clearly broken seems to 
be the wrong approach to me, in particular as we have proof that there 
currently isn't any reliable interoperability for this edge case.

It would be interesting to know how many pages out there contain 
characters in query parts of links that can't be represented in the 
document encoding. Only these would be broken if the saner FF approach 
were used (and these pages may *already* be broken in FF as of today).
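
A survey of that kind could start from a per-link check along these lines. This is only a sketch: the function name is hypothetical, and it assumes the href has already been extracted from the page and that the document encoding is known.

```python
from urllib.parse import urlsplit

def unencodable_query_chars(href, document_encoding):
    """Return the query characters that the document encoding cannot represent."""
    query = urlsplit(href).query
    bad = []
    for ch in query:
        try:
            ch.encode(document_encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

affected = unencodable_query_chars(
    "http://example.org/search?q=caf\u00e9\u263a", "iso-8859-1")
unaffected = unencodable_query_chars(
    "http://example.org/search?q=caf\u00e9", "iso-8859-1")
```

A page would count as affected only if at least one of its links yields a non-empty list.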

BR, Julian

Received on Saturday, 28 June 2008 10:17:25 UTC