- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Tue, 3 Jan 2012 10:50:26 +0200
On Tue, Jan 3, 2012 at 10:33 AM, Henri Sivonen <hsivonen@iki.fi> wrote:

> A solution that would border on reasonable would be decoding as
> US-ASCII up to the first non-ASCII byte and then deciding between
> UTF-8 and the locale-specific legacy encoding by examining the first
> non-ASCII byte and up to 3 bytes after it to see if they form a valid
> UTF-8 byte sequence. But trying to gain more statistical confidence
> about UTF-8ness than that would be bad for performance (either due to
> stalling stream processing or due to reloading).

And it's worth noting that the above paragraph states a "solution" only to the problem: "How to make it possible to use UTF-8 without declaring it?" Adding autodetection wouldn't actually force authors to use UTF-8, so the problem Faruk stated at the start of the thread (authors not using UTF-8 throughout systems that process user input) still wouldn't be solved.

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
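A minimal sketch of the heuristic quoted above, in Python. The function name `sniff_encoding` and the windows-1252 fallback (standing in for "the locale-specific legacy encoding") are illustrative assumptions, not anything specified in the thread, and the continuation-byte check is deliberately simplified:

```python
# Sketch of the detection heuristic from the quoted paragraph, assuming
# a hypothetical windows-1252 fallback for "the locale-specific legacy
# encoding".

def sniff_encoding(data: bytes, legacy_fallback: str = "windows-1252") -> str:
    """Guess between UTF-8 and a legacy encoding by looking only at the
    first non-ASCII byte and at most 3 bytes after it."""
    for i, b in enumerate(data):
        if b < 0x80:
            continue  # still plain US-ASCII; keep scanning
        # First non-ASCII byte: does it start a valid UTF-8 sequence?
        if 0xC2 <= b <= 0xDF:
            need = 1  # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2  # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3  # lead byte of a 4-byte sequence
        else:
            return legacy_fallback  # not a valid UTF-8 lead byte
        tail = data[i + 1 : i + 1 + need]
        # Simplified continuation check: a strict validator would also
        # reject overlong and surrogate forms.
        if len(tail) == need and all(0x80 <= t <= 0xBF for t in tail):
            return "utf-8"
        return legacy_fallback
    return "us-ascii"  # no non-ASCII bytes at all

assert sniff_encoding(b"plain ascii") == "us-ascii"
assert sniff_encoding("naïve".encode("utf-8")) == "utf-8"
assert sniff_encoding("naïve".encode("windows-1252")) == "windows-1252"
```

Note that this decision buffers at most 3 bytes past the first non-ASCII byte, which is why the quoted paragraph contrasts it with approaches that gain more statistical confidence at the cost of stalling stream processing or reloading.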
Received on Tuesday, 3 January 2012 00:50:26 UTC