- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Wed, 10 Nov 2010 21:33:31 -0500
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- CC: public-webapps@w3.org
On 11/10/10 4:39 PM, Bjoern Hoehrmann wrote:
> In most cases you do not need to store the bytes in order to get them
> back, you can just apply the character encoding scheme used to decode
> the bytes to the string and you'll have the original byte string, so
> long as the character encoding scheme is bijective, which is true for
> most of the relevant schemes like UTF-8 and UTF-16.

Neither of those is bijective. In particular, both encoding schemes are not surjective as functions from Unicode strings onto byte streams (that is, there are such things as invalid byte sequences for both of them). Therefore they can't possibly be bijective.

Specifically, invalid byte sequences typically lead to U+FFFD ending up in the Unicode string no matter what the particular values of the invalid bytes were.

> like with UTF-8 encoded strings that are not-wellformed

Right. See above. Note that most cases when the data is really desired as a byte array will in fact not be valid UTF-8.

-Boris
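
To illustrate the point about U+FFFD, here is a minimal Python sketch (not part of the original message). It assumes a lenient decoder that substitutes U+FFFD for invalid bytes, which is what Python's `errors="replace"` handler does; the sample byte strings are made up for the example.

```python
# Two different byte strings, each ending in a byte that is never valid in UTF-8.
# (Hypothetical sample data, chosen only for illustration.)
a = b"abc\xff"
b = b"abc\xfe"

# A lenient decoder replaces each invalid byte with U+FFFD.
sa = a.decode("utf-8", errors="replace")
sb = b.decode("utf-8", errors="replace")

# Both inputs decode to the same string, so the original bytes are lost.
same_string = sa == sb

# Re-encoding yields the UTF-8 encoding of U+FFFD, not the original byte.
round_trip = sa.encode("utf-8")
round_trip_matches = round_trip == a

print(repr(sa), same_string, round_trip, round_trip_matches)
```

Since `a` and `b` decode to the same string, no encoder applied to that string can recover which of the two byte sequences was the input.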
Received on Thursday, 11 November 2010 02:34:05 UTC