> If you happen to want to interpret
them as UTF-16, you are free to do so, but there is not and never will
be any guarantee that all strings are well-formed UTF-16.
You never have that guarantee, any more than you have the guarantee that a
source purporting to be UTF-8 is in fact well-formed. All conscientious
recipients need to check the data -- *if* they are sensitive to ill-formed
text. Luckily, the impact of ill-formed UTF-16 is vastly less than that of
ill-formed UTF-8.
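[That check amounts to scanning the string for unpaired surrogate code units. A minimal sketch in JavaScript; the function name is my own, not from any standard library:]

```javascript
// Returns true if every surrogate code unit in the string is part of a
// proper high/low pair, i.e. the string is well-formed UTF-16.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      // High surrogate: must be immediately followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low surrogate of the pair
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

// "\uD83D\uDE00" is a valid pair (U+1F600); a lone "\uD800" is not.
```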
Mark
On Fri, Oct 30, 2009 at 17:47, John Cowan <cowan@ccil.org> wrote:
> Phillips, Addison scripsit:
>
> > ECMAScript's "firm commitment" to a 16-bit character model (i.e. UTF-16)
>
> If only.
>
> JavaScript and JSON strings aren't sequences of characters, they are
> sequences of 16-bit unsigned integers. If you happen to want to interpret
> them as UTF-16, you are free to do so, but there is not and never will
> be any guarantee that all strings are well-formed UTF-16. What's more,
> the built-in JSON serializer provided by ECMAScript 5th edition does
> not generate escape sequences for isolated surrogate code points, so that
> some strings will be written out in CESU-8 rather than UTF-8.
>
> Worse yet, the JSON RFC is self-contradictory, with the result that it's
> not even clear that CESU-8-encoded JSON is illegal.
>
> --
> Let's face it: software is crap. Feature-laden and bloated, written under
> tremendous time-pressure, often by incapable coders, using dangerous
> languages and inadequate tools, trying to connect to heaps of broken or
> obsolete protocols, implemented equally insufficiently, running on
> unpredictable hardware -- we are all more than used to brokenness.
> --Felix Winkelmann
>
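[Cowan's CESU-8 point can be illustrated by sketching what a serializer does when it encodes each 16-bit code unit independently, as if each were a Unicode scalar value. This is a simplified illustration of the failure mode, not ECMAScript's actual serialization algorithm:]

```javascript
// Naive per-code-unit encoder: treats each 16-bit unit as a scalar value
// and emits its 1-, 2-, or 3-byte UTF-8 form. For surrogates this yields
// CESU-8, because a surrogate pair is encoded as two 3-byte sequences
// instead of one 4-byte UTF-8 sequence.
function encodeCodeUnitAsUTF8(u) {
  if (u < 0x80) return [u];
  if (u < 0x800) return [0xC0 | (u >> 6), 0x80 | (u & 0x3F)];
  return [0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)];
}

// U+1F600 is the surrogate pair D83D DE00. Encoded unit by unit it
// becomes six CESU-8 bytes (ED A0 BD ED B8 80), not the four-byte
// UTF-8 sequence F0 9F 98 80.
const bytes = [0xD83D, 0xDE00].flatMap(encodeCodeUnitAsUTF8);
```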