RE: agenda+ Fwd: Re: language for unicode string [I18N-ACTION-800]

> >
> > Good. You might review [3].
> 
> I'd always assumed that was for exotic operations like `lowercase()`, but I see
> a term called Default Normalization Step
> <https://www.w3.org/TR/charmod-norm/#DefaultNormalizationStep>
> which I read as "do nothing". I assume it
> will cause more confusion to mention this than to elide it. Reasonable?

That's reasonable and what I would do. The point in charmod-norm is to have you positively decide not to normalize.

> 
> 
> > >       * Choosing character encodings: UTF-8. In JS-API, these are
> > >         interpreted as character sequences which have equivalents in
> > >         Javascript's native string format ([5]relevant tests)
> >
> > Do you have a specific pointer? The "hot spot" in here is that Javascript's
> > definition [4] of String is still effectively "UCS-2 friendly". That is, it allows
> > unpaired surrogate code points. These are not valid in UTF-8, although the
> > encoding/decoding of isolated surrogates is straightforward. So some care
> > has to be used here when specifying serialization/deserialization.
> 
> <https://github.com/WebAssembly/spec/blob/master/test/core/names.wast#L1007>
> has scads of stuff outside BMP, e.g. ˺˼𔗏𝅴𝅶𝅸𝅺⁾₎❩❫⟯﴿︶﹚)⦆
> 󠀩❳❵⟧⟩⟫⟭⦈⦊⦖⸣⸥︘︸︺︼︾﹀﹂﹄﹈﹜﹞]}」󠁝󠁽»’”›❯. (Can I claim kilo-
> scads?)

That isn't my point though. Unicode jargon is exceedingly exacting and I apologize in advance for not adding the necessary clarifiers.

Supplementary characters (that is, those beyond the BMP) are not an issue. However, isolated (that is, *unpaired*) surrogate code units are permitted in JavaScript strings. The question is how to deal with them (not allowing them would be fine by me--for security they are often replaced by U+FFFD). So the question is whether you're permitted to have a string like "\uD800 ABCDEFG \uD800\uDC00\uD800" (which starts and ends with an unpaired surrogate, but has a valid surrogate pair in the middle).
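
To make the distinction concrete, here is a minimal sketch of how one might detect unpaired surrogates in a JavaScript string and replace them with U+FFFD. The function names (findUnpairedSurrogates, replaceUnpairedSurrogates) are hypothetical, made up for illustration; they do not come from JS-API, WebAssembly, or any library, and replacement is only one possible policy (rejecting the string is another).

```typescript
// Illustrative only: helper names are assumptions, not part of any spec.

/** Returns the indices of unpaired surrogate code units in `s`. */
function findUnpairedSurrogates(s: string): number[] {
  const hits: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const cu = s.charCodeAt(i);
    if (cu >= 0xd800 && cu <= 0xdbff) {
      // High surrogate: well-formed only if a low surrogate follows.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
      } else {
        hits.push(i);
      }
    } else if (cu >= 0xdc00 && cu <= 0xdfff) {
      // Low surrogate that was not consumed as part of a pair.
      hits.push(i);
    }
  }
  return hits;
}

/** Replaces each unpaired surrogate with U+FFFD, leaving valid pairs intact. */
function replaceUnpairedSurrogates(s: string): string {
  const bad = new Set<number>(findUnpairedSurrogates(s));
  let out = "";
  for (let i = 0; i < s.length; i++) {
    out += bad.has(i) ? "\uFFFD" : s.charAt(i);
  }
  return out;
}

// The example string from above: unpaired at both ends, valid pair in the middle.
const sample = "\uD800 ABCDEFG \uD800\uDC00\uD800";
const unpairedAt = findUnpairedSurrogates(sample);   // indices 0 and 12
const cleaned = replaceUnpairedSurrogates(sample);
```

In this sketch the supplementary character U+10000 (the \uD800\uDC00 pair) passes through untouched; only the two isolated \uD800 code units are flagged.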

Addison

Received on Monday, 13 May 2019 23:51:30 UTC