RE: agenda+ Fwd: Re: language for unicode string [I18N-ACTION-800] from Phillips, Addison on 2019-05-13 (public-i18n-core@w3.org from April to June 2019)

From: Phillips, Addison <addison@lab126.com>
Date: Mon, 13 May 2019 23:51:01 +0000
To: "Eric Prud'hommeaux" <eric@w3.org>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "binji@google.com" <binji@google.com>
Message-ID: <ecca62f02672454d9a8fa6032b15a89a@EX13D08UWB002.ant.amazon.com>

> >
> > Good. You might review [3].
> 
> I'd always assumed that was for exotic operations like `lowercase()`, but I see
> a term called Default Normalization Step
> <https://www.w3.org/TR/charmod-norm/#DefaultNormalizationStep>
> which I read as "do nothing". I assume it
> will cause more confusion to mention this than to elide it. Reasonable?

That's reasonable and what I would do. The point in charmod-norm is to have you positively decide not to normalize.
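
To illustrate what deciding not to normalize means in practice, here is a minimal JavaScript sketch. It assumes the "do nothing" reading discussed above; the function names are hypothetical and only the standard `String.prototype.normalize` is used. Two strings that render identically compare as different unless someone normalizes them first.

```javascript
// Two spellings of "é" that look the same when rendered.
const precomposed = "\u00E9"; // one code point: LATIN SMALL LETTER E WITH ACUTE
const decomposed = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT

// Hypothetical helper: compare code unit by code unit, with no normalization.
function matchWithoutNormalization(a, b) {
  return a === b;
}

// Hypothetical helper: compare after normalizing both sides to NFC.
// Shown only for contrast; it is not what "do nothing" implies.
function matchWithNfc(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(matchWithoutNormalization(precomposed, decomposed));
console.log(matchWithNfc(precomposed, decomposed));
```

With no normalization step, the two spellings are distinct names; stating that choice explicitly is the point.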

> 
> 
> > >       * Choosing character encodings: UTF-8. In JS-API, these are
> > >         interpreted as character sequences which have equivalents in
> > >         Javascript's native string format ([5]relevant tests)
> >
> > Do you have a specific pointer? The "hot spot" in here is that Javascript's
> > definition [4] of String is still effectively "UCS-2 friendly". That is, it allows
> > unpaired surrogate code points. These are not valid in UTF-8, although the
> > encoding/decoding of isolated surrogates is straightforward. So some care
> > has to be used here when specifying serialization/deserialization.
> 
> <https://github.com/WebAssembly/spec/blob/master/test/core/names.wast#L1007>
> has scads of stuff outside BMP, e.g. ˺˼𔗏𝅴𝅶𝅸𝅺⁾₎❩❫⟯﴿︶﹚）｠
> 󠀩❳❵⟧⟩⟫⟭⦈⦊⦖⸣⸥︘︸︺︼︾﹀﹂﹄﹈﹜﹞］｝｣󠁝󠁽»’”›❯. (Can I claim kilo-
> scads?)

That isn't my point though. Unicode jargon is exceedingly exacting and I apologize in advance for not adding the necessary clarifiers.

Supplementary characters (that is, those beyond the BMP) are not an issue. However, isolated (that is, *unpaired*) surrogate code units are permitted in JavaScript strings. The question is how to deal with them (not allowing them would be fine by me; for security they are often replaced by U+FFFD). Concretely: are you permitted to have a string like "\uD800 ABCDEFG \uD800\uDC00\uD800" (which starts and ends with an unpaired surrogate, but has a valid surrogate pair in the middle)?
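
To make the example concrete, here is a rough JavaScript sketch. The helper names are hypothetical, and replacement with U+FFFD is just one possible policy, not something the WebAssembly spec is known to require. It locates the unpaired surrogates in that string and shows what replacing them looks like.

```javascript
// The string from the paragraph above: lone surrogate, text, valid pair, lone surrogate.
const sample = "\uD800 ABCDEFG \uD800\uDC00\uD800";

// Hypothetical helper: indices of UTF-16 code units that are unpaired surrogates.
function findLoneSurrogates(str) {
  const lone = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: well-formed only if a low surrogate follows.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
      } else {
        lone.push(i);
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      lone.push(i);
    }
  }
  return lone;
}

// Hypothetical helper: one possible policy, replacing each lone surrogate with U+FFFD.
function replaceLoneSurrogates(str) {
  const lone = new Set(findLoneSurrogates(str));
  let out = "";
  for (let i = 0; i < str.length; i++) {
    out += lone.has(i) ? "\uFFFD" : str[i];
  }
  return out;
}

// Round trip through UTF-8 using the platform encoder and decoder.
function roundTripUtf8(str) {
  return new TextDecoder().decode(new TextEncoder().encode(str));
}

console.log(findLoneSurrogates(sample));
console.log(replaceLoneSurrogates(sample) === roundTripUtf8(sample));
```

The standard `TextEncoder` already applies the replacement policy when it encodes, so a JavaScript string with lone surrogates does not survive a UTF-8 round trip unchanged. Whether the spec should reject such strings, replace the surrogates, or pass them through in some other form is exactly the decision that needs to be written down.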

Addison

Received on Monday, 13 May 2019 23:51:30 UTC