Re: agenda+ Fwd: Re: language for unicode string [I18N-ACTION-800] from Eric Prud'hommeaux on 2019-05-14 (public-i18n-core@w3.org from April to June 2019)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Tue, 14 May 2019 02:07:57 +0200
To: "Phillips, Addison" <addison@lab126.com>
Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "binji@google.com" <binji@google.com>
Message-ID: <20190514000756.GE12530@w3.org>

On Mon, May 13, 2019 at 11:51:01PM +0000, Phillips, Addison wrote:
> > >
> > > Good. You might review [3].
> > 
> > I'd always assumed that was for exotic operations like `lowercase()`, but I see
> > a term called Default Normalization Step
> > <https://www.w3.org/TR/charmod-norm/#DefaultNormalizationStep>
> > which I read as "do nothing". I assume it
> > will cause more confusion to mention this than to elide it. Reasonable?
> 
> That's reasonable and what I would do. The point in charmod-norm is to have you positively decide not to normalize.
> 
> > 
> > 
> > > >       * Choosing character encodings: UTF-8. In JS-API, these are
> > > >         interpreted as character sequences which have equivalents in
> > > >         Javascript's native string format ([5]relevant tests)
> > >
> > > Do you have a specific pointer? The "hot spot" in here is that JavaScript's
> > > definition [4] of String is still effectively "UCS-2 friendly". That is, it allows
> > > unpaired surrogate code points. These are not valid in UTF-8, although the
> > > encoding/decoding of isolated surrogates is straightforward. So some care
> > > has to be used here when specifying serialization/deserialization.
> > 
> > <https://github.com/WebAssembly/spec/blob/master/test/core/names.wast#L1007>
> > has scads of stuff outside the BMP, e.g. ˺˼𔗏𝅴𝅶𝅸𝅺⁾₎❩❫⟯﴿︶﹚）｠
> > 󠀩❳❵⟧⟩⟫⟭⦈⦊⦖⸣⸥︘︸︺︼︾﹀﹂﹄﹈﹜﹞］｝｣󠁝󠁽»’”›❯. (Can I claim kilo-
> > scads?)
> 
> That isn't my point though. Unicode jargon is exceedingly exacting and I apologize in advance for not adding the necessary clarifiers.
> 
> Supplementary characters (that is, those beyond the BMP) are not an issue. However, isolated (that is, *unpaired*) surrogate code units are permitted in JavaScript strings. The question is how to deal with them (not allowing them would be fine by me--for security they are often replaced by U+FFFD). So the question is whether you're permitted to have a string like "\uD800 ABCDEFG \uD800\uDC00\uD800" (which starts and ends with an unpaired surrogate, but has a valid surrogate pair in the middle).

I'd say that
[[
Names are sequences of characters, which are scalar values as defined by Unicode (Section 2.4).
]]
says no, since scalar values exclude surrogates, but I can't lay my hands on tests to make sure implementations barf on it. (Part of the problem is that the WASM tests' input conditions are synthesized in a browser, so it may be difficult to create such a string on some platforms.)
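
For illustration only, here is a minimal TypeScript sketch of what checking Addison's example string against the "scalar values" requirement could look like. It is not taken from the WebAssembly spec or its test suite; the helper names `unpairedSurrogates` and `replaceUnpaired` are hypothetical, and replacing with U+FFFD is just the common practice Addison mentions, not something the spec is known to require.

```typescript
// Hypothetical helper (not from any spec): returns the indices of unpaired
// surrogate code units in a JavaScript (UTF-16) string.
function unpairedSurrogates(s: string): number[] {
  const bad: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const cu = s.charCodeAt(i);
    if (cu >= 0xd800 && cu <= 0xdbff) {
      // High surrogate: well-formed only if a low surrogate follows.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
      } else {
        bad.push(i);
      }
    } else if (cu >= 0xdc00 && cu <= 0xdfff) {
      // Low surrogate that was not consumed as part of a pair.
      bad.push(i);
    }
  }
  return bad;
}

// Hypothetical helper: replace each unpaired surrogate with U+FFFD so the
// result is a sequence of Unicode scalar values (assumed policy, see above).
function replaceUnpaired(s: string): string {
  const bad = unpairedSurrogates(s);
  let out = "";
  for (let i = 0; i < s.length; i++) {
    out += bad.indexOf(i) !== -1 ? "\uFFFD" : s[i];
  }
  return out;
}

// The string from the quoted message: unpaired surrogates at both ends,
// one valid surrogate pair (U+10000) in the middle.
const example = "\uD800 ABCDEFG \uD800\uDC00\uD800";
const badIndices = unpairedSurrogates(example);
const cleaned = replaceUnpaired(example);
```

Under this sketch the example string is flagged at its first and last code units, while the valid pair in the middle is left alone. A name that must be a sequence of scalar values would then either be rejected or be passed through something like `replaceUnpaired` before UTF-8 encoding.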


> Addison
>

Received on Tuesday, 14 May 2019 00:08:03 UTC