Re: agenda+ Fwd: Re: language for unicode string [I18N-ACTION-800] from Eric Prud'hommeaux on 2019-05-13 (public-i18n-core@w3.org from April to June 2019)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Tue, 14 May 2019 01:38:23 +0200
To: "Phillips, Addison" <addison@lab126.com>
Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, binji@google.com
Message-ID: <20190513233821.GD12530@w3.org>
On Mon, May 13, 2019 at 06:42:44PM +0000, Phillips, Addison wrote:
> Hello Eric,
> 
> Thanks for the note below. I have been actioned by the I18N WG with responding. I have included my actual response inter-linearly below my .sig. References here:
> 
> [1] https://www.w3.org/TR/WebIDL-1/#idl-USVString
> [2] https://www.w3.org/TR/string-meta
> [3] https://www.w3.org/TR/charmod-norm
> [4] https://tc39.github.io/ecma262/#sec-ecmascript-language-types-string-type
> 
> Let me know if you want to discuss in more detail. 
> 
> Thanks,
> 
> Addison
> 
> Addison Phillips
> Sr. Principal SDE – I18N (Amazon)
> Chair (W3C I18N WG)
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 
> 
> > Subject: Re: language for unicode string
> > Resent-Date: Fri, 19 Apr 2019 16:28:56 +0000
> > 
> > On Tue, Apr 16, 2019 at 07:17:19PM +0200, Eric Prud'hommeaux wrote:
> > > WebAssembly is basically a VM spec. All communication happens through
> > > Javascript (at least, that's all we're standardizing). Javascript
> > > invokes WebAssembly functions via a symbol table which maps a UTF-8
> > > string to an address. These strings have no interpretation beyond a
> > > sequence of Unicode scalar values. For instance, there's no Unicode
> > > Normalization, no parsing as case-foldable domain names, etc. Is there
> > > a state-approved way to say that?
> 
> That's pretty clear. WebIDL defines "USVString" and I think that's probably what you mean? See [1]. You might mean DOMString instead. The basic problem here is whether you want to be close to JavaScript's historic use of 16-bit code unit strings (with no Unicode interpretation, e.g. isolated surrogate code points are fine) vs. more modern handling (where isolated surrogates are an error and, indeed, encoding surrogate code points is an error--a surrogate pair should be encoded in UTF-8 as a single 4-byte sequence).

documenting a bit for CR review:
[[
All the stringy bits in Core are defined as Unicode Scalar Values, e.g. <https://webassembly.github.io/spec/core/bikeshed/#names> (which unescapes \X to Unicode and UTF-8-encodes it). The JS-API spec, which specifies WASM's use in a browser environment, includes IDL with USVString. [IMO, this is ideal.]

The only way strings are used as symbols is in JS-API, which only sees IMPORT names, EXPORT names, and custom section names (metadata-y blocks with no semantics), in particular the NAMES section (itself a custom section).

The text encoding of a WASM module includes strings <https://webassembly.github.io/spec/core/bikeshed/#strings> which can include character escapes. Those are turned into Unicode characters and then encoded in UTF-8. The only visibility of this is in dev tools.
]]
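
For CR reviewers who want something concrete, here is a rough Python sketch of the conversion that USVString implies when a JavaScript string (a sequence of 16-bit code units) is handed across the boundary: well-formed surrogate pairs are combined into one scalar value, and unpaired surrogates are replaced with U+FFFD. The function name `to_scalar_values` is made up for illustration and is not taken from any of the specs; it is a sketch of the idea, not the normative algorithm text.

```python
def to_scalar_values(code_units):
    """Turn a list of 16-bit code units into a str of Unicode scalar values.

    Hypothetical helper: unpaired surrogates become U+FFFD, which is my
    reading of what USVString conversion does.
    """
    out = []
    i = 0
    n = len(code_units)
    while i < n:
        c = code_units[i]
        is_lead = 0xD800 <= c <= 0xDBFF
        if is_lead and i + 1 < n and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
            # Well-formed pair: combine into one supplementary scalar value.
            d = code_units[i + 1]
            out.append(chr(0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00)))
            i += 2
        elif 0xD800 <= c <= 0xDFFF:
            # Unpaired surrogate: not a scalar value, so substitute U+FFFD.
            out.append("\ufffd")
            i += 1
        else:
            out.append(chr(c))
            i += 1
    return "".join(out)


# The result contains only scalar values, so it can always be UTF-8 encoded.
name = to_scalar_values([0x61, 0xD800, 0x62])
encoded = name.encode("utf-8")
```

After this step every string is encodable as UTF-8, which is why the Core spec can treat names purely as sequences of scalar values.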

> 
> > >
> > > Because it's a VM, it may be called upon to manipulate e.g. human
> > > names, currency. In short, the subject matter may entail i18n
> > > requirements but that WebAssembly doesn't know anything about the
> > > subject matter and imposes no i18n requirements on it. My expectation
> > > is that it would be more confusing to mention that fact than to simply
> > > leave it out. Thoughts?
> 
> That's probably a good idea in most cases. I would probably place a health warning though: there is no expectation that the WASM strings will be displayed in any particular way and specifically they will not have language-sensitive rendering. Care needs to be used to avoid problems with displaying bidirectional text to avoid "spill-over" effects and to provide base direction. In most cases, since these strings are never displayed, this probably isn't an issue. See [2] for illustration of what we mean.
> 
> >     The only section of I18N techniques that applies to WASM is the section
> >     on Characters (which applies to JS-API's use of code points in symbol names):
> >       * Defining a Reference Processing Model: WASM uses exact string
> >         comparison at the codepoint level, with no normalization.
> 
> Good. You might review [3].

I'd always assumed that was for exotic operations like `lowercase()`, but I see a term called Default Normalization Step <https://www.w3.org/TR/charmod-norm/#DefaultNormalizationStep> which I read as "do nothing". I assume it will cause more confusion to mention this than to elide it. Reasonable?
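
To make "exact comparison at the code point level, with no normalization" concrete, a small Python illustration (the helper name `names_match` is invented here, it is not a WASM or JS-API function): two canonically equivalent spellings of "é" are different names unless someone normalizes them first, which WASM does not do.

```python
import unicodedata


def names_match(a: str, b: str) -> bool:
    """Hypothetical stand-in for symbol lookup: plain code point equality."""
    return a == b


composed = "\u00e9"     # "é" as a single code point (NFC form)
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT (NFD form)

# No normalization: the two spellings are distinct symbol names.
distinct = not names_match(composed, decomposed)

# Only after an explicit normalization step do they compare equal.
equal_after_nfc = names_match(
    unicodedata.normalize("NFC", composed),
    unicodedata.normalize("NFC", decomposed),
)
```

The same holds at the byte level, since the UTF-8 encodings of the two spellings differ as well.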


> >       * Choosing character encodings: UTF-8. In JS-API, these are
> >         interpreted as character sequences which have equivalents in
> >         Javascript's native string format ([5]relevant tests)
> 
> Do you have a specific pointer? The "hot spot" in here is that Javascript's definition [4] of String is still effectively "UCS-2 friendly". That is, it allows unpaired surrogate code points. These are not valid in UTF-8, although the encoding/decoding of isolated surrogates is straightforward. So some care has to be used here when specifying serialization/deserialization.

<https://github.com/WebAssembly/spec/blob/master/test/core/names.wast#L1007> has scads of stuff outside BMP, e.g ˺˼𔗏𝅴𝅶𝅸𝅺⁾₎❩❫⟯﴿︶﹚）｠󠀩❳❵⟧⟩⟫⟭⦈⦊⦖⸣⸥︘︸︺︼︾﹀﹂﹄﹈﹜﹞］｝｣󠁝󠁽»’”›❯. (Can I claim kilo-scads?)
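
On the serialization point, a quick Python sketch of why the care is needed. This is an illustration with an invented helper name, `encode_name`, and is not what any WASM implementation necessarily does: a strict UTF-8 encoder rejects an isolated surrogate, the "straightforward" encoding of one yields bytes that a strict decoder refuses, and a character outside the BMP encodes as one 4-byte sequence.

```python
def encode_name(name: str) -> bytes:
    """Hypothetical strict serializer: raises on isolated surrogates."""
    return name.encode("utf-8")


# A supplementary-plane character is a single 4-byte UTF-8 sequence.
four_bytes = encode_name("\U0001D174")

# An isolated surrogate cannot be encoded by a strict UTF-8 encoder.
try:
    encode_name("\ud800")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True

# Encoding it anyway (Python's "surrogatepass" handler) is mechanically easy,
# but the resulting bytes are not valid UTF-8 and fail strict decoding.
lenient_bytes = "\ud800".encode("utf-8", "surrogatepass")
try:
    lenient_bytes.decode("utf-8")
    round_trips = True
except UnicodeDecodeError:
    round_trips = False
```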
Received on Monday, 13 May 2019 23:38:30 UTC