RE: agenda+ Fwd: Re: language for unicode string [I18N-ACTION-800]

Hello Eric,

Thanks for the note below. The I18N WG has actioned me to respond. My actual response is interleaved with your text below my .sig. References here:

[1] https://www.w3.org/TR/WebIDL-1/#idl-USVString

[2] https://www.w3.org/TR/string-meta 
[3] https://www.w3.org/TR/charmod-norm 
[4] https://tc39.github.io/ecma262/#sec-ecmascript-language-types-string-type 

Let me know if you want to discuss in more detail. 

Thanks,

Addison

Addison Phillips
Sr. Principal SDE – I18N (Amazon)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.



> Subject: Re: language for unicode string
> Resent-Date: Fri, 19 Apr 2019 16:28:56 +0000
> 
> On Tue, Apr 16, 2019 at 07:17:19PM +0200, Eric Prud'hommeaux wrote:
> > WebAssembly is basically a VM spec. All communication happens through
> > Javascript (at least, that's all we're standardizing). Javascript
> > invokes WebAssembly functions via a symbol table which maps a UTF-8
> > string to an address. These strings have no interpretation beyond a
> > sequence of Unicode scalar values. For instance, there's no Unicode
> > Normalization, no parsing as case-foldable domain names, etc. Is there
> > a state-approved way to say that?

That's pretty clear. WebIDL defines "USVString" [1], and I think that's probably what you mean, although you might mean DOMString instead. The basic question is whether you want to stay close to JavaScript's historic use of strings of 16-bit code units (with no Unicode interpretation, so isolated surrogates are fine) or want more modern handling, where isolated surrogates are an error. In UTF-8, encoding a surrogate code point is itself an error: a surrogate pair should be encoded as the single 4-byte sequence for the supplementary code point it represents.
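
To illustrate the difference, here is a minimal Python sketch of the kind of conversion WebIDL describes for USVString: paired surrogates become one supplementary code point and unpaired surrogates are replaced with U+FFFD. The function name `to_scalar_values` and the list-of-integers model of a 16-bit string are my assumptions for illustration, not anything taken from the WebAssembly or WebIDL text.

```python
def to_scalar_values(code_units):
    """Turn a list of 16-bit code units into a str of Unicode scalar values.

    Hypothetical helper: unpaired surrogates are replaced with U+FFFD.
    """
    out = []
    i = 0
    n = len(code_units)
    while i < n:
        cu = code_units[i]
        is_high = 0xD800 <= cu <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
            # Valid surrogate pair: combine into one supplementary code point.
            low = code_units[i + 1]
            out.append(chr(0x10000 + ((cu - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        elif 0xD800 <= cu <= 0xDFFF:
            # Isolated surrogate: not a scalar value, so substitute U+FFFD.
            out.append("\ufffd")
            i += 1
        else:
            out.append(chr(cu))
            i += 1
    return "".join(out)


paired = to_scalar_values([0x61, 0xD83D, 0xDE00])  # "a" + U+1F600
unpaired = to_scalar_values([0xD800, 0x62])        # U+FFFD + "b"
```

With the pair combined, `paired[1:].encode("utf-8")` is a single 4-byte sequence rather than two 3-byte ones.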

> >
> > Because it's a VM, it may be called upon to manipulate e.g. human
> > names, currency. In short, the subject matter may entail i18n
> > requirements but that WebAssembly doesn't know anything about the
> > subject matter and imposes no i18n requirements on it. My expectation
> > is that it would be more confusing to mention that fact than to simply
> > leave it out. Thoughts?

That's probably a good idea in most cases. I would probably add a health warning, though: there is no expectation that WASM strings will be displayed in any particular way, and specifically they will not get language-sensitive rendering. If they are displayed, care is needed with bidirectional text: provide a base direction and avoid "spill-over" effects. In most cases, since these strings are never displayed, this probably isn't an issue. See [2] for an illustration of what we mean.
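
For the rare case where such a string does get shown to a user, a rough Python sketch of the two precautions might look like the following. The helper names `first_strong_direction` and `isolate` are hypothetical, and the first-strong heuristic is a simplification of the Unicode Bidirectional Algorithm, not a full implementation.

```python
import unicodedata

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def first_strong_direction(text):
    """Estimate a base direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "auto"  # no strong character found; let the host decide


def isolate(text):
    """Wrap text in bidi isolates so it cannot reorder the text around it."""
    return FSI + text + PDI


# A symbol name starting with digits followed by Hebrew letters.
label = "123 \u05d0\u05d1"
message = "export " + isolate(label) + " not found"
```

Isolating the inserted string is what limits spill-over into the surrounding message; the estimated direction is only a fallback when no metadata is available.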

>     The only section of I18N techniques that applies to WASM is the section
>     on Characters (which apply to JSAPI's use of codepoints symbols names):
>       * Defining a Reference Processing Model: WASM uses exact string
>         comparison at the codepoint level, with no normalization.

Good. You might review [3].
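
As a small illustration of what "no normalization" means in practice, here is a Python sketch (the variable names are mine, chosen for the example): two names that render identically but differ in code points do not match under exact comparison.

```python
import unicodedata

composed = "caf\u00e9"     # é as a single code point, U+00E9
decomposed = "cafe\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

# Exact code point comparison, as described above: these are different names.
exact_match = composed == decomposed

# Only an explicit normalization step would make them compare equal.
nfc_match = unicodedata.normalize("NFC", decomposed) == composed
```

Here `exact_match` is false and `nfc_match` is true, so anyone generating symbol names needs to be consistent about the form they emit.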

>       * Choosing character encodings: UTF-8. In JS-API, these are
>         interpreted as character sequences which have equivalents in
>         Javascript's native string format ([5]relevant tests)

Do you have a specific pointer? The "hot spot" here is that JavaScript's definition [4] of String is still effectively "UCS-2 friendly". That is, it allows unpaired surrogate code points. These are not valid in UTF-8, although encoding and decoding an isolated surrogate is mechanically straightforward. So some care is needed when specifying serialization/deserialization.
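
A quick Python sketch of why this matters (Python's codecs stand in for whatever encoder an implementation actually uses; the helper names are hypothetical): a strict UTF-8 encoder rejects an isolated surrogate, a lenient one happily produces three bytes, and a strict decoder then rejects those bytes.

```python
lone = "\ud800"  # an isolated high surrogate


def encodes_as_strict_utf8(text):
    """Return True if text can be encoded as well-formed UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_as_strict_utf8(data):
    """Return True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


strict_ok = encodes_as_strict_utf8(lone)          # False: surrogates are not allowed

# "surrogatepass" is Python's opt-in for the lenient, ill-formed encoding.
lenient_bytes = lone.encode("utf-8", "surrogatepass")
round_trip_ok = decodes_as_strict_utf8(lenient_bytes)  # False: ill-formed input
```

A spec that serializes JavaScript strings to UTF-8 has to say which of these behaviors applies: reject, replace, or pass through.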

Received on Monday, 13 May 2019 18:43:12 UTC