Re: language for unicode string from Eric Prud'hommeaux on 2019-04-19 (public-i18n@w3.org from April 2019)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 19 Apr 2019 18:28:50 +0200
To: public-i18n@w3.org
Message-ID: <20190419162849.GJ6515@w3.org>
On Tue, Apr 16, 2019 at 07:17:19PM +0200, Eric Prud'hommeaux wrote:
> WebAssembly is basically a VM spec. All communication happens through
> Javascript (at least, that's all we're standardizing). Javascript
> invokes WebAssembly functions via a symbol table which maps a UTF-8
> string to an address. These strings have no interpretation beyond a
> sequence of Unicode scalar values. For instance, there's no Unicode
> Normalization, no parsing as case-foldable domain names, etc. Is there
> a state-approved way to say that?
> 
> Because it's a VM, it may be called upon to manipulate e.g. human
> names, currency. In short, the subject matter may entail i18n
> requirements but that WebAssembly doesn't know anything about the
> subject matter and imposes no i18n requirements on it. My expectation
> is that it would be more confusing to mention that fact than to simply
> leave it out. Thoughts?
> 
> If EcmaScript had sections for I18N and Security Considerations, I
> could just copy them. Can anyone think of something else I could copy
> from?

In case it helps, here are the answers to the Internationalization
techniques[4]:

     * places where characters are used in WASM are specifically not
       natural language:
          + symbol imports
          + symbol exports
          + name section (mapping from index to symbol)
     * all of these allow all legal UTF-8, including U+0 (surrogate code
       points, U+D800 to U+DFFF, are specifically not allowed)
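
   To make the "all legal UTF-8" rule concrete, here is a minimal sketch
   in Python. The function name is hypothetical and this is not how any
   WASM engine implements the check. It only relies on Python's strict
   UTF-8 decoder, which accepts U+0 and private use code points and
   rejects encoded surrogates and overlong forms.

```python
def is_valid_wasm_name(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8 (hypothetical helper).

    Python's strict "utf-8" codec rejects encoded surrogates
    (U+D800..U+DFFF) and overlong encodings, and accepts U+0000.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+0000 inside a name is fine.
print(is_valid_wasm_name(b"a\x00b"))
# A private use code point (U+E000) is fine.
print(is_valid_wasm_name("\ue000".encode("utf-8")))
# The three-byte encoding of the surrogate U+D800 is not well-formed UTF-8.
print(is_valid_wasm_name(b"\xed\xa0\x80"))
```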

   The only section of I18N techniques that applies to WASM is the section
   on Characters (which applies to the JS API's use of code points in
   symbol names):
     * Defining a Reference Processing Model: WASM uses exact string
       comparison at the codepoint level, with no normalization.
     * Including and excluding character ranges: no excluded character
       ranges
     * Using the Private Use Area: WASM symbols may use private use areas.
     * Choosing character encodings: UTF-8. In JS-API, these are
       interpreted as character sequences which have equivalents in
       Javascript's native string format ([5]relevant tests)
     * Identifying character encodings: only one is allowed.
     * Designing character escapes: the WASM text format includes escapes
       necessary to be unambiguous in that grammar.
     * Storing text: no text is stored except as symbols or as the WASM
       text format.
     * Specifying sort and search functionality: no search or sort
     * Converting to a Common Unicode Form: no normalization
     * Handling Case Folding: no case folding
     * Defining 'string': no strings, just length-delimited codepoint
       sequences (U+0 is permitted)
     * Indexing strings: no strings
     * Referring to Unicode characters: no references
     * Referencing the Unicode Standard: follows
       [6]https://www.w3.org/TR/charmod/#sec-RefUnicode
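
   The answers above on comparison, normalization, case folding and
   'string' can be sketched in Python as follows. All function names are
   hypothetical. The length prefix is an assumption on my part about how
   "length-delimited" is realized: I am assuming an unsigned LEB128 byte
   count followed by the UTF-8 bytes. Treat it as an illustration, not as
   the normative encoding.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    # Python str equality compares code point sequences exactly:
    # no normalization and no case folding are applied.
    return a == b


def encode_u32_leb128(n: int) -> bytes:
    # Unsigned LEB128: 7 bits per byte, high bit set on all but the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_name(name: str) -> bytes:
    # Assumed layout: byte length, then the UTF-8 bytes. Nothing is
    # terminated by U+0000, so U+0000 may appear inside a name.
    payload = name.encode("utf-8")
    return encode_u32_leb128(len(payload)) + payload


composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Canonically equivalent, but distinct as symbol names.
print(same_name(composed, decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
# Case differences are also distinct.
print(same_name("Foo", "foo"))
print(encode_name("a\x00b"))
```

   The point of the NFC line is only to show what WASM does not do: two
   names that a normalizing comparison would treat as equal remain two
   different symbols.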


References

   Visible links:
   4. https://www.w3.org/International/techniques/developing-specs?collapse
   5. https://github.com/WebAssembly/spec/blob/master/test/core/names.wast
   6. https://www.w3.org/TR/charmod/#sec-RefUnicode
-- 
-eric

office: +1.617.258.5741 32-G528, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Friday, 19 April 2019 16:28:55 UTC