- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Tue, 14 May 2019 02:07:57 +0200
- To: "Phillips, Addison" <addison@lab126.com>
- Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "binji@google.com" <binji@google.com>
On Mon, May 13, 2019 at 11:51:01PM +0000, Phillips, Addison wrote:
> > > Good. You might review [3].
> >
> > I'd always assumed that was for exotic operations like `lowercase()`, but I see
> > a term called Default Normalization Step
> > <https://www.w3.org/TR/charmod-norm/#DefaultNormalizationStep>
> > which I read as "do nothing". I assume it
> > will cause more confusion to mention this than to elide it. Reasonable?
>
> That's reasonable and what I would do. The point in charmod-norm is to have you positively decide not to normalize.
>
> > > > * Choosing character encodings: UTF-8. In JS-API, these are
> > > > interpreted as character sequences which have equivalents in
> > > > Javascript's native string format ([5]relevant tests)
> > >
> > > Do you have a specific pointer. The "hot spot" in here is that Javascript's
> > > definition [4] of String is still effectively "UCS-2 friendly". That is, it allows
> > > unpaired surrogate code points. These are not valid in UTF-8, although the
> > > encoding/decoding of isolated surrogates is straightforward. So some care
> > > has to be used here when specifying serialization/deserialization.
> >
> > <https://github.com/WebAssembly/spec/blob/master/test/core/names.wast#L1007>
> > has scads of stuff outside BMP, e.g ˺˼𔗏⁾₎❩❫⟯﴿︶﹚)⦆
> > ❳❵⟧⟩⟫⟭⦈⦊⦖⸣⸥︘︸︺︼︾﹀﹂﹄﹈﹜﹞]}」»’”›❯. (Can I claim kilo-scads?)
>
> That isn't my point though. Unicode jargon is exceedingly exacting and I apologize in advance for not adding the necessary clarifiers.
>
> Supplementary characters (that is, those beyond the BMP) are not an issue. However, isolated (that is, *unpaired*) surrogate code units are permitted in JavaScript strings. The question is how to deal with them (not allowing them would be fine by me--for security they are often replaced by U+FFFD).

So the question is whether you're permitted to have a string like "\uD800 ABCDEFG \uD800\uDC00\uD800" (which starts and ends with an unpaired surrogate, but has a valid surrogate pair in the middle). I'd say that
[[
Names are sequences of characters, which are scalar values as defined by Unicode (Section 2.4).
]]
says no, but I can't lay my hands on tests to make sure implementations barf on it. (Part of the problem is that WASM test input conditions are synthesized in a browser, so it may be difficult to create such a string on some platforms.)

> Addison
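
To make the example string above concrete, here is a minimal TypeScript sketch of how one might detect unpaired surrogates in a JavaScript string and replace them with U+FFFD, as mentioned in the quoted text. The helper names findUnpairedSurrogates and replaceUnpairedSurrogates are hypothetical, invented for illustration; they are not taken from the WebAssembly spec, its JS-API, or its test suite, and this is not a claim about what any implementation actually does with such names.

```typescript
// Hypothetical helper (an assumption for illustration, not part of any spec):
// returns the indices of UTF-16 code units that are unpaired surrogates.
function findUnpairedSurrogates(s: string): number[] {
  const unpaired: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh) {
      // A high surrogate is only valid when a low surrogate follows it.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair: skip over the low surrogate
      } else {
        unpaired.push(i);
      }
    } else if (isLow) {
      // A low surrogate reached here has no preceding high surrogate.
      unpaired.push(i);
    }
  }
  return unpaired;
}

// Hypothetical helper: replaces each unpaired surrogate with U+FFFD,
// leaving valid surrogate pairs (supplementary characters) untouched.
function replaceUnpairedSurrogates(s: string): string {
  const bad = new Set(findUnpairedSurrogates(s));
  let out = "";
  for (let i = 0; i < s.length; i++) {
    out += bad.has(i) ? "\uFFFD" : s.charAt(i);
  }
  return out;
}

// The string from the message: unpaired at both ends, valid pair in the middle.
const sample = "\uD800 ABCDEFG \uD800\uDC00\uD800";
const positions = findUnpairedSurrogates(sample); // [0, 12]
const cleaned = replaceUnpairedSurrogates(sample); // "\uFFFD ABCDEFG \uD800\uDC00\uFFFD"
```

A check like this only says whether a string is a sequence of Unicode scalar values; whether to reject such a name or substitute U+FFFD is the policy question raised in the thread.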
Received on Tuesday, 14 May 2019 00:08:03 UTC