[presentation-api] Possibility for a character to be interpreted differently depending on locale from François Daoust via GitHub on 2015-11-06 (public-secondscreen@w3.org from November 2015)

From: François Daoust via GitHub <sysbot+gh@w3.org>
Date: Fri, 06 Nov 2015 15:24:13 +0000
To: public-secondscreen@w3.org
Message-ID: <issues.opened-115525100-1446823452-sysbot+gh@w3.org>
tidoust has just created a new issue for 
https://github.com/w3c/presentation-api:

== Possibility for a character to be interpreted differently depending
 on locale ==
Hi all,

[I'm raising this as an issue for tracking purposes, but I do not think
that there is any issue in the end, so I actually suggest closing it
once people have reviewed it, unless someone points out that I missed
something, of course!].

The group discussed potential internationalization issues during its 
F2F last week [1]. One particular issue that was raised was the 
possibility for a given character in a JS string to be interpreted
differently by the two browsing contexts, meaning it would be
represented by different glyphs with different meanings in a Japanese,
Chinese and/or Korean environment (e.g. a glyph meaning "Yen" in
Japanese and a different currency in Chinese).

I said that I believed this was impossible in practice, and took an 
action "to investigate the possibility for a JS string to be rendered 
differently by different glyphs and locales".

I had a quick chat with @r12a (Richard Ishida), internationalization
activity lead at W3C, and can confirm that, unless I missed something
in the scenario presented below, such a problem should never happen.
The Presentation API operates on JavaScript types, which are not
affected by the character encoding used to retrieve the content.

The hypothetical scenario where a problem could have happened was 
something like:

1. the app running on the controlling browsing context is served 
encoded in Shift_JIS (usual encoding for Japanese characters);
2. the app running on the receiving browsing context is served
encoded in Big5 (usual encoding for Chinese characters);
3. the app on the controlling browsing context extracts a string from
a DOM element;
4. the app on the controlling browsing context sends that string over 
to the receiving browsing context using the Presentation API's "send" 
method;
5. the app on the receiving browsing context sets the received value 
to a DOM element;
6. the characters rendered on the receiving browsing context for that 
DOM element mean something different.

In particular, regardless of the encoding used to serve a page/app,
extracting a string from a DOM element returns a DOMString [2], which
is a UTF-16 encoded serialization of the underlying sequence of
Unicode characters (in an ideal world, this would return the sequence
of Unicode code points directly, but JavaScript strings are made of
16-bit code units, so characters outside the Basic Multilingual Plane
are represented as a surrogate pair of two 16-bit code units).
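
To make the point about 16-bit code units concrete, here is a minimal sketch in plain JavaScript. The helper name codeUnits and the two sample characters are chosen for illustration only and do not come from the discussion.

```javascript
// Hypothetical helper: list the 16-bit code units that make up a JS string.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i));
  }
  return units;
}

// A character inside the Basic Multilingual Plane: one code unit.
const bmpChar = "\u30A1"; // KATAKANA LETTER SMALL A

// A character outside the BMP (U+20BB7): one code point, two code units.
const astralChar = "\u{20BB7}";

console.log(codeUnits(bmpChar).map(u => u.toString(16)));    // [ '30a1' ]
console.log(codeUnits(astralChar).map(u => u.toString(16))); // [ 'd842', 'dfb7' ]
console.log(astralChar.codePointAt(0).toString(16));         // 20bb7
```

In both cases the values depend only on the Unicode characters in the string, not on how the page that produced it was encoded.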

For instance, the Unicode code point of the Japanese "katakana letter
small A" is U+30A1, so if a DOM element contains such a letter,
extracting it will yield a sequence with one integer 0x30A1, even if
the document that was used to produce this element was encoded in
Shift_JIS, where this character is represented as the two bytes
0x83 0x40.
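
The following is a rough sketch of that decoding step, using the TextDecoder interface from the Encoding Standard. It assumes that the Shift_JIS byte sequence for U+30A1 is 0x83 0x40, and that the runtime ships the legacy "shift_jis" decoder (browsers do, some minimal builds of other runtimes may not, hence the try/catch). Variable names are illustrative.

```javascript
// Assumed Shift_JIS byte sequence for KATAKANA LETTER SMALL A (U+30A1).
const shiftJisBytes = new Uint8Array([0x83, 0x40]);

let decoded = null;
try {
  // Decoding turns encoding-specific bytes into a string of UTF-16 code units.
  decoded = new TextDecoder("shift_jis").decode(shiftJisBytes);
} catch (e) {
  // The runtime was built without legacy encodings; nothing to compare.
  decoded = null;
}

// The same character written directly as a Unicode escape.
const fromEscape = "\u30A1";

if (decoded !== null) {
  // Once decoded, the original page encoding is no longer visible.
  console.log(decoded === fromEscape);
}
console.log(fromEscape.charCodeAt(0).toString(16));
```

Once the bytes have been decoded, nothing in the resulting string records which encoding the page was served in.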

From the perspective of the Presentation API, the communication
channel sends a DOMString to the other end point. The actual bytes
sent over the channel depend on the transmission protocol: WebSocket
will typically turn the DOMString into a sequence of Unicode
characters (thus creating what WebIDL calls a USVString) and encode
the result using UTF-8 for transmission, while other protocols could
behave differently, e.g. send Unicode code points as 32-bit values.
What is important is that the receiving end point will eventually see
a DOMString, again to be interpreted as a UTF-16 encoded serialization
of a sequence of Unicode characters, independently of the character
encoding that was used to load the HTML content.
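
As an illustration only, the sketch below simulates the two wire formats mentioned above without any real Presentation API or WebSocket connection. The sample payload and variable names are hypothetical. It relies on the standard TextEncoder/TextDecoder, codePointAt and String.fromCodePoint.

```javascript
// Hypothetical payload mixing BMP characters and one non-BMP character.
const original = "\u30A1 \u00A5 \u{20BB7}";

// Option A: UTF-8 bytes on the wire, as WebSocket text messages use.
const utf8Bytes = new TextEncoder().encode(original);
const viaUtf8 = new TextDecoder("utf-8").decode(utf8Bytes);

// Option B: one 32-bit value per Unicode code point.
// Array.from iterates a string by code point, not by code unit.
const codePoints = Uint32Array.from(Array.from(original, ch => ch.codePointAt(0)));
const viaCodePoints = String.fromCodePoint(...codePoints);

// Both paths hand the receiver a string identical to the one that was sent.
console.log(viaUtf8 === original, viaCodePoints === original);
```

The bytes on the wire differ between the two options (11 bytes of UTF-8 versus five 32-bit values here), but the string seen by the receiving side is the same in both cases.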

What may of course happen in the katakana example is that the Chinese
font used on the receiving browsing context does not contain the right
glyph to represent a katakana letter small A. The character would be
rendered as an unknown one in that case (perhaps as a question mark or
a square). This should never produce another character with a
different meaning though!

[1] http://www.w3.org/2015/10/29-webscreens-minutes.html#item07
[2] http://heycam.github.io/webidl/#idl-DOMString



See https://github.com/w3c/presentation-api/issues/218
Received on Friday, 6 November 2015 15:24:18 UTC