Re: String to ArrayBuffer

On 1/11/2012 4:22 PM, Boris Zbarsky wrote:
> On 1/11/12 6:03 PM, Charles Pritchard wrote:
>> Is there any instance in practice where DOMString as exposed to the
>> scripting environment is not implemented as a unicode string?
> I don't know what you mean by that.
> The point is, it's trivial to construct JS strings that contain 
> arbitrary sequences of 16-bit units (using fromCharCode or \u 
> escapes).  Nothing anywhere in JS or the DOM per se enforces that 
> strings are valid UTF-16 (which is the way that an actual Unicode 
> string would be encoded as a JS string).

My [wrong] understanding was that DOMString referred to valid Unicode.

"The DOMString type corresponds to the set of all possible sequences of 
16 bit unsigned integer code units. Such sequences are commonly 
interpreted as UTF-16 encoded strings [RFC2781] although this is not 
required... Nothing in this specification requires a DOMString value to 
be a valid UTF-16 string."

"The DOMString type is used to store [Unicode] characters as a sequence 
of 16-bit units using UTF-16 as defined in [Unicode] and Amendment 1 of 
[ISO/IEC 10646]." There are some normalization notes, but otherwise, 
it's close enough to saying it stores Unicode, but it can handle all 
16bit combinations.
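
For example, it's easy to build a string that every spec above accepts 
but that isn't valid Unicode text (any engine should behave this way):

// A lone surrogate is a perfectly legal String value, but it is
// not a valid UTF-16 (and hence not a valid Unicode) string.
var s = String.fromCharCode(0xD800); // unpaired high surrogate
s.length;         // 1 -- one 16-bit code unit
s.charCodeAt(0);  // 0xD800 (55296)
var t = '\uD800'; // same value via an escape
s === t;          // true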

For "historic reasons" WindowBase64 throws an error if input is not 
within Unicode range.
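
That's easy to check in a console:

// btoa() only accepts strings whose code units fit in a single byte.
btoa(String.fromCharCode(0xFF));    // '/w==' -- in range, works
try {
  btoa(String.fromCharCode(0x100)); // U+0100 is out of range
} catch (e) {
  //, e.g. 'InvalidCharacterError' (INVALID_CHARACTER_ERR)
}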

>> I realize that internally, DOMString may be implemented as a 16 bit
>> integer + length;
> Not just internally.  The JS spec and the DOM spec both explicitly say 
> that this is what strings are: an array of 16-bit integers.

WebIDL and DOM define "DOMString", of course. JS defines "The String 
Type" in 8.4. They are intended to be the same.

"The  String type is the set of all finite ordered sequences of zero or 
more 16-bit unsigned integer values .... When a String contains actual 
textual data, each element is considered to be a single UTF-16 code 
unit.  Whether or not this is the actual storage format of a String, the 
characters within a String are numbered by their initial code unit 
element position as though they were represented using UTF-16."
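
So indexing is by code units, not characters; an astral-plane character 
shows the difference:

// U+1D306 (TETRAGRAM FOR CENTRE) is one character but two code units.
var s = '\uD834\uDF06';
s.length;        // 2 -- length counts 16-bit units, not characters
s.charCodeAt(0); // 0xD834 (high surrogate)
s.charCodeAt(1); // 0xDF06 (low surrogate)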

>> Browsers do the same thing with WindowBase64: though it's specified as
>> DOMString, in practice (as the notes say) it's Unicode.
> If you look at the actual processing model, you take the input array 
> of 16-bit integers, throw if any is not in the set { 0x2B, 0x2F } 
> union [0x30,0x39] union [0x41,0x5A] union [0x61,0x7A] and then treat 
> the rest as ASCII 
> data (which at that point it is).
> It defines this in terms of "Unicode" but that's just because any JS 
> string that satisfies the above constraints can be considered a 
> "Unicode" string if one wishes.
>> Web Storage, also, only works with unicode.
> I'm not familiar with the relevant part of Web Storage.  Can you cite 
> the relevant part please?

The character code conversion gets weird here. If you'd explain it in 
the proper terms, I'd appreciate it.

Load a binary resource via the old charset hack, then save the 
resulting string into localStorage: there are conversion issues, and 
the string doesn't survive the round trip. I'm probably not using the 
right vocabulary; I know the list has seen this issue before, and I'll 
bet someone here can explain it succinctly.

// Image files are easiest to try this with.
// From the article:
function load_binary_resource(url) {
   var req = new XMLHttpRequest();, url, false); // synchronous, per the original article
   // XHR binary charset opt by Marcus Granado 2006
   req.overrideMimeType('text/plain; charset=x-user-defined');
   req.send(null);
   if (req.status != 200) return '';
   return req.responseText;
}

var x = load_binary_resource('imageurl.png');
localStorage.test = x;
(localStorage.test == x); // will return false.
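
For what it's worth, the usual workaround for getting at the bytes is 
to mask each code unit down to its low byte (a sketch, assuming typed 
array support; string_to_bytes is just an illustrative name):

// With charset=x-user-defined, some engines map bytes 0x80-0xFF into a
// private-use range (e.g. 0xF700 + byte); masking with 0xFF recovers
// the original byte either way.
function string_to_bytes(s) {
  var bytes = new Uint8Array(s.length);
  for (var i = 0; i < s.length; i++) {
    bytes[i] = s.charCodeAt(i) & 0xFF; // drop the private-use high byte
  }
  return bytes;
}
var bytes = string_to_bytes(load_binary_resource('imageurl.png'));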

Received on Thursday, 12 January 2012 03:51:41 UTC