
Re: Binary Data - possible topic for joint session

From: Maciej Stachowiak <mjs@apple.com>
Date: Fri, 06 Nov 2009 01:34:16 -0800
Cc: public-script-coord@w3.org, es-discuss Steen <es-discuss@mozilla.org>
Message-id: <8806E7A2-4C2A-40B7-AC0D-7C8197B85FA8@apple.com>
To: Maciej Stachowiak <mjs@apple.com>
+ es-discuss (since posting there seems to have piqued more interest)

From reading over other proposals for binary data, I should mention  
the following operations that seem to be of interest to some  
communities but are not directly provided in this proposal (with ones  
I think are most appropriate for v1 first):

- Subrange/subdata/substring (get a Data that's a range from another's  
buffer - perhaps this could be optimized not to copy)
- Concatenation (specifically the ability to concatenate two immutable  
Data objects and get a new one back without having to go through the  
mutable type).
- Ability to convert to/from strings (with some hardcoded encoding or  
choice of encoding)
- Some or all of the operations of Array
- Base64 encode/decode
- Methods to compute various cryptographic hashes
- Find first or last occurrence of a byte or byte sequence (from a  
given offset)
- Split on a byte or byte sequence

I think it is possible to implement all of these with the primitives  
in my proposal, and in many cases the utility seems dubious (do you  
really want to map() or reduce() binary data one byte at a time?).  
Thus, I lean towards keeping the API relatively minimal, at least for  
starters.
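
As a rough illustration of the claim that these operations can be built from the proposal's primitives, here is a minimal sketch. It assumes hypothetical stand-ins for Data and DataBuilder backed by Uint8Array, uses byteAt()/setByte() in place of the proposed index properties, and treats srcEnd as exclusive (the proposal does not say). The helper names concat, subrange, indexOfBytes and dataFrom are invented for the example.

```javascript
// Hypothetical stand-ins for the proposed types, backed by Uint8Array.
// byteAt()/setByte() replace the proposal's index properties.
class Data {
  #bytes;
  constructor(bytes) { this.#bytes = bytes; }
  get length() { return this.#bytes.length; }
  byteAt(i) { return this.#bytes[i]; }
}

class DataBuilder {
  #bytes;
  constructor(size) { this.#bytes = new Uint8Array(size); } // zero-filled
  get length() { return this.#bytes.length; }
  byteAt(i) { return this.#bytes[i]; }
  setByte(i, value) { this.#bytes[i] = value; }
  copyRange(dstStart, src, srcStart, srcEnd) {
    // Stage through a temporary so overlapping ranges copy correctly.
    const tmp = new Uint8Array(srcEnd - srcStart);
    for (let i = 0; i < tmp.length; i++) tmp[i] = src.byteAt(srcStart + i);
    this.#bytes.set(tmp, dstStart);
  }
  release() {
    const data = new Data(this.#bytes); // adopt the buffer, no copy
    this.#bytes = new Uint8Array(0);    // builder is reset to 0 length
    return data;
  }
}

// Concatenation: one allocation, two bulk copies.
function concat(a, b) {
  const builder = new DataBuilder(a.length + b.length);
  builder.copyRange(0, a, 0, a.length);
  builder.copyRange(a.length, b, 0, b.length);
  return builder.release();
}

// Subrange [start, end): copies, since this sketch has no buffer sharing.
function subrange(data, start, end) {
  const builder = new DataBuilder(end - start);
  builder.copyRange(0, data, start, end);
  return builder.release();
}

// First occurrence of byte sequence `needle` at or after `from`, else -1.
function indexOfBytes(data, needle, from = 0) {
  for (let i = from; i + needle.length <= data.length; i++) {
    let j = 0;
    while (j < needle.length && data.byteAt(i + j) === needle.byteAt(j)) j++;
    if (j === needle.length) return i;
  }
  return -1;
}

// Helper for the example only: build a Data from a list of byte values.
function dataFrom(values) {
  const builder = new DataBuilder(values.length);
  values.forEach((v, i) => builder.setByte(i, v));
  return builder.release();
}
```

With these, concat(dataFrom([1, 2, 3]), dataFrom([4, 5])) yields a five-byte Data. Note that the immutable-to-immutable concatenation still goes through the mutable builder internally, which is the detour the Concatenation item asks to avoid at the API level.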

Regards,
Maciej

On Nov 4, 2009, at 4:26 PM, Maciej Stachowiak wrote:

>
> Many APIs being developed for the Web platform would benefit from a  
> good way to store binary data. It would be useful for this to be  
> specified as part of the ECMAScript language, but it's also  
> plausible to make this a W3C spec that's only intended for use with  
> Web platform APIs. Here is an overview of some of the APIs that  
> could use such a data type, some notes on requirements and design  
> alternatives, and a strawman proposal.
>
> If there's time, I'd like to discuss this at the joint TC-39/HTML  
> WG/Web Apps WG session.
>
> Some APIs that could use this:
>
>    XMLHttpRequest v2 - to receive and send binary data
>    WebSocket - to receive and send binary packets
>    File API - to read binary files
>    Canvas - to get image data in the binary form of an image format  
> (avoiding inefficiency of data: URLs)
>    various storage APIs - to store and retrieve binary data (in  
> combination with other APIs)
>    postMessage - to send binary data cross-window and cross-thread  
> (to Workers) efficiently
>
> I suspect there's more I am not thinking of. A convenient and  
> efficient way to represent binary data could also be useful for pure  
> ES programs.
>
>
> = Current de facto ways for Web apps to deal with binary data:
>
>    Array of numbers with one byte per entry
>    String with one byte stored per UTF-16 code unit
>    String with two bytes stored per UTF-16 code unit
>
> I hope it is obvious why these approaches are not great so I won't  
> go into detail.
>
>
> = Issues for the binary data API:
>
>     Name (potential bikeshed):
>         ByteArray
>         ByteVector
>         BinaryData
>         Data
>
> I like "Data" and similar names. Objective-C has NSData as a  
> distinct type for chunks of binary data - it's not treated as a type  
> of array. I think this makes sense. Often the fact that a chunk of  
> binary data can be treated as an octet sequence is incidental.
>
> ==  Mutable or Immutable (or both?)
>
> Immutable has a number of advantages:
>    - Can share backing store with chunks of binary data that the UA  
> already holds (e.g. in the network cache) without requiring copy-on- 
> write
>    - Can be passed cross-thread without copying, and without  
> breaking shared-nothing semantics
>    - Has the right semantics for passing cross-window (can make a  
> copy in cross-process case, but avoid it in same-process case; or  
> use shared memory in cross-process case without worrying about  
> locking or races)
>    - Follows the approach of ES strings, which are immutable
>
> But there are some significant disadvantages too:
>    - What if you actually want to mutate some piece of binary data  
> you got before passing it along? How to do this efficiently?
>    - What if you want to build a new binary data item from scratch?
>
> With strings, the answer to both building and mutation is to extract  
> pieces and build a new string by concatenation. But that's probably  
> not efficient or convenient enough for the binary data case.
>
> Possible solution: provide immutable Data, but have a DataBuilder  
> class to allow creating new data items or mutating copies of  
> existing ones, which can then give a final immutable product.
>
>
> == What Operations?
>
> Operation set could be a full set of array-like operations,  
> absolutely minimal (just accessors for individual bytes), or middle  
> ground (byte-level accessors plus a few bulk operations like the  
> equivalent of memcpy). I like the middle ground.
>
> == Rough API Proposal
>
> Here's a sketch of a binary data API that's immutable (with mutable  
> builder class), and provides a middle-ground set of operations. The  
> basic idea is that binary data should be considered a first-class  
> datatype in its own right, just as strings are, rather than being  
> thought of as a kind of array.
>
> Data -- global constructor
>    When called or invoked as a constructor with a number parameter,  
> return a new Data object of the specified size, filled with all zero  
> bytes.
>
> Data.prototype -- the initial Data prototype
>
> Data.prototype.builderCopy()
>
>    When called with a Data instance as the this parameter, return a  
> new DataBuilder object starting with the same size and a copy of the  
> bytes in this Data object.
>
> Data instance properties:
>
>    length - size of the Data object - read-only
>    index properties - individual bytes of the Data (similar to array  
> access) - read-only
>
> DataBuilder -- global constructor
>    When called or invoked as a constructor with a number parameter,  
> return a new DataBuilder object of the specified size, filled with  
> all zero bytes.
>
> DataBuilder.prototype.builderCopy()
>
>    When called with a DataBuilder instance as the this parameter,  
> return a new DataBuilder object starting with the same size and a  
> copy of the bytes in this DataBuilder object.
>
> DataBuilder.prototype.copyRange(dstStart, srcObject, srcStart, srcEnd)
>
>    When called with a DataBuilder instance as the this parameter,  
> copy bytes from srcObject starting at offset srcStart up to offset  
> srcEnd. srcObject can be a Data or a DataBuilder, and can be the  
> same as the "this" object. Overlapping ranges are guaranteed to be  
> copied correctly. dstStart is the offset in this DataBuilder at  
> which to begin writing.
>
> DataBuilder.prototype.fill(byte, dstStart, dstEnd)
>
>   Fill with "byte" from dstStart to dstEnd.
>
> DataBuilder.prototype.release()
>
>   Return a Data object of the same size and containing the same  
> bytes as this DataBuilder, and at the same time reset this  
> DataBuilder to 0 length. This is so that the new Data object can  
> adopt the buffer of this DataBuilder without copying, which is what  
> is commonly desired.
>
> DataBuilder instance properties:
>
>    length - size of the DataBuilder object - read-write
>    index properties - individual bytes of the DataBuilder (similar  
> to array access) - read-write
>
>
> Rationale:
>
>   - copyRange() and fill() are the only higher-level operations  
> provided, because they can be implemented much more efficiently for  
> large ranges in native code than in ECMAScript.
>
>   - Data would be returned and taken by all Web APIs; its  
> immutability allows binary data to be passed around without copying.
>
>   - DataBuilder allows creation and mutation while minimizing copies  
> and letting most of a system maintain the benefits of immutability.
>
>   - DataBuilder.prototype.release() is specifically designed to let  
> a program use mutation to build up a chunk of binary data, then pass  
> it off to code that should not mutate it or across boundaries with  
> shared-nothing semantics (like Workers), without requiring a copy  
> after initially building.
>
> Sorry that this is so sketchy, but I thought this would make a good  
> starting point for discussion.
>
> Regards,
> Maciej
>
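
To make the copyRange() overlap guarantee and the release() hand-off concrete, here is a small sketch. It is an assumption-laden illustration, not the proposed API: SketchBuilder is a hypothetical stand-in backed by Uint8Array, it exposes its buffer as a plain `bytes` property for brevity, and srcEnd is treated as exclusive.

```javascript
// Hypothetical stand-in for DataBuilder; not the proposed API itself.
class SketchBuilder {
  constructor(size) { this.bytes = new Uint8Array(size); } // zero-filled
  get length() { return this.bytes.length; }
  fill(byte, dstStart, dstEnd) { this.bytes.fill(byte, dstStart, dstEnd); }
  copyRange(dstStart, src, srcStart, srcEnd) {
    // TypedArray.prototype.set copies correctly even when the source
    // view shares this builder's buffer, so overlap is safe.
    this.bytes.set(src.bytes.subarray(srcStart, srcEnd), dstStart);
  }
  release() {
    // Hand the existing buffer to the result and reset to 0 length.
    // Only the wrapper is frozen here; a real Data would hide the buffer.
    const released = Object.freeze({ bytes: this.bytes, length: this.bytes.length });
    this.bytes = new Uint8Array(0);
    return released;
  }
}

const builder = new SketchBuilder(6);
builder.fill(0xff, 0, 2);            // ff ff 00 00 00 00
builder.copyRange(1, builder, 0, 4); // overlapping self-copy: ff ff ff 00 00 00

const bufferBefore = builder.bytes.buffer;
const data = builder.release();

console.log(Array.from(data.bytes));              // [255, 255, 255, 0, 0, 0]
console.log(builder.length);                      // 0
console.log(data.bytes.buffer === bufferBefore);  // true: adopted, not copied
```

A naive forward byte-by-byte loop would smear 0xff across offsets 1 to 4 in the self-copy above, which is why the proposal calls out overlapping ranges explicitly.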
Received on Friday, 6 November 2009 09:34:57 UTC
