RE: Defining generic Stream rather than considering only bytes (Re: CfC: publish WD of Streams API; deadline Nov 3)

A few comments inline below -

________________________________
> From: tyoshino@google.com 
> Date: Thu, 31 Oct 2013 13:23:26 +0900 
> To: dean@deanlandolt.com 
> CC: art.barstow@nokia.com; public-webapps@w3.org 
> Subject: Defining generic Stream than considering only bytes (Re: CfC: 
> publish WD of Streams API; deadline Nov 3) 
> 
> Hi Dean, 
> 
> On Thu, Oct 31, 2013 at 11:30 AM, Dean Landolt 
> <dean@deanlandolt.com<mailto:dean@deanlandolt.com>> wrote: 
> I really like the general concepts of this proposal, but I'm confused 
> by what seems like an unnecessarily limiting assumption: why assume 
> all streams are byte streams? This is a mistake node recently made in 
> its streams refactor, one that has led to an "objectMode" and added 
> cruft. 
> 
> Forgive me if this has been discussed -- I just learned of this today. 
> But as someone who's been slinging streams in javascript for years I'd 
> really hate to see the "standard" stream hampered by this bytes-only 
> limitation. The node ecosystem clearly demonstrates that streams are 
> for more than bytes and (byte-encoded) strings. 
> 
> 
> To glue Streams to existing binary-handling infrastructure such as 
> ArrayBuffer and Blob, we should have some specialization of Stream for 
> handling bytes, rather than a generalized Stream that would 
> accept/output an array or a single object of some type. Maybe we 
> should rename the Streams API to ByteStream, so as not to occupy the 
> name Stream, which sounds more generic, and then start standardizing a 
> generic Stream. 

Dean, it sounds like your concern isn't just around the naming, but rather around how data is read out of a stream. I've reviewed both the Node Streams and Buffer APIs previously, and from my understanding the data is provided as either a Buffer or a String. This is on par with ArrayBuffer/String. What data do you want to obtain that is missing, and for what scenario? Are these data types that already exist in the web platform, or new types you think are missing?
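
(For concreteness, node's "objectMode" -- the cruft Dean refers to -- looks roughly like the sketch below. This is against node 0.10-era streams2, with a no-op _read since the example pushes data manually; it shows byte streams handing back Buffers or strings, while object-mode streams hand back arbitrary values one per call.)

    var Readable = require('stream').Readable;

    // Byte mode: read() hands back Buffers (or strings, if an encoding
    // is set).
    var bytes = new Readable();
    bytes._read = function () {};   // no-op; we push manually below
    bytes.push(new Buffer('abc'));
    bytes.push(null);               // null is reserved as the EOF sigil

    // Object mode: read() hands back arbitrary values, one per call,
    // and the size argument to read() is ignored.
    var objects = new Readable({ objectMode: true });
    objects._read = function () {};
    objects.push({ row: 1 });
    objects.push(null);             // null still means EOF, so a chunk
                                    // can never itself be null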

> 
> In my perfect world any arbitrary iterator could be used to 
> characterize stream chunks -- this would have some really interesting 
> benefits -- but I suspect this kind of flexibility would be overkill 
> for now. But there's no good reason bytes should be the only thing 
> people can chunk up in streams. And if we're defining streams for the 
> whole platform they shouldn't just be tied to a few very specific 
> file-like use cases. 
> If streams could also consist of chunks of strings (real, native 
> strings) a huge swath of the API could disappear. All of readType, 
> readEncoding and charset could be eliminated, replaced with simple, 
> composable transforms that turn byte streams (of, say, utf-8) into 
> string streams. And vice versa. 
> 
> 
> So, for example, XHR would be the point of decoding and would return 
> a Stream of DOMStrings? 
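
(To make the transform idea concrete, here's a rough sketch. I'm assuming a read() that resolves with a { data, eof } pair -- my shorthand, not the draft's exact shape -- and a hypothetical WritableReadableStream with write()/close(); only TextDecoder and its incremental { stream: true } mode are real, per the Encoding spec.)

    // Hypothetical: wrap a byte stream so consumers see DOMString chunks.
    function utf8ToStringStream(byteStream) {
      var decoder = new TextDecoder('utf-8');
      var out = new WritableReadableStream();   // hypothetical type
      (function pump() {
        byteStream.read().then(function (result) {
          // stream: false on the final call flushes the decoder
          out.write(decoder.decode(result.data, { stream: !result.eof }));
          if (result.eof) out.close();
          else pump();
        });
      })();
      return out;
    }
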
> 
> Of course the real draw of this approach would be when chunks are 
> neither blobs nor strings. Why couldn't chunks be arrays? The arrays 
> could contain anything (no need to reserve any value as a sigil). 
> Regardless of the chunk type, the zero object for any given type 
> wouldn't be `null` (it would be something like '' or []). That means we 
> can use null to distinguish EOF, and `chunk == null` would make a 
> perfectly nice (and unambiguous) EOF sigil, eliminating yet more API 
> surface. This would give us clean object-mode streams for free, and 
> without node's arbitrary limitations. 
> 
> For several reasons, I chose to use .eof rather than null. One of 
> them is to allow a non-empty final chunk to signal EOF, rather than 
> requiring one more read() call. 
> 
> This point can be re-discussed. 

I thought EOF made sense here as well, but it's something that can be changed. Your proposal is interesting - is something like this currently implemented anywhere? This behavior feels like it'd require several changes elsewhere, since some APIs and libraries may explicitly look for an EOF.
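
(Side by side, in the same assumed { data, eof } shorthand -- consume() and done() are placeholders:)

    // Current proposal: an explicit flag, so the final read() can carry
    // data and signal EOF at the same time.
    stream.read().then(function (result) {
      consume(result.data);
      if (result.eof) done();
    });

    // Dean's suggestion: null as the EOF sigil. It costs one extra
    // read() at the very end, but .eof -- and any reserved in-band
    // value -- disappears from the API surface.
    stream.read().then(function (chunk) {
      if (chunk === null) return done();
      consume(chunk);
    });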

> 
> The `size` of an array stream would be the total length of all array 
> chunks. As I hinted before, we could also leave the door open to 
> specifying chunks as any iterable, where `size` (if known) would just 
> be the `length` of each chunk (assuming chunks even have a `length`). 
> This would also allow individual chunks to be built of generators, 
> which could be particularly interesting if the `size` argument to 
> `read` was specified as a maximum number of bytes rather than the total 
> to return -- completely sensible considering it has to behave this way 
> near the end of the stream anyway... 
> 
> I don't really understand the last point. Could you please elaborate 
> on the scenario and its benefit? 
> 
> IIRC, it's considered to be useful and important to be able to cut an 
> exact requested size of data into an ArrayBuffer object and get 
> notified (the returned Promise gets resolved) only when it's ready. 
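
(The two read(size) semantics at issue, in the same shorthand:)

    // Exact-size semantics (current proposal): the promise resolves
    // only once all 1024 requested bytes are available, or the stream
    // ends first.
    stream.read(1024).then(function (result) {
      // result.data.byteLength === 1024, except on the final chunk
    });

    // Maximum-size semantics (Dean's suggestion): resolve with whatever
    // has arrived, up to 1024 bytes, so callers always length-check.
    stream.read(1024).then(function (result) {
      // 0 < result.data.byteLength <= 1024
    });
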
> 
> This would lead to a pattern like `stream.read(Infinity)`, which would 
> essentially say: give me everything you've got as soon as you can. 
> 
> In the current proposal, read() -- i.e. read() with no argument -- 
> does this. 
> 
> This is closer to node's semantics (where read is async, for added 
> scheduling flexibility). It would drain streams faster rather than 
> pseudo-blocking for a specific (and arbitrary) size chunk which 
> ultimately can't be guaranteed anyway, so you'll always have to do 
> length checks. 
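
(In that shorthand, the drain pattern under the current proposal's no-argument read() would be something like:)

    // Drain as fast as data arrives: read() with no argument resolves
    // with everything received since the previous read.
    function drain(stream, onChunk) {
      return stream.read().then(function (result) {
        onChunk(result.data);
        if (!result.eof) return drain(stream, onChunk);
      });
    }
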
> 
> (On a somewhat related note: why is a 0-sized stream specified to 
> throw? And why a SyntaxError of all things? A 0-sized stream seems 
> perfectly reasonable to me.) 
> 
> A 0-sized Stream is not prohibited. 
> 
> Do you mean a 0-sized read()/pipe()/skip()? I don't think those make 
> much sense. They would be useful only when you want to sense EOF, and 
> that can be done with read(1). 
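
(That is, roughly -- handleEndOfStream() being a placeholder -- noting that read(1) does consume a byte if any data remains:)

    stream.read(1).then(function (result) {
      if (result.eof) handleEndOfStream();
    });
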
> 
> What's particularly appealing to me about the chunk-as-generator idea 
> is that these chunks could still be quite large -- hundreds of 
> megabytes, even. Just because a potentially large amount of data has 
> even. Just because a potentially large amount of data has become 
> available since the last chunk was processed doesn't mean you should 
> have to bring it all into memory at once. 
> 
> It's interesting. Could you please list some concrete examples of 
> such a generator? 
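
(One concrete shape this could take -- entirely illustrative, using an ES6 generator to expose a large region of anything with a slice(), e.g. a Blob or ArrayBuffer, so a hundreds-of-megabytes "chunk" never has to be resident in memory all at once:)

    // A chunk spanning `length` bytes of `source`, yielded as 64 KB
    // views on demand instead of materialized up front.
    function* largeChunk(source, offset, length) {
      var BLOCK = 64 * 1024;
      for (var pos = 0; pos < length; pos += BLOCK) {
        var end = offset + Math.min(pos + BLOCK, length);
        yield source.slice(offset + pos, end);
      }
    }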

Received on Thursday, 31 October 2013 06:51:52 UTC