Defining a generic Stream rather than considering only bytes (Re: CfC: publish WD of Streams API; deadline Nov 3)

Hi Dean,

On Thu, Oct 31, 2013 at 11:30 AM, Dean Landolt <dean@deanlandolt.com> wrote:

> I really like the general concepts of this proposal, but I'm confused by
> what seems like an unnecessary limiting assumption: why assume all streams
> are byte streams? This is a mistake node recently made in its streams
> refactor that has led to an "objectMode" and added cruft.
>
> Forgive me if this has been discussed -- I just learned of this today. But
> as someone who's been slinging streams in javascript for years I'd really
> hate to see the "standard" stream hampered by this bytes-only limitation.
> The node ecosystem clearly demonstrates that streams are for more than
> bytes and (byte-encoded) strings.
>
>
To glue Streams to existing binary handling infrastructure such as
ArrayBuffer and Blob, we should have some specialization of Stream for
handling bytes, rather than a generalized Stream that accepts/outputs an
array or a single object of some type. Maybe we can rename the Streams API
to ByteStream, so as not to occupy the name Stream, which sounds more
generic, and start standardizing a generic Stream.
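
To illustrate the split I have in mind, here's a rough sketch (every name
and method shape below is illustrative only, not from the current draft):

  // A generic Stream would carry chunks of whatever type the producer
  // writes: objects, arrays, strings, ...
  var s = new Stream();
  s.write({ user: "dean", action: "login" });
  s.write({ user: "anne", action: "logout" });

  // A byte-specialized ByteStream could integrate with ArrayBuffer/Blob
  // and support size-based reads.
  var bs = new ByteStream();
  bs.write(new Uint8Array([0x01, 0x02, 0x03]).buffer);
  bs.read(2).then(function (result) {
    // result.data would be an ArrayBuffer of exactly 2 bytes
  });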


> In my perfect world any arbitrary iterator could be used to characterize
> stream chunks -- this would have some really interesting benefits -- but I
> suspect this kind of flexibility would be overkill for now. But there's
> no good reason bytes should be the only thing people can chunk up in
> streams. And if we're defining streams for the whole platform they
> shouldn't *just* be tied to a few very specific file-like use cases.
>
> If streams could also consist of chunks of strings (real, native strings)
> a huge swath of the API could disappear. All of readType, readEncoding and
> charset could be eliminated, replaced with simple, composable transforms
> that turn byte streams (of, say, utf-8) into string streams. And vice versa.
>
>
So, for example, XHR would be the point of decoding, and it would return a
Stream of DOMStrings?
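
If so, consuming it might look something like this (just a sketch; the API
shape here is my assumption, not the draft):

  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/live-log");
  xhr.responseType = "stream";  // assumed value, for this sketch
  xhr.send();

  // Once the response starts arriving (event plumbing omitted),
  // xhr.response would be a Stream whose chunks are DOMStrings:
  xhr.response.read().then(function (result) {
    console.log(result.data);  // decoded text; no readEncoding needed
  });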


> Of course the real draw of this approach would be when chunks are neither
> blobs nor strings. Why couldn't chunks be arrays? The arrays could contain
> anything (no need to reserve any value as a sigil). Regardless of the chunk
> type, the zero object for any given type wouldn't be `null` (it would be
> something like '' or []). That means we can use null to distinguish EOF,
> and `chunk == null` would make a perfectly nice (and unambiguous) EOF
> sigil, eliminating yet more API surface. This would give us clean
> object-mode streams for free, and without node's arbitrary limitations.
>

For several reasons, I chose to use .eof rather than null. One of them is
to allow a non-empty final chunk to also signal EOF, rather than requiring
one more read() call.
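
Concretely, the difference is (the result shapes below are my sketch of
the two designs, not exact API; process() and finish() are placeholders):

  // With an .eof flag, the final data and the EOF signal can arrive in
  // one read():
  stream.read().then(function (result) {
    process(result.data);          // non-empty final chunk
    if (result.eof) finish();      // no extra call needed
  });

  // With null-as-EOF, one more read() is always needed just to observe
  // the end of the stream:
  stream.read().then(function (chunk) {
    process(chunk);                // final chunk
    return stream.read();          // extra round trip ...
  }).then(function (chunk) {
    if (chunk === null) finish();  // ... only to learn we're done
  });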

This point can be re-discussed.


> The `size` of an array stream would be the total length of all array
> chunks. As I hinted before, we could also leave the door open to specifying
> chunks as any iterable, where `size` (if known) would just be the `length`
> of each chunk (assuming chunks even have a `length`). This would also allow
> individual chunks to be built of generators, which could be particularly
> interesting if the `size` argument to `read` was specified as a maximum
> number of bytes rather than the total to return -- completely sensible
> considering it has to behave this way near the end of the stream anyway...
>

I don't really understand the last point. Could you please elaborate on
the story and the benefit?

IIRC, it's considered useful and important to be able to cut exactly the
requested size of data into an ArrayBuffer object and get notified (i.e.
have the returned Promise resolved) only when it's ready.
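
For example (a sketch; the result shape is my assumption):

  // read(n) resolves only once exactly n bytes have been cut into an
  // ArrayBuffer, so fixed-size records need no manual buffering or
  // length checks:
  stream.read(16).then(function (result) {
    var view = new DataView(result.data);  // exactly 16 bytes, unless
                                           // the stream ended early
    var recordType = view.getUint32(0);
  });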


> This would lead to a pattern like `stream.read(Infinity)`, which would
> essentially say *give me everything you've got as soon as you can*.
>

In the current proposal, read(), i.e. read() with no size argument, does
this.
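
Something like (sketch):

  // read() with no size argument: "give me what you've got" -- the
  // caller doesn't have to name an amount up front.
  stream.read().then(function (result) {
    handle(result.data);  // handle() is a placeholder
  });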


>  This is closer to node's semantics (where read is async, for added
> scheduling flexibility). It would drain streams faster rather than
> pseudo-blocking for a specific (and arbitrary) size chunk, which ultimately
> can't be guaranteed anyway, so you'll always have to do length checks.
>
> (On a somewhat related note: why is a 0-sized stream specified to throw?
> And why a SyntaxError of all things? A 0-sized stream seems perfectly
> reasonable to me.)
>

A 0-sized Stream is not prohibited.

Do you mean a 0-sized read()/pipe()/skip()? I don't think they make much
sense. Such a call is useful only when you want to sense EOF, and that can
be done with read(1).
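
For example (sketch; the result shape is assumed):

  // Probing for EOF with read(1): if the stream is exhausted, the
  // result carries the eof flag (possibly alongside a final byte).
  stream.read(1).then(function (result) {
    if (result.eof) {
      // no more data will arrive
    }
  });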


> What's particularly appealing to me about the chunk-as-generator idea is
> that these chunks could still be quite large -- hundreds of megabytes, even.
> Just because a potentially large amount of data has become available since
> the last chunk was processed doesn't mean you should have to bring it all
> into memory at once.
>

It's interesting. Could you please give some concrete examples of such a
generator?
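
Just to check my understanding, is it something like this? (A pure guess,
using ES6 generator syntax; everything here is made up for illustration.)

  // A "chunk" that is itself a generator: it yields slices of a large
  // Blob lazily, so the whole chunk never sits in memory at once.
  function* blobChunk(blob) {
    var SLICE = 1024 * 1024;  // 1 MiB per step
    for (var pos = 0; pos < blob.size; pos += SLICE) {
      // Blob.slice() only creates a lazy handle; no bytes are read yet.
      yield blob.slice(pos, Math.min(pos + SLICE, blob.size));
    }
  }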

Received on Thursday, 31 October 2013 04:24:19 UTC