- From: Jonas Sicking <jonas@sicking.cc>
- Date: Thu, 22 Aug 2013 00:28:48 -0700
- To: Isaac Schlueter <i@izs.me>
- Cc: Austin William Wright <aaa@bzfx.net>, Domenic Denicola <domenic@domenicdenicola.com>, Takeshi Yoshino <tyoshino@google.com>, "public-webapps@w3.org" <public-webapps@w3.org>
On Fri, Aug 9, 2013 at 12:47 PM, Isaac Schlueter <i@izs.me> wrote:
> Jonas,
>
> What does *progress* mean here?
>
> So, you do something like this:
>
> var p = stream.read()
>
> to get a promise (of some sort). That read() operation is (if we're
> talking about TCP or FS) a single operation. There's no "50% of the
> way done reading" moment that you'd care to tap into.
>
> Even from a conceptual point of view, the data is either:
>
> a) available (and the promise is now fulfilled)
> b) not yet available (and the promise is not yet fulfilled)
> c) known to *never* be available because:
>    i) we've reached the end of the stream (and the promise is fulfilled
>       with some sort of EOF sentinel), or
>    ii) because an error happened (and the promise is broken).
>
> So.. where's the "progress"? A single read() operation seems like it
> ought to be atomic to me, and indeed, the read[2] function either
> returns some data (a), no data (c-i), raises EWOULDBLOCK (b), or
> raises some other error (c-ii). But, whichever of those it does, it
> does right away. We only get woken up again (via
> epoll/kqueue/CPIO/etc) once we know that the file descriptor (or
> HANDLE in windows) is readable again (and thus, it's worthwhile to
> attempt another read[2] operation).

Hi Isaac,

Sorry for taking so long to respond. It took me a while to understand where the disconnect came from. I also needed to mull over how a consumer is actually likely to consume data from a Stream.

Having looked over the Node.js API more, I think I see where the misunderstanding is coming from. The source of confusion is likely that Node.js and the proposal in [1] are very different. Specifically, in Node.js the read() operation is synchronous and operates on the currently buffered data. In [1] the read() operation is asynchronous and isn't restricted to just the currently buffered data.

From my point of view there are two rough categories of ways of reading data from an asynchronous Stream:

A) The Stream hands data to the consumer as soon as the data is available. I.e. the Stream doesn't buffer data longer than until the next opportunity to fire a callback to the consumer.

B) The Stream allows the consumer to pull data out of the stream at whatever pace, and in whatever block size, the consumer finds appropriate. If the data isn't yet available, a callback is used to notify the consumer when it is.

A is basically the Stream pushing the data to the consumer, and B is the consumer pulling the data from the Stream.

In Node.js, doing A looks something like:

stream.on('readable', function() {
  var buffer;
  while ((buffer = stream.read())) {
    processData(buffer);
  }
});

In the proposal in [1] you would do this with the following code:

stream.readBinaryChunked().ondata = function(e) {
  processData(e.data);
};

(Side note: it's unclear to me why the Node.js API forces readable.read() to be called in a loop. Is that to avoid having to flatten internal buffer fragments? Without that, the two would essentially be the same apart from some minor syntactical differences.)

Here it definitely doesn't make sense to deliver progress notifications. Rather than delivering a progress notification to the consumer, you simply deliver the data.
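To illustrate the side note above, here is a rough sketch, not taken from either API, of how the Node.js 'readable'/read() loop could be wrapped to look like the ondata-style delivery in [1]. The pushStyle() helper and the ondata/onend callbacks on the returned object are names I made up for this illustration:

// Wrap a Node.js readable stream so that chunks are pushed to an
// ondata callback as soon as they are available, roughly matching
// the shape of readBinaryChunked() from [1].
function pushStyle(stream) {
  var target = {};
  stream.on('readable', function() {
    var buffer;
    // Drain whatever is currently buffered; read() returns null once
    // the internal buffer is empty.
    while ((buffer = stream.read()) !== null) {
      if (target.ondata) target.ondata({ data: buffer });
    }
  });
  stream.on('end', function() {
    if (target.onend) target.onend();
  });
  return target;
}

// Usage, mirroring the [1] example above:
pushStyle(stream).ondata = function(e) {
  processData(e.data);
};

Modulo the extra loop, the two push-style forms deliver the same chunks, which is why no separate progress notification is needed in this mode.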
The way you would do B in Node.js looks something like:

stream.on('readable', function() {
  var buffer;
  if ((buffer = stream.read(10))) {
    processTenBytes(buffer);
  }
});

The same thing using the proposal in [1] looks like:

stream.readBinary(10).then(function(buffer) {
  processTenBytes(buffer);
});

An important difference here is that in the Node.js API, the 'read 10 bytes' operation either immediately returns a result or immediately fails, depending on how much data we currently have buffered. I.e. the read() call is synchronous, and the caller is expected to keep calling read(10) until the call succeeds. Of course there's also a very useful callback which makes calling again very easy. But between the calls to read(), the Stream doesn't really know that someone is waiting to read 10 bytes of data.

The API in [1] instead makes the read() call asynchronous. That means we can always let the call succeed (unless there's an error on the stream, of course). If we don't have enough data buffered currently, we simply call the success callback later than if we had had all the requested data buffered already.

This is also the place where progress notifications could be delivered, though that is by no means an important aspect of the API. But since the read() operation is asynchronous, we can deliver progress notifications as we buffer up enough data to fulfill it. I hope that makes it clearer how progress notifications come into play.

So to be clear, progress notifications are by no means the important difference here. The important difference is whether we make read() synchronous and operating on the currently buffered data, or asynchronous and operating on the full data stream.

As far as I can tell there is no difference capability-wise between the two APIs. I.e. both handle things like congestion equally well, and both handle consumer-pulling as well as stream-pushing of data. The difference is only in syntax, though that doesn't make the differences any less important. Actually, the proposal in [1] is lacking the ability to unshift() data, but that's an obvious capability that we should add.

I think on the surface the proposal in [1] makes things more convenient for the consumer. The consumer always gets a success call for each read(). When the data actually arrives into the stream is entirely transparent to the consumer. And if we are moving to a world more based around promises for asynchronous operations, then this fits very well there. However, I think in practice the Node.js API might have several advantages. The main concern I have with the API in [1] is that there might be performance implications of returning to the event loop for every call to read(). Also, the fact that pull and push reading use the same API is pretty cool.

In general I suspect that most consumers don't actually know how many bytes they want to consume from a stream. I would expect that many streams use formats with terminators rather than fixed-length units of data. So I would expect it to be a common pattern to guess at how many bytes can be consumed, look at the data and consume as much as possible, and then use unshift() to put back any data that can't be consumed until more data has arrived (a sketch of this follows below).

Does anyone have examples of code that uses the Node.js API? I'd love to look at how people practically end up consuming data.
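To make that guess-and-unshift pattern concrete, here is a hedged sketch written against the readBinary() call from [1] plus the unshift() capability that, as noted above, would still need to be added. parseRecords(), handleRecord() and handleDone() are made-up consumer functions for some terminator-delimited format, and how [1] actually signals end-of-stream is an assumption in this sketch:

// Guess a block size, consume as many complete records as possible,
// and unshift() the incomplete tail so the next read() sees it again.
function consume(stream) {
  stream.readBinary(4096).then(function(buffer) {
    // Assumption: end-of-stream surfaces as an empty or absent buffer.
    if (!buffer || buffer.byteLength === 0) {
      handleDone();
      return;
    }
    // Hypothetical parser: returns { records: [...], rest: ArrayBuffer }.
    var parsed = parseRecords(buffer);
    parsed.records.forEach(handleRecord);
    if (parsed.rest.byteLength > 0) {
      stream.unshift(parsed.rest);   // put back the incomplete record
    }
    consume(stream);                 // keep pulling
  }, function(error) {
    handleDone(error);               // stream error
  });
}

The exact guessed block size doesn't matter much; the point is that the consumer never has to know the record boundaries ahead of time.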
> Are you proposing that every step in the TCP dance is somehow exposed
> on the promise returned by read()? That seems rather inconvenient and
> unnecessary, not to mention difficult to implement, since the TCP
> stack is typically in kernel space.

I'm not really sure how the TCP dance plays in here, but I definitely wasn't planning on exposing that.

I hope the description above makes it clearer how the [1] proposal works.

[1] http://lists.w3.org/Archives/Public/public-webapps/2013AprJun/0727.html