Re: [whatwg] Handling out of memory issues with getImageData/createImageData from Mark S. Miller on 2015-09-27 (public-whatwg-archive@w3.org from September 2015)

From: Mark S. Miller <erights@google.com>
Date: Sun, 27 Sep 2015 08:30:30 -0700
To: Anne van Kesteren <annevk@annevk.nl>
Cc: Justin Novosad <junov@google.com>, WHAT Working Group <whatwg@whatwg.org>, Mark Miller <erights@gmail.com>
Message-ID: <CABHxS9iXYqTZjUq71tV2-uxWk277crRd0tu=gbgvJMPdfR=daw@mail.gmail.com>
On Sat, Sep 26, 2015 at 7:34 AM, Anne van Kesteren <annevk@annevk.nl> wrote:

> On Fri, Sep 25, 2015 at 4:48 PM, Justin Novosad <junov@google.com> wrote:
> > Currently there is no spec'ed behavior for handling out-of memory issues
> > for the specific case of attempting to allocate a large buffer through
> > image data APIs.
>
> Actually, there is no specified behavior for out-of-memory behavior,
> period. This is a problem that starts with the ECMAScript standard and
> everything that builds upon it.
>
> I have seen Mark Miller discuss some of the issues surrounding this
> and perhaps even the necessity to eventually define it, but so far
> this has not happened. Not sure if the full story is documented
> somewhere. Mark?
>
>
> https://esdiscuss.org/topic/using-max-stack-limit-to-determine-current-js-engine-and-revision#content-7
> indicates there may be security issues with throwing out-of-memory
> exceptions.


Well, the full story is never documented ;). However, that post and the
links from there:

http://www.eros-os.org/pipermail/e-lang/2007-January/011817.html
https://github.com/google/caja/issues/460

are a good start. The security issue is serious and needs to be fixed. It
cannot practically be fixed by libraries without additional help from the
platform. The problem is that

    * In a language that implicitly allocates everywhere, like JavaScript,
Java, and many other OO languages, it is impossible to prevent code from
causing OOM.
    * If OOM is thrown (see the first link for Java/Joe-E issues), and the
language has try/finally, it is impossible to prevent the OOM from being
masked.
    * In such languages, it is impossible to program defensively against
the pervasive possibility of OOM -- if execution simply resumes in that
context as if nothing bad happened.
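
To make the masking point concrete, here is a minimal TypeScript sketch. It
assumes the platform reports OOM as an ordinary thrown exception; the OOM is
simulated with a plain Error, and the function names are hypothetical.

```typescript
// Hypothetical stand-in for an allocation that fails. The OOM is simulated
// with a plain Error; a real engine may not throw anything catchable.
function allocateHugeBuffer(): number[] {
  throw new Error("simulated out of memory");
}

// Code that, deliberately or by accident, masks the failure.
function untrustedCode(): string {
  try {
    allocateHugeBuffer();
    return "allocated";
  } finally {
    // A return inside finally discards the exception that is in flight,
    // so the caller never learns that the allocation failed.
    return "carried on as if nothing happened";
  }
}

const outcome = untrustedCode();
console.log(outcome);
```

The caller receives an ordinary return value and has no way to tell that
execution resumed after a failed allocation.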

In Joe-E we took the painful step of outlawing the Java try/finally from
the Joe-E subset of Java for this reason. There was no other reason to
outlaw try/finally as there's nothing else inherently unsafe about it. We
really tried to find another solution but under our constraints -- no
rewriting of the Java code and no changes to the JVM -- we could not.

By preventing Joe-E code from catching VirtualMachineErrors and from doing
a try/finally, the Joe-E code was preemptively terminated immediately on
occurrence of a VirtualMachineError. Only the spawner of the Joe-E
computation could react to this termination of the computation it spawned.

This mirrors one of the many things that Erlang gets right. When a program
is confused, that program is the last one you want to ask to recover from
the confusion, since it is already impaired by its own confusion. If you
don't know what is still true, you are unlikely to engage in repair actions
correctly. Better to preemptively terminate some large unit containing the
confusion and recover by
    * restarting from earlier known good state, or
    * if this is not yet feasible, propagating the termination to a yet
larger granularity of computation.

This is the "fail stop" philosophy of "Death Before Confusion". The
contrasting philosophy appropriate for some computation is "best efforts".
Some JavaScript code is best served by one and some by the other. Security
enforcing code must maintain its own integrity at the price of termination
(and restart from some coarser grain). Web pages using JavaScript only to
spice up the user experience are often best served by best efforts. Erlang
itself is an interesting case study, as its original motivating problem --
telephone switches -- places a higher priority on uptime than on integrity.
Nevertheless, both Erlang and the Tandem non-stop architecture found that
uptime in the large is best served by fail-stop in the small combined with
coarser-grain recovery logic.

Because JavaScript comes from such a long legacy of de facto best efforts
architecture, I think a direct de jure shift to fail-stop is unlikely.
Instead, what we need is a trap-handling mechanism (Erlang "supervisor",
KeyKOS "keeper"), where different policies can be expressed by user-defined
trap handlers. When multiple policies co-exist, the most severe one wins.
For concreteness, I'll
make here a first sketch:

On OOM, the platform first scans the stack to enumerate all realms
represented by in-progress stack frames as of that moment. (In progress
meaning that the stack frame still would have been there even if that
platform had implemented proper-tail-call.) It gathers the trap handlers
associated with each of those realms. Each trap handler is a pair of a
string and an optional function.

The string indicates the choice of trap handling strategy, where these
strategies are ordered by severity. Among the gathered strategies, the most
severe wins and the rest are discarded. From least to most severe, they are:

"THROW"
"ABORT_JOB"
"REFRESH"
"ABORT_EVENT_LOOP"

All strategies other than "THROW" cause the current turn/job to first be
preemptively terminated without running catch or finally blocks. If during
any one trap handling strategy we run out of reserve memory, then we
automatically escalate to the next more severe strategy. Alternatively, if
a trap handling function is itself associated with yet another
otherwise-uninvolved realm with its own trap handler, then an OOM inside
this trap handler might be handled by that handler's handler.
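
The selection and escalation rules described above can be sketched in
TypeScript as follows. Only the four strategy names come from the sketch
itself. The types, the function names, the tie-breaking rule, and the
boolean "ran out of reserve memory" signal are all assumptions made for
illustration. The handler's-handler alternative is not modelled.

```typescript
// Strategy names are taken from the sketch; every other name is hypothetical.
type Strategy = "THROW" | "ABORT_JOB" | "REFRESH" | "ABORT_EVENT_LOOP";

// Ordered from least to most severe.
const BY_SEVERITY: readonly Strategy[] = [
  "THROW",
  "ABORT_JOB",
  "REFRESH",
  "ABORT_EVENT_LOOP",
];

// A trap handler pairs a strategy string with an optional function.
interface TrapHandler {
  strategy: Strategy;
  handle?: () => void;
}

// Picks the most severe strategy among the gathered handlers.
// Assumption: on a tie, the handler gathered first is kept.
function mostSevere(gathered: readonly TrapHandler[]): TrapHandler | undefined {
  let winner: TrapHandler | undefined;
  for (const candidate of gathered) {
    if (
      winner === undefined ||
      BY_SEVERITY.indexOf(candidate.strategy) > BY_SEVERITY.indexOf(winner.strategy)
    ) {
      winner = candidate;
    }
  }
  return winner;
}

// Hypothetical callback: carries out one strategy and returns false if
// reserve memory ran out before it finished.
type RunStrategy = (strategy: Strategy) => boolean;

// Tries the chosen strategy, escalating to the next more severe one each
// time the current one runs out of reserve memory.
function runWithEscalation(chosen: Strategy, run: RunStrategy): Strategy | undefined {
  const start = BY_SEVERITY.indexOf(chosen);
  for (const strategy of BY_SEVERITY.slice(start)) {
    if (run(strategy)) {
      return strategy;
    }
  }
  // Even the most severe strategy failed; the sketch does not say what follows.
  return undefined;
}

// Usage: three realms on the stack, each with its own handler.
const selected = mostSevere([
  { strategy: "THROW" },
  { strategy: "REFRESH" },
  { strategy: "ABORT_JOB" },
]);
```

Modelling the strategies as an ordered array keeps both "most severe wins"
and "escalate to the next more severe" as simple index comparisons.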

This is just a first sketch. It is probably too complicated in some ways
and insufficiently general in others. I post it mostly to get the
discussion started.

-- 
    Cheers,
    --MarkM
Received on Sunday, 27 September 2015 15:30:57 UTC