[whatwg/encoding] Clarify that encoding tokens are scalar values (#195) from Andreu Botella on 2020-01-14 (public-webapps-github@w3.org from January 2020)

From: Andreu Botella <notifications@github.com>
Date: Mon, 13 Jan 2020 23:57:37 -0800
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/195@github.com>

It is far from clear in the spec that Unicode tokens are scalar values (code points that are not surrogates), rather than arbitrary code points. Other than the use of `USVString` in IDL, the only mention of scalar values is in the definition of encoding. In fact, the definition of token refers explicitly to code points, not scalar values.

This could actually result in a security issue if a specification weren't careful when using the encoding hooks: encoding handlers based on indices would raise an error on surrogate code points, but the UTF-8 handler would accept them and return a byte sequence that would then fail on decoding.
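To illustrate the mismatch, here is a minimal Python sketch. It is not the spec's algorithm: `naive_utf8_encode` is a hypothetical helper that applies the generic UTF-8 bit layout without any surrogate check, and Python's built-in `cp1252` and `utf-8` codecs stand in for an index-based encoder and a strict UTF-8 decoder.

```python
def naive_utf8_encode(code_point: int) -> bytes:
    """Encode one code point with the generic UTF-8 bit layout.

    Hypothetical helper: it deliberately does NOT reject surrogates,
    mimicking an encoder that is handed code points instead of scalar values.
    """
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point <= 0x7FF:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def index_encoder_accepts(code_point: int) -> bool:
    """Stand-in for an index-based encoder: Python's cp1252 codec."""
    try:
        chr(code_point).encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True
```

With these helpers, `naive_utf8_encode(0xD800)` produces the three bytes `ED A0 80`, which `decodes_as_utf8` rejects, while `index_encoder_accepts(0xD800)` reports an error up front. The two kinds of encoder therefore disagree on the same surrogate input.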

I propose adding some text to the note in the hooks section informing specs that they should only invoke the encoding algorithms with streams built from a `USVString`, as well as adding an assertion in the process algorithm that, if `encoderDecoderInstance` is an encoder instance, `input` is not a surrogate.
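A rough sketch of what the proposed guard could look like, in Python. All names here (`is_scalar_value`, `to_scalar_values`, `checked_encode_token`) are hypothetical and do not come from the spec. The input is assumed to be a sequence of code points already, so any surrogate in it is unpaired.

```python
REPLACEMENT_CHARACTER = 0xFFFD


def is_surrogate(code_point: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def is_scalar_value(code_point: int) -> bool:
    """A scalar value is any code point that is not a surrogate."""
    return 0 <= code_point <= 0x10FFFF and not is_surrogate(code_point)


def to_scalar_values(code_points):
    """Replace each surrogate with U+FFFD, in the spirit of a
    USVString conversion (assumes surrogates here are unpaired)."""
    return [REPLACEMENT_CHARACTER if is_surrogate(cp) else cp
            for cp in code_points]


def checked_encode_token(token: int) -> bytes:
    """Encode one token as UTF-8, enforcing the proposed precondition
    that an encoder's input token is never a surrogate."""
    if not is_scalar_value(token):
        raise ValueError(f"encoder input U+{token:04X} is not a scalar value")
    return chr(token).encode("utf-8")
```

A caller that builds its stream with `to_scalar_values` first never trips the check in `checked_encode_token`; a caller that passes raw code points fails early instead of emitting bytes that cannot be decoded.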

https://github.com/whatwg/encoding/issues/195

Received on Tuesday, 14 January 2020 07:57:39 UTC