Re: [whatwg/encoding] Add TextEncoderStream and TextDecoderStream transform streams (#149)

> Is there a reason why TextDecoderStream doesn't provide a mode with the semantics of the "decode" algorithm for BOM handling

The short answer is "because TextDecoder doesn't". I don't know the historical reason why TextDecoder doesn't.

I agree that BOM sniffing is necessary to parse legacy content, and that it can be hard to get right. However, I'd like to discourage people from relying on this behaviour. I'd like to move away from any kind of heuristic behaviour and towards a world where everyone uses UTF-8, and that's what the default behaviour encourages.

If we find compelling use cases for easily reading legacy content we may need to support it in future, but my personal preference would be to leave it out until it is proven necessary.
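
For anyone who does need BOM sniffing today, it can be layered on top of `TextDecoder` in user code. The following is only a minimal sketch, not part of the PR: `sniffBOM` and `decodeWithBOMSniff` are hypothetical helper names, and it assumes the whole input is already in memory (a streaming version would have to buffer the first three bytes before choosing a decoder).

```typescript
// Hypothetical helper: returns the encoding label implied by a leading BOM,
// or null if the input does not start with one.
function sniffBOM(bytes: Uint8Array): string | null {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return "utf-16be";
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return "utf-16le";
  }
  return null;
}

// Hypothetical helper: a BOM, if present, overrides the caller's fallback label.
// TextDecoder (with the default ignoreBOM = false) then strips the BOM that
// matches its own encoding, so the BOM does not appear in the result.
function decodeWithBOMSniff(bytes: Uint8Array, fallbackLabel: string): string {
  const label = sniffBOM(bytes) ?? fallbackLabel;
  return new TextDecoder(label).decode(bytes);
}

// UTF-16LE BOM followed by "hi"; the fallback label is ignored here.
const sniffed = decodeWithBOMSniff(
  new Uint8Array([0xff, 0xfe, 0x68, 0x00, 0x69, 0x00]),
  "utf-8",
);
console.log(sniffed); // "hi"
```

Keeping this in user code makes the heuristic explicit at the call site rather than a default of the platform API.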

> Also, does one input chunk always result in one output chunk that corresponds to potential pending partial code unit sequence and the input chunk except for potential partial code unit sequence at the end of the input chunk?

Input-chunk-to-output-chunk correspondence is normally 1:1, except that we don't output empty chunks, and an extra chunk may be output at the end of the stream if we discover the input was incomplete.
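
To make the two exceptions concrete, here is a small sketch that feeds byte chunks into a `TextDecoderStream` and collects the output chunks. `decodeChunks` is a hypothetical helper name, and the byte values are just an illustrative input: a euro sign (`E2 82 AC`) split across two chunks, followed by a chunk that ends in a dangling lead byte.

```typescript
// Hypothetical helper: writes each byte array as one input chunk and
// returns the output chunks exactly as the stream produced them.
async function decodeChunks(inputChunks: number[][]): Promise<string[]> {
  const decoder = new TextDecoderStream(); // UTF-8 by default
  const writer = decoder.writable.getWriter();
  const reader = decoder.readable.getReader();

  // Write concurrently with reading so that backpressure cannot stall us.
  const writing = (async () => {
    for (const bytes of inputChunks) {
      await writer.write(new Uint8Array(bytes));
    }
    await writer.close();
  })();

  const output: string[] = [];
  for (;;) {
    const result = await reader.read();
    if (result.done) break;
    output.push(result.value);
  }
  await writing;
  return output;
}

// Chunk 1 holds only a lead byte, so it yields no output chunk.
// Chunk 2 completes the euro sign. Chunk 3 yields "a" and leaves a lead byte
// pending, which becomes an extra U+FFFD chunk when the stream closes.
decodeChunks([[0xe2], [0x82, 0xac], [0x61, 0xe2]]).then((output) => {
  console.log(output); // ["€", "a", "\uFFFD"]
});
```

Three input chunks produce three output chunks here, but only by coincidence: one input chunk produced nothing and the end of the stream produced one.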

> (If not, I'm a bit worried about Web content developing a dependency on browser-specific chunk boundaries where the boundaries are not supposed to mean anything.)

The semantics are strictly greedy. Given the same chunks as input, every browser is required to have exactly the same chunk boundaries in the output.
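
One way to picture "greedy" is as a loop over `TextDecoder.decode()` in streaming mode: each input chunk immediately yields everything that can be decoded so far, and only an incomplete trailing sequence is held back. The sketch below is a model of that reading under my own assumptions, not the spec text; `greedyChunks` is a hypothetical name.

```typescript
// Hypothetical model of the transform: one decode() call per input chunk,
// empty results dropped, and a final flush at end of stream.
function greedyChunks(inputChunks: number[][]): string[] {
  const decoder = new TextDecoder(); // UTF-8
  const output: string[] = [];
  for (const bytes of inputChunks) {
    const text = decoder.decode(new Uint8Array(bytes), { stream: true });
    if (text !== "") output.push(text);
  }
  const tail = decoder.decode(); // end of stream: flush anything pending
  if (tail !== "") output.push(tail);
  return output;
}

// The bytes of "hé!" (68 C3 A9 21), split in two different places.
const splitInsideChar = greedyChunks([[0x68, 0xc3], [0xa9, 0x21]]);
const splitAfterChar = greedyChunks([[0x68, 0xc3, 0xa9], [0x21]]);

console.log(splitInsideChar); // ["h", "é!"]
console.log(splitAfterChar); // ["hé", "!"]
console.log(splitInsideChar.join("") === splitAfterChar.join("")); // true
```

The output boundaries are a pure function of the input boundaries, so they are reproducible across implementations, but they still carry no meaning of their own: code that consumes the stream should depend only on the concatenated text.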

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/pull/149#issuecomment-412076967

Received on Friday, 10 August 2018 13:08:45 UTC