- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Fri, 4 Nov 2011 09:44:08 +0200
On Thu, Nov 3, 2011 at 8:13 PM, David Flanagan <dflanagan at mozilla.com> wrote:
> Each tokenizer state would have to add a rule for CR that said "emit LF,
> save the current tokenizer state, and set the tokenizer state to 'after CR
> state'".

The Validator.nu/Gecko tokenizer returns a "last input code unit
processed was CR" flag to the caller. If the tokenizer sees a CR, it
processes it and returns to the caller immediately with the flag set to
true. The caller is responsible for checking whether the next input code
unit is an LF, skipping over it and calling the tokenizer again. This
way, the tokenizer itself does not need the capability of skipping over
a character, and the same machinery that is normally used for dealing
with arbitrary buffer boundaries and for early returns after script end
tags (or after timers, back before the parser moved off the main thread)
does the work. (A rough sketch of the caller side is at the end of this
message.)

>> The parser operates on UTF-16 code units, so a lone surrogate is emitted.
>
> The spec seems pretty unambiguous that it operates on codepoints

The spec is empirically wrong. The wrongness has been reported. The spec
tries to retrofit Unicode theoretical purity onto a legacy where no
purity existed.

The tokenizer operates on UTF-16 code units. document.write() feeds
UTF-16 code units to the tokenizer without lone-surrogate preprocessing,
and neither the tokenizer nor the tree builder does anything about lone
surrogates. When consuming a byte stream, the converter that converts
the (potentially unaligned and potentially foreign-byte-order)
UTF-16-encoded byte stream into a stream of UTF-16 code units is
responsible for treating unpaired surrogates as conversion errors (also
sketched at the end of this message).

Sorry about not mentioning earlier that the problematic tests are also
problematic in this sense.
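Here is a minimal sketch, in Java, of the caller-side CR/LF protocol
described above. This is not the actual Validator.nu code: the Tokenizer
interface, the tokenize() signature, and the lastWasCR() accessor are
all hypothetical names for illustration. The point it shows is that the
skip-one-LF logic lives entirely in the driver, reusing the machinery
that already handles arbitrary buffer boundaries, so the tokenizer's
state machine never needs a "skip the next character" capability.

interface Tokenizer {
    // Tokenizes buf[pos, end). Returns the index of the next unprocessed
    // code unit. Whenever it processes a CR, it emits an LF in its place
    // and returns immediately, just past the CR.
    int tokenize(char[] buf, int pos, int end);

    // True if the last input code unit processed was a CR.
    boolean lastWasCR();
}

final class TokenizerDriver {
    private boolean pendingCR = false; // CR seen at the end of the previous buffer

    void feed(char[] buf, int start, int end, Tokenizer t) {
        int pos = start;
        // If the previous buffer ended in a CR and this one starts with
        // an LF, that LF belongs to a CRLF pair whose CR has already been
        // emitted as LF, so skip it.
        if (pendingCR && pos < end && buf[pos] == '\n') {
            pos++;
        }
        pendingCR = false;
        while (pos < end) {
            pos = t.tokenize(buf, pos, end);
            if (t.lastWasCR()) {
                if (pos < end && buf[pos] == '\n') {
                    pos++; // skip the LF of a CRLF pair
                } else if (pos == end) {
                    pendingCR = true; // decide when the next buffer arrives
                }
            }
        }
    }
}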
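And a similarly hypothetical sketch of where lone surrogates get dealt
with: in the byte-stream-to-code-unit converter, not in the tokenizer or
the tree builder. The class name and the error policy shown (replacing
with U+FFFD) are illustrative assumptions, not Gecko's actual converter.

final class Utf16BytesToCodeUnits {
    // Converts a (possibly byte-swapped) UTF-16 byte stream into UTF-16
    // code units, replacing unpaired surrogates with U+FFFD so that
    // nothing downstream ever sees a lone surrogate.
    static char[] decode(byte[] bytes, boolean bigEndian) {
        int n = bytes.length & ~1; // a real converter would also treat a
                                   // trailing unaligned byte as an error
        char[] out = new char[n / 2];
        for (int i = 0; i < n; i += 2) {
            int hi = bytes[i + (bigEndian ? 0 : 1)] & 0xFF;
            int lo = bytes[i + (bigEndian ? 1 : 0)] & 0xFF;
            out[i / 2] = (char) ((hi << 8) | lo);
        }
        for (int i = 0; i < out.length; i++) {
            char c = out[i];
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < out.length && Character.isLowSurrogate(out[i + 1])) {
                    i++; // a valid pair; keep both halves
                } else {
                    out[i] = '\uFFFD'; // lone high surrogate: conversion error
                }
            } else if (Character.isLowSurrogate(c)) {
                out[i] = '\uFFFD'; // lone low surrogate: conversion error
            }
        }
        return out;
    }
}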
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

Received on Friday, 4 November 2011 00:44:08 UTC