- From: David Flanagan <dflanagan@mozilla.com>
- Date: Thu, 03 Nov 2011 11:13:04 -0700
On 11/3/11 4:21 AM, Henri Sivonen wrote:
> On Thu, Nov 3, 2011 at 1:57 AM, David Flanagan<dflanagan at mozilla.com> wrote:
>> Firefox, Chrome and Safari all seem to do the right thing: wait for the next
>> character before tokenizing the CR.
> See http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1247

I hadn't used the live dom viewer before. That's really useful!

> Firefox tokenizes the CR immediately, emits an LF and then skips over
> the next character if it is an LF. When I designed the solution
> Firefox uses, I believed it was more correct and more compatible with
> legacy than whatever the spec said at the time.

I'm having a Duh! moment... I currently wait for the next character, but
what you describe also works, and allows the document.write() spec to
make sense.

> Chrome seems to wait for the next character before tokenizing the CR.
>
>> And I think this means that the description of document.write needs to be changed.
> All along, I've felt thought that having U+0000 and CRLF handling as a
> stream preprocessing step was bogus and both should happen upon
> tokenization. So far, I've managed to convince Hixie about U+0000
> handling.

Each tokenizer state would have to add a rule for CR that said "emit LF,
save the current tokenizer state, and set the tokenizer state to the
'after CR' state". Actually, tokenizer states that already have a rule for
LF or whitespace would have to integrate this CR rule into that rule.
The new "after CR" state would then have two rules. On LF it would skip
the character and restore the saved state. On anything else it would push
the character back and restore the saved state.

>> Similarly, what should the tokenizer do if the document.write emits half of
>> a UTF-16 surrogate pair as the last character?
> The parser operates on UTF-16 code units, so a lone surrogate is emitted.

The spec seems pretty unambiguous that it operates on codepoints (though
I implemented mine using 16-bit code units). §13.2.1: "The input to the
HTML parsing process consists of a stream of Unicode code points". Also,
§13.2.2.3 includes a list of codepoints beyond the BMP that are parse
errors. And finally, the tests in
http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test
require unpaired surrogates to be converted to the U+FFFD replacement
character. (Though my experience is that modifying my tokenizer to pass
those tests causes other tests to fail, which makes me wonder whether
unpaired surrogates are only supposed to be replaced in some but not all
tokenizer states.)

Thanks, Henri!

David
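For illustration, here is a rough sketch of the "emit LF immediately, then skip a following LF" handling described above for the "after CR" state. It is a simplification under stated assumptions, not how Firefox or any spec text actually implements it: the class name `NewlineNormalizer` and its `feed` method are hypothetical, and a single boolean stands in for "save the current tokenizer state and switch to the after CR state". The point it tries to show is that a CR arriving at the end of one document.write() chunk can be tokenized right away, with the decision about a following LF deferred to the next chunk.

```python
class NewlineNormalizer:
    """Hypothetical helper: turns CR and CRLF into LF at tokenization time."""

    def __init__(self):
        # True while we are in the "after CR" state; this flag is an
        # assumption that stands in for a saved tokenizer state.
        self.after_cr = False

    def feed(self, chunk):
        """Process one chunk (e.g. one document.write() call) and return
        the normalized characters that can be tokenized immediately."""
        out = []
        for ch in chunk:
            if self.after_cr:
                # Leaving the "after CR" state restores the saved state.
                self.after_cr = False
                if ch == "\n":
                    # LF directly after CR: skip it, the LF was already emitted.
                    continue
                # Anything else is "pushed back": fall through and handle it
                # normally below.
            if ch == "\r":
                # Emit LF right away instead of waiting for the next character.
                out.append("\n")
                self.after_cr = True
            else:
                out.append(ch)
        return "".join(out)
```

With this shape, feeding "a\r" and then "\nb" as two separate chunks yields a single LF between "a" and "b", without the first chunk having to wait for more input.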
Received on Thursday, 3 November 2011 11:13:04 UTC