[whatwg] document.write("\r"): the spec doesn't say how to handle it. from David Flanagan on 2011-11-03 (public-whatwg-archive@w3.org from November 2011)

From: David Flanagan <dflanagan@mozilla.com>
Date: Thu, 03 Nov 2011 11:13:04 -0700
Message-ID: <4EB2D9B0.6040002@mozilla.com>

On 11/3/11 4:21 AM, Henri Sivonen wrote:
> On Thu, Nov 3, 2011 at 1:57 AM, David Flanagan<dflanagan at mozilla.com>  wrote:
>> Firefox, Chrome and Safari all seem to do the right thing: wait for the next
>> character before tokenizing the CR.
> See http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1247
I hadn't used the live dom viewer before.  That's really useful!

> Firefox tokenizes the CR immediately, emits an LF and then skips over
> the next character if it is an LF. When I designed the solution
> Firefox uses, I believed it was more correct and more compatible with
> legacy than whatever the spec said at the time.
I'm having a Duh! moment... I currently wait for the next character, but 
what you describe also works, and allows the document.write() spec to 
make sense.
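
For illustration, here is a minimal sketch of the "emit LF immediately, then skip a following LF" approach, assuming input arrives in chunks the way successive document.write() calls would deliver it. The class name NewlineNormalizer and its feed() method are hypothetical, not taken from Firefox or the spec.

```python
class NewlineNormalizer:
    """Hypothetical helper: turns CR and CRLF into LF without lookahead."""

    def __init__(self):
        # True right after a CR has been emitted as LF.
        self.skip_next_lf = False

    def feed(self, chunk):
        """Normalize one chunk; state carries over to the next call."""
        out = []
        for ch in chunk:
            if self.skip_next_lf:
                self.skip_next_lf = False
                if ch == "\n":
                    # Second half of a CRLF pair: its LF was already emitted.
                    continue
            if ch == "\r":
                out.append("\n")          # tokenize the CR immediately as LF
                self.skip_next_lf = True  # remember to drop a following LF
            else:
                out.append(ch)
        return "".join(out)


normalizer = NewlineNormalizer()
first = normalizer.feed("a\r")    # CR is the last character of this write
second = normalizer.feed("\nb")   # LF arrives in the next write and is skipped
```

Because the CR never has to wait for its successor, a write that ends in "\r" can be processed completely before the next write arrives.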

> Chrome seems to wait for the next character before tokenizing the CR.
>
>> And I think this means that the description of document.write needs to be changed.
> All along, I've felt that having U+0000 and CRLF handling as a
> stream preprocessing step was bogus and both should happen upon
> tokenization. So far, I've managed to convince Hixie about U+0000
> handling.
Each tokenizer state would have to add a rule for CR that said "emit 
LF, save the current tokenizer state, and set the tokenizer state to 
the after CR state".  Actually, tokenizer states that already have a rule 
for LF or whitespace would have to integrate this CR rule into that 
rule.  The new after CR state would then have two rules: on LF it would 
skip the character and restore the saved state; on anything else it 
would push the character back and restore the saved state.
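
The rules above can be sketched as a toy state machine. This is only an illustration under simplifying assumptions: DATA stands in for every ordinary tokenizer state, the "tokens" are just characters, and the names tokenize, DATA and AFTER_CR are made up for the example.

```python
DATA = "data"          # stand-in for any ordinary tokenizer state
AFTER_CR = "after CR"  # the extra state described above


def tokenize(chars):
    """Emit characters, handling CR via a saved state and an after CR state."""
    state = DATA
    saved_state = None
    emitted = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if state == AFTER_CR:
            state = saved_state  # both rules restore the saved state
            if ch == "\n":
                i += 1           # rule 1: skip the LF of a CRLF pair
            # rule 2: "push back" by leaving i unchanged, so ch is
            # reprocessed in the restored state
            continue
        if ch == "\r":
            emitted.append("\n")  # emit LF for the CR right away
            saved_state = state
            state = AFTER_CR
        else:
            emitted.append(ch)
        i += 1
    return "".join(emitted)


result = tokenize("a\r\nb\rc\n")
```

In a real tokenizer the CR rule would be repeated (or merged into the existing LF/whitespace rule) in each state; the sketch collapses all of those into the single DATA branch.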

>> Similarly, what should the tokenizer do if the document.write emits half of
>> a UTF-16 surrogate pair as the last character?
> The parser operates on UTF-16 code units, so a lone surrogate is emitted.

The spec seems pretty unambiguous that it operates on codepoints (though 
I implemented mine using 16-bit code units).  §13.2.1: "The input to the 
HTML parsing process consists of a stream of Unicode code points".  Also 
§13.2.2.3 includes a list of codepoints beyond the BMP that are parse 
errors.  And finally, the tests in 
http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test 
require unpaired surrogates to be converted to the U+FFFD replacement 
character.  (Though my experience is that modifying my tokenizer to pass 
those tests causes other tests to fail, which makes me wonder whether 
unpaired surrogates are only supposed to be replaced in some but not all 
tokenizer states.)
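
As a rough sketch of what those tests ask for, assuming the input is held as a list of 16-bit code units: valid surrogate pairs are combined into one code point and any unpaired surrogate becomes U+FFFD. The function name replace_lone_surrogates is hypothetical. The sketch treats the end of the list as the end of input, so it does not cover a document.write() that ends on a high surrogate whose partner arrives in a later write, and it says nothing about which tokenizer states the replacement should apply in.

```python
def replace_lone_surrogates(units):
    """Convert UTF-16 code units (ints) to code points, using U+FFFD
    for any surrogate that is not part of a valid pair."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine a high/low pair into one supplementary code point.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append(0xFFFD)  # unpaired surrogate
            i += 1
        else:
            out.append(u)       # ordinary BMP code unit
            i += 1
    return out


paired = replace_lone_surrogates([0x61, 0xD83D, 0xDE00])
lone = replace_lone_surrogates([0xD83D])
```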
Thanks, Henri!

     David

Received on Thursday, 3 November 2011 11:13:04 UTC