- From: Ian Hickson <ian@hixie.ch>
- Date: Mon, 13 Feb 2012 22:56:36 +0000 (UTC)
On Mon, 19 Dec 2011, Henri Sivonen wrote:
> On Wed, Dec 14, 2011 at 2:00 AM, Ian Hickson <ian at hixie.ch> wrote:
> > I can remove the text "one at a time", if you like. Would that be
> > satisfactory? Or I guess I could change the spec to say that the
> > parser should process the characters, rather than the tokenizer, since
> > really it's the whole shebang that needs to be involved (stream
> > preprocessor and everything). Any opinions on what the right text is
> > here?
>
> I'd like the CRLF preprocessing to be defined as an eager stateful
> operation so that there's one bit of state: "last was CR". Then, input
> is handled as follows:
> If the input character is CR, set "last was CR" to true and emit LF.
> If the input character is LF and "last was CR" is true, don't emit
> anything and set "last was CR" to false.
> If the input character is LF and "last was CR" is false, emit LF.
> Else set "last was CR" to false and emit the input character.

I've done something like this (but simpler to spec). I've also done the
second change I suggest above.

On Thu, 3 Nov 2011, David Flanagan wrote:
>
> The spec seems pretty unambiguous that it operates on codepoints (though
> I implemented mine using 16-bit code units). §13.2.1: "The input to
> the HTML parsing process consists of a stream of Unicode code points".
> Also §13.2.2.3 includes a list of codepoints beyond the BMP that are
> parse errors. And finally, the tests in
> http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test
> require unpaired surrogates to be converted to the U+FFFD replacement
> character. (Though my experience is that modifying my tokenizer to pass
> those tests causes other tests to fail, which makes me wonder whether
> unpaired surrogates are only supposed to be replaced in some but not all
> tokenizer states)

This has changed a bit.
In particular, "Unicode code point" is currently defined in a way that is
(in theory) black-box indistinguishable from UTF-16 handling, but without
making "astral characters" into second-class citizens.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 13 February 2012 14:56:36 UTC
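
For readers following the thread, the four rules Henri lists for CR/LF preprocessing can be sketched as below. This is only an illustration of his proposal as quoted, not the wording that went into the spec (Ian says he did "something like this (but simpler to spec)"). The class name `NewlineNormalizer` and its `feed` method are hypothetical names chosen for this sketch.

```python
class NewlineNormalizer:
    """Eager CR/LF normalization with a single bit of state ("last was CR").

    Hypothetical helper illustrating the rules quoted in the thread;
    it is not taken from the spec or from any parser implementation.
    """

    def __init__(self):
        self.last_was_cr = False  # the one bit of state

    def feed(self, chunk):
        """Normalize one chunk of input and return the emitted characters."""
        out = []
        for ch in chunk:
            if ch == "\r":
                # CR: remember it and emit LF right away (eager).
                self.last_was_cr = True
                out.append("\n")
            elif ch == "\n":
                # LF right after CR was already emitted as LF, so emit nothing;
                # otherwise emit the LF. Either way the bit is cleared.
                if not self.last_was_cr:
                    out.append("\n")
                self.last_was_cr = False
            else:
                # Any other character clears the bit and passes through.
                self.last_was_cr = False
                out.append(ch)
        return "".join(out)


if __name__ == "__main__":
    n = NewlineNormalizer()
    # A CRLF pair split across two chunks still yields a single LF.
    print(repr(n.feed("a\r") + n.feed("\nb")))
```

Because the LF for a CR is emitted as soon as the CR is seen, no lookahead is needed, and a CRLF pair that straddles two network chunks is still handled with only the one bit carried between calls.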