- From: Philip Taylor <excors+whatwg@gmail.com>
- Date: Sun, 21 Dec 2008 17:18:32 +0000
On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson <ian at hixie.ch> wrote:
> On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>>
>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)

I think there are some cases where it should still work but you might have to be a little careful - e.g. "<table>foo" notionally results in three parse errors according to the spec (one for each character token which gets foster-parented), so "<table>" followed by a single character that UTF-8 encodes as three bytes results in one parse error if you work with Unicode characters but three if you treat each byte as a separate character token.

But in practice, tokenisers emit sequence-of-many-characters tokens instead of single-character tokens, so they only emit one parse error for "<table>foo", and the html5lib test cases assume that behaviour. It should work identically if you emit sequence-of-many-bytes tokens instead.

(Apparently only the distinction between zero and more-than-zero parse errors is important as far as the spec is concerned, since that affects whether the document is conforming; but it seems useful for implementors to share test cases that are precise about exactly where all the parse errors are emitted, since that helps find bugs, and so the parse error count is relevant.)

-- 
Philip Taylor
excors at gmail.com
Received on Sunday, 21 December 2008 09:18:32 UTC
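To illustrate the point about token granularity, here is a minimal sketch (hypothetical helper names, not html5lib's API, and a deliberately simplified model of the "in table" insertion mode) showing how the parse-error count depends on whether character data reaches the tree builder as single-character/byte tokens or as one sequence token:

```python
def parse_errors_in_table(character_tokens):
    """Count parse errors for character tokens seen in table context,
    where non-whitespace character data gets foster-parented.

    Each token is a str or bytes run; the tokeniser decides how the
    input is batched into tokens (simplified model, not the full spec)."""
    errors = 0
    for token in character_tokens:
        if token.strip():       # non-whitespace data triggers foster-parenting
            errors += 1         # one parse error per *token*, however long
    return errors


# Character-at-a-time tokens: three errors for the "foo" in "<table>foo".
assert parse_errors_in_table(["f", "o", "o"]) == 3

# One sequence-of-characters token, as real tokenisers emit: one error.
assert parse_errors_in_table(["foo"]) == 1

# Byte-wise handling of a single three-byte UTF-8 character
# (U+2603 chosen arbitrarily here): three errors if each byte becomes
# its own token, one if the bytes stay together in one token.
snowman = "\u2603".encode("utf-8")            # b'\xe2\x98\x83'
assert parse_errors_in_table([bytes([b]) for b in snowman]) == 3
assert parse_errors_in_table([snowman]) == 1
```

With sequence-of-bytes tokens the byte-wise implementation reports the same single parse error as a character-wise one, which is the behaviour the html5lib test cases assume.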