- From: Ian Hickson <ian@hixie.ch>
- Date: Sun, 21 Dec 2008 05:41:33 +0000 (UTC)
On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>
> I am currently working on a PHP5 implementation of the HTML5
> specification. PHP has abysmal Unicode support, and implementing Unicode
> streams in userspace may be unacceptably slow. Thus, my questions:
>
> 1. Given an input stream that is known to be valid UTF-8, is it possible
> to implement the tokenization algorithm with byte-wise operations only?
> I think it's possible, since all of the character matching parts of the
> algorithm map to characters in ASCII space.

Yes. (At least, that's the intent; if you find anything that contradicts
that, please let me know.)

> 2. Would such an implementation be conforming?

Looking just at parsing, yes, probably... But an HTML5 implementation,
according to the spec, must at a minimum support the UTF-8 and
Windows-1252 encodings, so the overall implementation might not be,
depending on exactly how this is done.

HTH,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Saturday, 20 December 2008 21:41:33 UTC
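
The reasoning in the exchange above (every character the tokenizer matches is ASCII, so valid UTF-8 can be scanned byte by byte) can be illustrated with a rough sketch. This is not the HTML5 tokenization algorithm, only a toy splitter that looks for "<" and ">"; the function name split_tags and the sample input are hypothetical. It relies on a property of UTF-8: every byte of a multi-byte sequence is 0x80 or higher, so an ASCII byte such as 0x3C can never appear inside an encoded non-ASCII character.

```python
def split_tags(data: bytes):
    """Split valid UTF-8 bytes into ('tag', ...) and ('text', ...) chunks.

    Hypothetical helper for illustration only; it works purely on bytes
    and never decodes the input.
    """
    LT = 0x3C  # ASCII '<'
    chunks = []
    start = 0  # start of the pending text run
    i = 0
    n = len(data)
    while i < n:
        if data[i] == LT:
            end = data.find(b">", i + 1)
            if end == -1:
                break  # unterminated tag: treat the remainder as text
            if i > start:
                chunks.append(("text", data[start:i]))
            chunks.append(("tag", data[i:end + 1]))
            i = end + 1
            start = i
        else:
            i += 1
    if start < n:
        chunks.append(("text", data[start:]))
    return chunks


# Sample input containing U+263A and U+1047E (a 3-byte and a 4-byte sequence).
sample = "<p>h\u00e9llo \u263a \U0001047e</p>".encode("utf-8")
chunks = split_tags(sample)
```

Because the split points always fall on ASCII bytes, each chunk is itself valid UTF-8 and can be decoded later (or never) without the scanner knowing anything about code points. Supporting Windows-1252 as well, which the reply notes is also required, would need a separate transcoding step in front of a scanner like this.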