[whatwg] Byte-wise tokenization algorithm

From: Ian Hickson <ian@hixie.ch>
Date: Sun, 21 Dec 2008 05:41:33 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0812210537590.30197@hixie.dreamhostps.com>
On Sat, 20 Dec 2008, Edward Z. Yang wrote:
> I am currently working on a PHP5 implementation of the HTML5 
> specification. PHP has abysmal Unicode support, and implementing Unicode 
> streams in userspace may be unacceptably slow. Thus, my questions:
> 1. Given an input stream that is known to be valid UTF-8, is it possible 
> to implement the tokenization algorithm with byte-wise operations only? 
> I think it's possible, since all of the character matching parts of the 
> algorithm map to characters in ASCII space.

Yes. (At least, that's the intent; if you find anything that contradicts 
that, please let me know.)
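
A minimal sketch of why byte-wise matching is safe on valid UTF-8: every byte of a multi-byte UTF-8 sequence has its high bit set (0x80 or above), so a byte such as 0x3C ("<") can only ever be a real ASCII character. The function name split_on_tags and its two chunk kinds are hypothetical, and this is a toy splitter for illustration only, not the HTML5 tokenization algorithm.

```python
def split_on_tags(data: bytes):
    """Yield ("tag", bytes) and ("text", bytes) chunks using byte operations only.

    Assumes `data` is valid UTF-8. No decoding is performed: bytes >= 0x80
    are passed through untouched as part of text chunks.
    """
    LT = 0x3C  # ASCII "<"
    pos = 0
    n = len(data)
    while pos < n:
        if data[pos] == LT:
            # Scan forward for the closing ">" (0x3E); an unterminated tag
            # simply runs to the end of the input in this toy version.
            end = data.find(b">", pos)
            if end == -1:
                end = n - 1
            yield ("tag", data[pos:end + 1])
            pos = end + 1
        else:
            # Everything up to the next "<" is text, including any
            # multi-byte UTF-8 sequences.
            nxt = data.find(b"<", pos)
            if nxt == -1:
                nxt = n
            yield ("text", data[pos:nxt])
            pos = nxt


sample = "<p>h\u00e9llo \u263a</p>".encode("utf-8")
chunks = list(split_on_tags(sample))
```

Because the split points always fall on ASCII bytes, each text chunk remains valid UTF-8 and can be decoded later (or never) without the scanner knowing anything about code points.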

> 2. Would such an implementation be conforming?

Looking just at parsing, yes, probably... But an HTML5 implementation, 
according to the spec, must at a minimum support the UTF-8 and 
Windows-1252 encodings, so the overall implementation might not be, depending 
on exactly how this is done.
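
One possible way to reconcile the two requirements, sketched under assumptions: transcode Windows-1252 input to UTF-8 up front, then hand the result to a byte-wise tokenizer. The function name windows1252_to_utf8 is hypothetical. Python's "cp1252" codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined; passing them through as the matching C1 control code points is an assumption made here, not something stated in this thread.

```python
def windows1252_to_utf8(data: bytes) -> bytes:
    """Transcode Windows-1252 bytes to UTF-8 bytes.

    Assumption: bytes that Python's cp1252 codec treats as undefined are
    mapped to the code point with the same numeric value.
    """
    chars = []
    for b in data:
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(b))  # assumed fallback for undefined bytes
    return "".join(chars).encode("utf-8")


converted = windows1252_to_utf8(b"\x93hi\x94")
```

After this step the tokenizer only ever sees UTF-8, so the ASCII-only matching argument above still applies.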

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Saturday, 20 December 2008 21:41:33 UTC
