[whatwg] Byte-wise tokenization algorithm from Ian Hickson on 2008-12-21 (public-whatwg-archive@w3.org from December 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Sun, 21 Dec 2008 05:41:33 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0812210537590.30197@hixie.dreamhostps.com>

On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>
> I am currently working on a PHP5 implementation of the HTML5 
> specification. PHP has abysmal Unicode support, and implementing Unicode 
> streams in userspace may be unacceptably slow. Thus, my questions:
> 
> 1. Given an input stream that is known to be valid UTF-8, is it possible 
> to implement the tokenization algorithm with byte-wise operations only? 
> I think it's possible, since all of the character matching parts of the 
> algorithm map to characters in ASCII space.

Yes. (At least, that's the intent; if you find anything that contradicts 
that, please let me know.)
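
To illustrate why this works, here is a minimal sketch in Python. It is not 
the spec's tokenizer, only a toy splitter; the function name split_on_tags 
and the sample input are hypothetical. It relies on one property of UTF-8: 
every byte of a multi-byte sequence is 0x80 or above, so searching for an 
ASCII byte such as "<" can never match in the middle of a non-ASCII 
character.

```python
def split_on_tags(data: bytes):
    """Split UTF-8 bytes into ("tag", ...) and ("text", ...) chunks.

    Toy illustration only: it looks at single bytes and never decodes.
    The real tokenization algorithm has many more states than this.
    """
    chunks = []
    pos = 0
    while pos < len(data):
        if data[pos] == 0x3C:  # ASCII '<'
            end = data.find(b">", pos)
            if end == -1:
                end = len(data) - 1  # unterminated tag: take the rest
            chunks.append(("tag", data[pos:end + 1]))
            pos = end + 1
        else:
            nxt = data.find(b"<", pos)
            if nxt == -1:
                nxt = len(data)
            chunks.append(("text", data[pos:nxt]))
            pos = nxt
    return chunks


# Hypothetical sample containing two non-ASCII characters.
sample = "<p>caf\u00e9 \u263a</p>".encode("utf-8")
chunks = split_on_tags(sample)

# Every byte of a non-ASCII character is outside the ASCII range.
non_ascii_bytes = "\u00e9\u263a".encode("utf-8")
all_high = all(b >= 0x80 for b in non_ascii_bytes)
```

The text chunk comes back as raw bytes with the multi-byte characters 
intact, so it can be passed along or decoded later without the splitter 
ever needing to know where character boundaries are.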


> 2. Would such an implementation be conforming?

Looking just at parsing, yes, probably... But an HTML5 implementation, 
according to the spec, must at a minimum support the UTF-8 and 
Windows-1252 encodings, so the overall implementation might not be, 
depending on exactly how this is done.
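
One possible way to keep a byte-wise tokenizer while still accepting both 
encodings is to transcode everything to UTF-8 up front. The sketch below 
is an assumption about how that might look, not something the spec 
prescribes; the function name to_utf8 is hypothetical. Note that Python's 
cp1252 codec leaves five byte values undefined, and this sketch simply 
replaces them with U+FFFD, which may not match what a browser does.

```python
def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Return the input as UTF-8 bytes so later stages can work byte-wise.

    Assumes that input labelled UTF-8 has already been validated.
    """
    label = encoding.strip().lower()
    if label in ("utf-8", "utf8"):
        return raw  # already UTF-8: pass through untouched
    if label in ("windows-1252", "cp1252"):
        # Undefined bytes become U+FFFD (an assumption of this sketch).
        return raw.decode("cp1252", errors="replace").encode("utf-8")
    raise ValueError("unsupported encoding: " + encoding)


# 0xE9 is "e with acute" in Windows-1252; in UTF-8 it is two bytes.
converted = to_utf8(b"caf\xe9", "windows-1252")
```

After this step the tokenizer only ever sees UTF-8, so the byte-wise 
matching of ASCII characters stays safe regardless of the original 
encoding.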

HTH,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Saturday, 20 December 2008 21:41:33 UTC