[whatwg] Byte-wise tokenization algorithm from Edward Z. Yang on 2008-12-21 (public-whatwg-archive@w3.org from December 2008)

From: Edward Z. Yang <edwardzyang@thewritingpot.com>
Date: Sat, 20 Dec 2008 23:32:01 -0500
Message-ID: <494DC6C1.9000908@thewritingpot.com>

I am currently working on a PHP5 implementation of the HTML5
specification. PHP has abysmal Unicode support, and implementing Unicode
streams in userspace may be unacceptably slow. Thus, my questions:

1. Given an input stream that is known to be valid UTF-8, is it possible
to implement the tokenization algorithm with byte-wise operations only?
I think it's possible, since all of the character matching parts of the
algorithm map to characters in ASCII space.

2. Would such an implementation be conforming?
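
To illustrate the property behind question 1: in valid UTF-8, every byte of
a multi-byte sequence has its high bit set, so a byte in the range 0x00-0x7F
can only ever be an ASCII character. The sketch below is in Python rather
than PHP, purely for illustration; the function names and the sample string
are made up for this example, and it only checks the '<' delimiter rather
than implementing any tokenizer state. It says nothing about conformance.

```python
def find_tag_opens_bytewise(data: bytes) -> list[int]:
    """Return the byte offset of every '<' (0x3C) in UTF-8 encoded input."""
    return [i for i, b in enumerate(data) if b == 0x3C]


def find_tag_opens_charwise(text: str) -> list[int]:
    """Return the same byte offsets, computed by walking decoded characters."""
    offsets = []
    pos = 0  # running byte offset of the current character
    for ch in text:
        if ch == "<":
            offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    return offsets


# Hypothetical sample mixing ASCII markup with 2-, 3- and 4-byte characters.
sample = "<p>caf\u00e9 \u65e5\u672c\u8a9e <b>\U0001F600</b></p>"
encoded = sample.encode("utf-8")

# The byte-wise scan agrees with the character-wise scan.
assert find_tag_opens_bytewise(encoded) == find_tag_opens_charwise(sample)

# No byte belonging to a non-ASCII character falls in the ASCII range.
assert all(
    b >= 0x80
    for ch in sample
    if ord(ch) > 0x7F
    for b in ch.encode("utf-8")
)
```

The same comparison would apply to any other ASCII delimiter the tokenizer
matches on; text between delimiters can be passed through as opaque bytes.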

Cheers,
Edward

Received on Saturday, 20 December 2008 20:32:01 UTC