
[whatwg] Byte-wise tokenization algorithm

From: Edward Z. Yang <edwardzyang@thewritingpot.com>
Date: Sat, 20 Dec 2008 23:32:01 -0500
Message-ID: <494DC6C1.9000908@thewritingpot.com>
I am currently working on a PHP5 implementation of the HTML5
specification. PHP has abysmal Unicode support, and implementing Unicode
streams in userspace may be unacceptably slow. Thus, my questions:

1. Given an input stream that is known to be valid UTF-8, is it possible
to implement the tokenization algorithm with byte-wise operations only?
I think it's possible, since every character the algorithm explicitly
matches against is in the ASCII range.

2. Would such an implementation be conforming?
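
To make question 1 concrete, here is a minimal sketch of the idea, written in Python rather than PHP purely for illustration. It relies on a property of UTF-8: every byte of a multi-byte sequence is 0x80 or higher, so an ASCII byte such as "<" or "&" can never appear inside another character. The function name split_on_ascii and the delimiter set are hypothetical and far simpler than the real tokenizer states; this only shows that scanning byte by byte for ASCII characters leaves the non-ASCII text intact.

```python
def split_on_ascii(data: bytes, delimiters: bytes = b"<>&") -> list:
    """Split UTF-8 bytes at ASCII delimiter bytes.

    Each delimiter becomes its own one-byte chunk; everything between
    delimiters is passed through untouched, without decoding it.
    (Hypothetical helper, not part of any spec or library.)
    """
    chunks = []
    start = 0
    for i, byte in enumerate(data):
        # Iterating over bytes yields ints; `int in bytes` tests membership.
        if byte in delimiters:
            if start < i:
                chunks.append(data[start:i])
            chunks.append(data[i:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


# Input is assumed to be valid UTF-8, as in question 1.
sample = "<p>naïve café &amp; 日本語</p>".encode("utf-8")
chunks = split_on_ascii(sample)

# Every chunk still decodes on its own, so no multi-byte character was cut.
decoded = [chunk.decode("utf-8") for chunk in chunks]
```

Because the split points are always ASCII bytes, each chunk is itself valid UTF-8, and the chunks concatenate back to the original input. Whether the same holds for every step of the actual tokenization algorithm is exactly what I am asking.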

Cheers,
Edward
Received on Saturday, 20 December 2008 20:32:01 UTC
