[whatwg] Providing Better Tools from Michel Fortin on 2006-12-04 (public-whatwg-archive@w3.org from December 2006)

From: Michel Fortin <michel.fortin@michelf.com>
Date: Sun, 3 Dec 2006 20:10:34 -0500
Message-ID: <B2D51A52-07E6-4427-9D8E-3232F248E3E7@michelf.com>

On 3 Dec 2006, at 17:04, J. King wrote:

> I am.  It's not anywhere near finished yet, but the parser so far  
> goes through the whole document and spits out the appropriate  
> tokens; I just haven't done anything with said tokens yet, mainly  
> because I was discouraged by PHP's DOM implementation.
> My parser is also slow as molasses, unfortunately.

My experience optimizing PHP Markdown, and building the custom mixed  
Markdown/HTML-block pseudo-tokenizer of PHP Markdown Extra, tells me  
that it'll probably stay very slow as long as the implementation is  
made of PHP code.

Assuming you've implemented the algorithm in the spec as PHP code,  
you could probably make it faster by using regular expressions in the  
tokenization steps instead of iterating character by character. For  
instance, you could implement many of the tokenizer states by  
matching from the start of a string with a regex. And maybe then  
it'll also be possible to combine a couple of states within the same  
regex too.
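
As a rough illustration of the idea, here is a minimal sketch, written
in Python rather than PHP only to keep it short and easy to run. The
names DATA, TAG and tokenize are hypothetical, and the patterns are
deliberately simplified: they ignore attributes containing ">",
comments, character references and most of what the spec's tokenizer
handles. The point is only that one anchored match can consume a whole
run of text, and that the "tag open" and "tag name" states can be
folded into a single pattern. In PHP, the same approach should be
possible with preg_match and its offset argument.

```python
import re

# Hypothetical, simplified patterns; they do not cover the full spec tokenizer.
DATA = re.compile(r"[^<&]+")  # data state: consume a whole text run at once
TAG = re.compile(r"<(/?)([A-Za-z][^\s/>]*)[^>]*>")  # tag open + tag name combined


def tokenize(html):
    """Return a list of (kind, value) tokens using matches anchored at pos."""
    tokens = []
    pos = 0
    while pos < len(html):
        m = DATA.match(html, pos)
        if m:
            tokens.append(("text", m.group()))
            pos = m.end()
            continue
        m = TAG.match(html, pos)
        if m:
            kind = "end_tag" if m.group(1) else "start_tag"
            tokens.append((kind, m.group(2).lower()))
            pos = m.end()
            continue
        # Neither pattern applies (stray "<" or "&"): emit one character as text.
        tokens.append(("text", html[pos]))
        pos += 1
    return tokens
```

Each loop iteration consumes as much input as one pattern allows, so
the interpreted loop runs once per token instead of once per character.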

The more we replace PHP code with regular expressions, the faster it'll  
go, but the further we deviate from the processing algorithm described  
in the spec. I wonder how far we could go while keeping the exact same  
behaviour.
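
One way to keep an eye on that would be to retain the
character-by-character version of each state as a reference and compare
it against its regex replacement on sample inputs. The sketch below,
again in Python for brevity, does this for a single simplified
"consume text until <" step. The function names are hypothetical, and
passing such a check on a handful of samples suggests equivalence but
does not prove it.

```python
import re

TEXT_RUN = re.compile(r"[^<]*")  # zero or more characters that are not "<"


def text_run_by_char(s, pos):
    """Reference version: advance one character at a time until '<'."""
    end = pos
    while end < len(s) and s[end] != "<":
        end += 1
    return s[pos:end], end


def text_run_by_regex(s, pos):
    """Candidate replacement: one anchored regex match from pos."""
    m = TEXT_RUN.match(s, pos)
    return m.group(), m.end()


def same_behaviour(samples):
    """Compare both versions at every start offset of every sample."""
    return all(
        text_run_by_char(s, p) == text_run_by_regex(s, p)
        for s in samples
        for p in range(len(s) + 1)
    )
```

Running same_behaviour over a corpus of real-world pages would give
some confidence that a given regex has not changed the outcome.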

The truly good solution would be to have a parser implemented in C and  
available through every standard installation of PHP. It could be  
used by other languages too.


Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/

Received on Sunday, 3 December 2006 17:10:34 UTC