
[whatwg] Providing Better Tools

From: J. King <jking@dark-phantasy.com>
Date: Sun, 03 Dec 2006 22:57:07 -0500
Message-ID: <op.tj0nxhyaplbshj@briann>
On Sun, 03 Dec 2006 20:10:34 -0500, Michel Fortin  
<michel.fortin at michelf.com> wrote:

> My experience optimizing PHP Markdown, and building the custom mixed  
> Markdown/HTML-block pseudo-tokenizer of PHP Markdown Extra, tells me  
> that it'll probably stay very slow as long as the implementation is made  
> of PHP code.

Yeah, it is.  I'm not much of a programmer, but I thought the algorithm  
too useful not to try and implement.

> Assuming you've implemented the algorithm in the spec as PHP code, you  
> could probably make it faster by using regular expressions in the  
> tokenization steps instead of iterating character by character. For  
> instance, you could implement many of the tokenizer states by matching  
> from the start of a string with a regex. And maybe then it'll also be  
> possible to combine a couple of states within the same regex too.

This is precisely what I've done.  Before I made that optimization, the  
parser would crash more often than not on a document larger than a few  
kilobytes on my machine.

> The more we replace PHP code by regular expressions, the faster it'll  
> go, but the further we deviate from the processing algorithm described in  
> the spec. I wonder how far we could go while keeping the exact same  
> behaviour.

My pattern optimization is pretty simple: when switching states the parser  
first tries matching whatever range of characters will keep the machine in  
the same state, and then acts as normal on the first character that  
doesn't match.  There is, effectively, next to no deviation from the spec  
short of emitting one char token per unbroken string rather than one token  
per character.  Since the tokens are merged into one text node in the tree  
builder anyway, the deviation is essentially nil.
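
A minimal sketch of that run-matching idea, written in Python rather than PHP
purely for illustration. It models only a simplified "data" state in which
"<" and "&" are the characters that leave the state; the names DATA_RUN,
tokenize_runs, tokenize_per_char and merge_text are hypothetical and not taken
from the actual parser. It is meant to show why one token per unbroken string
should be equivalent to one token per character once adjacent character
tokens are merged, not to reproduce the spec's tokenizer.

```python
import re

# Assumption: in this simplified "data" state, any character other than
# "<" or "&" keeps the machine in the same state.
DATA_RUN = re.compile(r"[^<&]+")


def tokenize_runs(text):
    """Emit one ("chars", run) token per unbroken run of ordinary characters."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = DATA_RUN.match(text, pos)
        if match:
            # Consume the whole run that keeps us in the same state.
            tokens.append(("chars", match.group()))
            pos = match.end()
        else:
            # First character that does not match: handle it "as normal".
            tokens.append(("special", text[pos]))
            pos += 1
    return tokens


def tokenize_per_char(text):
    """Reference behaviour: one token per input character."""
    tokens = []
    for ch in text:
        kind = "special" if ch in "<&" else "chars"
        tokens.append((kind, ch))
    return tokens


def merge_text(tokens):
    """Merge adjacent character tokens, as a tree builder would for text nodes."""
    merged = []
    for kind, value in tokens:
        if kind == "chars" and merged and merged[-1][0] == "chars":
            merged[-1] = ("chars", merged[-1][1] + value)
        else:
            merged.append((kind, value))
    return merged
```

Under these assumptions, merge_text(tokenize_runs(s)) and
merge_text(tokenize_per_char(s)) give the same result for any input s, while
the run-based version makes far fewer passes through the interpreted loop.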

> The true good solution would be to have a parser implemented in C and  
> available through every standard installation of PHP. It could be used  
> by other languages too.

I am keeping my fingers crossed, hoping that someone much more  
knowledgeable than I will do this. :)

J. King
Received on Sunday, 3 December 2006 19:57:07 UTC
