
[whatwg] Never bring a regular expression knife to a turing complete gunfight. (Was: Allowing ">" in attribute values)

From: Nils Dagsson Moskopp <nils-dagsson-moskopp@dieweltistgarnichtso.net>
Date: Thu, 24 Jun 2010 18:15:01 +0200
Message-ID: <20100624181501.3983f2b2@desudesudesu>
"Benjamin M. Schwartz" <bmschwar at fas.harvard.edu> wrote on Thu, 24
Jun 2010 11:20:10 -0400:

> Worldwide, regarding HTML, I'm sure there is 100 times more regular
> expression processing code than full-on lexing code.  Most code that
> processes HTML is embedded in scripts, doing some small
> special-purpose operation.

Regular expressions can only parse strings written in a regular language.
That means no arbitrarily nested elements. And if you want to handle a
finite number of nesting levels, you have to write one regex for each
level. This requirement alone should kill most general-purpose uses.
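
To make the nesting point concrete, here is a minimal sketch in Python. The
pattern, the sample strings and the helper div_depth are made up for
illustration and are not taken from anyone's real code. It shows a
single-level regex handling a flat element and silently returning the wrong
body for a nested one, and it shows that tracking nesting needs a counter,
which is state a plain regular expression does not have.

```python
import re

# Naive single-level pattern: "<div>", the shortest possible body, "</div>".
DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)

flat = "<div>one</div>"
nested = "<div>outer <div>inner</div> tail</div>"

# Works for the flat case.
flat_body = DIV.search(flat).group(1)

# For the nested case the match stops at the first "</div>", which closes
# the inner element, so the captured "body" is cut off in the middle.
nested_body = DIV.search(nested).group(1)


def div_depth(html):
    """Return the deepest <div> nesting level (hypothetical helper).

    The regex is used only to find the tags; the nesting itself is
    tracked with an explicit counter.
    """
    depth = deepest = 0
    for tag in re.findall(r"</?div>", html):
        depth += -1 if tag.startswith("</") else 1
        deepest = max(deepest, depth)
    return deepest


print(repr(flat_body))
print(repr(nested_body))
print(div_depth(nested))
```

The regex here still does useful work as a tokenizer; it is only the
matching of opening to closing tags that it cannot do on its own.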

Specially crafted scripts, on the other hand, are easily adapted.

> Those regular expressions aren't going
> away.  Helping them break less is a noble cause.

I would argue exactly the opposite: making stupid invalid parsers break
more would encourage people to use the right facilities. And we are not
even talking about achieving that: ">" in attribute values has been
allowed for years, and it is here to stay.
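
As a rough illustration of what "the right facilities" buys you, the sketch
below compares a naive tag regex with Python's standard html.parser module
on an attribute value that contains ">". The sample markup and the class
name AttrCollector are assumptions made up for this example.

```python
import re
from html.parser import HTMLParser

snippet = '<a title="a > b" href="/x">link</a>'

# Naive tag pattern: everything from "<" up to the first ">".
# It stops inside the quoted title value and never sees href.
naive_tag = re.search(r"<[^>]*>", snippet).group(0)


class AttrCollector(HTMLParser):
    """Collect (tag, attributes) pairs for every start tag (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.attrs.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed(snippet)
collector.close()

print(repr(naive_tag))
print(collector.attrs)
```

The parser knows it is inside a quoted attribute value when it meets the
">", which is exactly the context the regex lacks.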

May Zalgo have mercy on your regexes!
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

-- 
Nils Dagsson Moskopp // erlehmann
<http://dieweltistgarnichtso.net>
Received on Thursday, 24 June 2010 09:15:01 UTC
