- From: Ashley Sheridan <ash@ashleysheridan.co.uk>
- Date: Fri, 25 Jun 2010 13:39:13 +0100
On Fri, 2010-06-25 at 13:28 +0100, Kornel Lesinski wrote:

> > I agree disallowing ">" chars in attributes greatly simplifies parsing. Not
> > only with regular expressions, but any parsing.
> > If ">" are allowed, it means that in order to find the end of the element
> > you have to read all the attributes first. This is very costly.
>
> You just need two extra states in the parser (toggled on " or '). I wouldn't call that "very costly".
>
> > Just an
> > example, but there are many others: let's imagine you'd like to convert an HTML
> > document into flat text. To simplify your algorithm you've chosen to
> > retrieve the content of the <body> element and then to delete all elements
> > in it. This is very fast if ">" are not allowed in attributes because you're
> > able to find element bounds just by searching for "<" and then ">". But if ">"
> > are allowed, the operation gets much more complicated, and you spend much
> > more time scanning all the elements.
>
> Conversion of HTML to text is more complicated than that - e.g. you shouldn't turn foo<br>bar into foobar, but you have to keep foo<b>bar as foobar. Implied <body> is allowed, you should extract <img alt>, you have to decode entities, etc. I think a check for a single character is just a drop in the ocean in such code.
>
> And if you're not concerned about accuracy of conversion, you can ignore the fact that ">" is allowed too. It's just going to be yet another tradeoff among many other, much bigger ones.
>
> > > Also take into consideration that even if ">" was forbidden in the spec,
> > > it wouldn't mean it doesn't happen in the wild. Since it works in browsers,
> > > you'd still have to support it if you wanted to parse markup from the web.
> >
> > Allowing it in the spec and how the browser should behave if it is anyway
> > are two different things.
>
> If you're parsing markup from the web, you have to support invalid markup that browsers accept, not merely pure markup that the spec allows.
>
> There are reasons to disallow ">", but I'm not convinced that parsing performance is one of them.

I think maybe the best reason I've seen for disallowing it is where attributes aren't correctly quoted:

<foo bar="foobar>

which could potentially break everything. At the moment, most browsers treat this as a missing quote, but if > is allowed in the value, they should include the content after the > as part of the value.

Parsing-wise, I don't see it being any more difficult except for very basic parsing methods, and any time difference should be negligible.

Thanks,
Ash
http://www.ashleysheridan.co.uk
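
For illustration only, the "two extra states" idea quoted above can be sketched in a few lines of Python. This is a simplified sketch, not the tokenizer from the spec or from any browser: the function name find_tag_end is made up for this example, and it assumes any " or ' inside a tag opens a quoted attribute value, which a real HTML tokenizer does not assume.

```python
from typing import Optional


def find_tag_end(html: str, start: int) -> int:
    """Return the index of the '>' closing the tag that begins at html[start].

    Hypothetical helper for illustration. Returns -1 if no closing '>' is
    found outside a quoted value (for example, when a quote is never closed).
    """
    # The two extra states: inside a "..." value or inside a '...' value.
    # None means we are outside any quoted value.
    quote: Optional[str] = None
    for i in range(start + 1, len(html)):
        ch = html[i]
        if quote is not None:
            # Inside a quoted value: only the matching quote ends it,
            # so a '>' here is treated as ordinary data.
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == ">":
            return i
    return -1


if __name__ == "__main__":
    well_formed = '<a title="x > y">text'
    print(well_formed.find(">"))          # naive search stops inside the value
    print(find_tag_end(well_formed, 0))   # quote-aware scan
    print(find_tag_end('<foo bar="foobar>', 0))  # unclosed quote
```

On the well-formed input the naive search stops at the ">" inside the attribute value (index 12), while the quote-aware scan reaches the real end of the tag (index 16). On <foo bar="foobar> the scan never finds a closing quote and returns -1, which is the "could potentially break everything" case: everything after the missing quote is swallowed as attribute data.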
Received on Friday, 25 June 2010 05:39:13 UTC