[whatwg] Allowing ">" in attribute values from David Workman on 2010-06-25 (public-whatwg-archive@w3.org from June 2010)

From: David Workman <workmad3@gmail.com>
Date: Fri, 25 Jun 2010 10:48:03 +0100
Message-ID: <AANLkTimEr_wET-eSVz5Va5Usw9a40gasHC-g6uVToNxO@mail.gmail.com>

I disagree. There are so many other things you need to take into account if
you were (for example) getting all the text out of an HTML document. Text
and markup in comment nodes would throw a spanner in the works for
starters.

It all boils down to the fact that the only thing disallowing ">" in
attribute values does is simplify regex scanning of HTML (which *isn't*
parsing). Seeing as regex parsing of HTML is wrong in so many ways, and
isn't something that should be (IMO) encouraged in the slightest, I don't
see any reason to change the allowance for ">" characters in attributes.
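
To make the point concrete, here is a minimal sketch in Python, assuming
only the standard library. The sample strings and the names
strip_tags_naive, TextCollector and strip_tags_parsed are made up for
illustration and come from nobody's real code. It compares the
'search for "<" and then ">"' scan with a real tokenizer on an attribute
containing ">" and on a comment containing markup.

```python
import re
from html.parser import HTMLParser

# Hypothetical inputs: one attribute value containing ">", one comment
# containing markup.
SAMPLE_ATTR = '<p title="a > b">Hello <b>world</b></p>'
SAMPLE_COMMENT = '<p>Visible</p><!-- <b>hidden</b> -->'


def strip_tags_naive(html):
    """Delete everything from each "<" up to the next ">" (the fast scan)."""
    return re.sub(r"<[^>]*>", "", html)


class TextCollector(HTMLParser):
    """Collect text nodes only; tags and comments are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_parsed(html):
    """Extract text with the standard-library HTML tokenizer."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    return "".join(collector.chunks)


for sample in (SAMPLE_ATTR, SAMPLE_COMMENT):
    print(repr(strip_tags_naive(sample)), "vs", repr(strip_tags_parsed(sample)))
```

The naive scan leaks part of the attribute value in the first case and the
comment body in the second, so forbidding ">" in attributes would only fix
one of the two.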

David W.

On 25 June 2010 10:46, Skrol29 <skrol29forum+whatwg at gmail.com> wrote:

> On 24 Jun 2010, at 14:11, Benjamin M. Schwartz wrote:
>
> >>> Why would it simplify parsing?
>
> >> It greatly simplifies parsing when you just want to extract entire
> >> tags, without immediately parsing the attributes.
>
> >If you mean "parsing" with regular expressions, then I think that's a bad
> >practice and shouldn't be encouraged.
>
> I agree that disallowing ">" chars in attributes greatly simplifies
> parsing. Not only with regular expressions, but any parsing.
> If ">" is allowed, then in order to find the end of the element you have
> to read all of its attributes first. This is very costly. Just one
> example, but there are many others: let's imagine you'd like to convert
> an HTML document into flat text. To simplify your algorithm you've chosen
> to retrieve the content of the <body> element and then to delete all
> elements in it. This is very fast if ">" is not allowed in attributes,
> because you're able to find element bounds just by searching for "<" and
> then ">". But if ">" is allowed, the operation gets much more
> complicated, and you spend much more time scanning all elements.
>
> In my opinion, the gain from allowing ">" is so small compared to the
> trouble it causes that it should be forbidden in both XML and HTML (any
> version).
>
> > Also take into consideration that even if ">" was forbidden in the
> > spec, it wouldn't mean it doesn't happen in the wild. Since it works in
> > browsers, you'd still have to support it if you wanted to parse markup
> > from the web.
>
> Allowing it in the spec and how the browser should behave if it appears
> anyway are two different things.
>
> Regards,
> Skrol29
>
>

Received on Friday, 25 June 2010 02:48:03 UTC