Re: [whatwg] Should ambiguous ampersand be a parse error? from Boris Zbarsky on 2013-12-10 (public-whatwg-archive@w3.org from December 2013)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Tue, 10 Dec 2013 12:45:16 -0500
To: whatwg@lists.whatwg.org
Message-ID: <52A7532C.3080801@mit.edu>

On 12/10/13 11:11 AM, Peter Cashin wrote:
> The HTML5 spec says that an ambiguous ampersand (e.g. &something; undefined) is not allowed in element content

Right, that's an authoring requirement.

> and in section on HTML parsing, that this should throw a parse error.

There is no throwing of parse errors in the HTML spec.

I assume you're looking at the "anything else" case of
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#consume-a-character-reference
? This says, for the case you're looking at:

If no match can be made, then no characters are consumed, and nothing
is returned. In this case, if the characters after the U+0026
AMPERSAND character (&) consist of a sequence of one or more
alphanumeric ASCII characters followed by a U+003B SEMICOLON
character (;), then this is a parse error.

And if you follow the link to "parse error", it goes to
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error
which basically says that validators need to report parse errors and that
UAs are allowed (but not required) to stop parsing at that point if they
really want. If they do NOT want to abort on the error (which is the
common case, btw), the spec defines how they press on.

And the way they press on is by returning nothing from the "consume a
character reference" algorithm. What that does depends on the caller,
but in the case you're talking about that's presumably
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#character-reference-in-data-state
and what it will do if nothing is returned is emit the '&' and move on
to the next character. So it basically treats the '&' as not special in
any way in this case, leading to the behavior you observe in browsers.
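
To make the "press on" behavior concrete, here is a rough Python sketch of
the flow described above. It is not the spec's algorithm: the names
NAMED_REFS, AMBIGUOUS and tokenize_text are made up for illustration, the
reference table is a tiny stand-in for the real one, and numeric references
and semicolon-less legacy references are left out entirely.

```python
import re

# Hypothetical, tiny stand-in for the spec's named character reference table.
NAMED_REFS = {"amp;": "&", "lt;": "<", "gt;": ">", "copy;": "\u00a9"}

# One or more ASCII alphanumerics followed by ';': the shape that makes an
# unmatched '&' a parse error in the quoted spec text.
AMBIGUOUS = re.compile(r"[A-Za-z0-9]+;")


def tokenize_text(src):
    """Return (emitted_text, parse_error_offsets) for element content."""
    out = []
    errors = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch != "&":
            out.append(ch)
            i += 1
            continue

        rest = src[i + 1:]
        # Try the longest named reference that matches after the '&'.
        match = None
        for name in sorted(NAMED_REFS, key=len, reverse=True):
            if rest.startswith(name):
                match = name
                break

        if match is not None:
            out.append(NAMED_REFS[match])
            i += 1 + len(match)
            continue

        # No match: nothing is consumed and nothing is returned.
        if AMBIGUOUS.match(rest):
            # Parse error: recorded (a validator would report it),
            # but parsing continues.
            errors.append(i)
        # The caller emits the '&' as an ordinary character and moves on.
        out.append("&")
        i += 1

    return "".join(out), errors
```

With this sketch, tokenize_text("a &something; b &amp; c") yields the text
"a &something; b & c" plus one recorded parse error at offset 2, while
tokenize_text("AT&T") yields "AT&T" with no errors, since the '&' is not
followed by alphanumerics and a semicolon.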

> Is the specification intended to have compliant HTML agents stop parsing ambiguous ampersands?

Compliant HTML agents are allowed to do so, I guess, per the technical
rules about parse errors, just like for any other parse error. But I
expect that this is at least partly for conformance classes other than
"browsers"; all browsers press on through parse errors in HTML. Maybe
the allowed behavior for parse errors should be made conditional on
conformance class...

-Boris

Received on Tuesday, 10 December 2013 17:45:44 UTC