Re: Processing of &prod_id= in attributes from Adam Barth on 2010-06-29 (public-html@w3.org from June 2010)

From: Adam Barth <w3c@adambarth.com>
Date: Tue, 29 Jun 2010 14:43:15 -0700
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Maciej Stachowiak <mjs@apple.com>, Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
Message-ID: <AANLkTikdjGsQ7r-CeVzZkiSCkZTPGEEPHjSkhTTOWeSN@mail.gmail.com>

On Tue, Jun 29, 2010 at 2:34 PM, Julian Reschke <julian.reschke@gmx.de> wrote:
> On 29.06.2010 22:39, Adam Barth wrote:
>> On Tue, Jun 29, 2010 at 12:30 PM, Julian Reschke <julian.reschke@gmx.de>
>>  wrote:
>>> On 29.06.2010 21:03, Maciej Stachowiak wrote:
>>>> ...
>>>> I believe the spec matches Minefield and the WebKit behavior is a bug.
>>>>
>>>> The algorithm you cited has this constraint:
>>>>
>>>> "If the character reference is being consumed as part of an attribute,
>>>> and
>>>> the last character matched is not a U+003B SEMICOLON character (;), and
>>>> the
>>>> next character is either a U+003D EQUALS SIGN character (=) or in the
>>>> range
>>>> U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL
>>>> LETTER
>>>> A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A to
>>>> U+007A
>>>> LATIN SMALL LETTER Z, then, for historical reasons, all the characters
>>>> that
>>>> were matched after the U+0026 AMPERSAND character (&) must be
>>>> unconsumed,
>>>> and nothing is returned."
>>>> ...
>>>
>>> Yikes.
>>>
>>> Can somebody translate this into English for me? :-)
>>
>> Sure.
>>
>> It's an unfortunate accident of the world that (1) & is part of the
>> escape sequence for HTML entities, (2) & is a common URL delimiter,
>> and (3) HTML attributes decode HTML entities.  Consequently, many
>
> Yes, nothing unclear about that.
>
>> authors copy and paste & characters into HTML attributes as part of
>> URLs and don't expect the parser to decode HTML entities in their
>> URLs.  This algorithm in the spec catches those cases by not decoding
>> HTML entities if the character after the entity looks like it's more
>> likely to be part of a URL parameter name (or the parameter/value
>> delimiter, "=").
>
> Ok, so that's the rationale. That still doesn't tell me what the actual
> algorithm *is*.
>
> In a separate mail, you wrote yourself:
>
>> Ah, thanks.  I missed that subtlety.
>
> where the "subtlety" is that the table contains both well-formed
> (";"-terminated) *and* truncated strings. It might be helpful to say that
> clearly, or even have two different tables, where the one with the truncated
> strings is clearly labeled as being for compatibility with broken content.

That might be helpful.  Honestly, what would be more helpful would be
to add some of these cases to the HTML5lib test suite.  We've been
adding cases to the WebKit-local version of the test suite as we
implement the algorithm.  When I fix this bug, I'll add some of these
cases too.  Eventually, we'll contribute the WebKit cases back to the
mainline.

The test cases aren't secret or anything, they're just a work in progress:

http://trac.webkit.org/browser/trunk/LayoutTests/html5lib/resources/comments01.dat
http://trac.webkit.org/browser/trunk/LayoutTests/html5lib/resources/doctype01.dat
http://trac.webkit.org/browser/trunk/LayoutTests/html5lib/resources/entities01.dat
http://trac.webkit.org/browser/trunk/LayoutTests/html5lib/resources/entities02.dat
http://trac.webkit.org/browser/trunk/LayoutTests/html5lib/resources/scriptdata01.dat
http://trac.webkit.org/browser/trunk/LayoutTests/html5lib/resources/webkit01.dat

It might be time to start entities03.dat.  Entities are quite complicated.  :)
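
For what it's worth, here is a rough Python sketch of the attribute rule
quoted above. It is only an illustration, not the spec algorithm or the
WebKit code: the ENTITIES table is a tiny hand-picked subset (the real
table is much larger), the function names are made up for this example,
and numeric references like &#38; are not handled at all.

```python
# Tiny illustrative subset of the named character reference table.
# Assumption for this sketch: the real table is far larger and, as noted in
# the thread, contains both ";"-terminated and truncated names.
ENTITIES = {
    "amp;": "&", "amp": "&",
    "lt;": "<", "lt": "<",
    "copy;": "\u00a9", "copy": "\u00a9",
}


def consume_named_reference(text, pos, in_attribute):
    """Hypothetical helper: text[pos] is '&'. Returns (replacement, next_pos)."""
    # Find the longest table entry that matches right after the '&'.
    best = None
    for name in ENTITIES:
        if text.startswith(name, pos + 1):
            if best is None or len(name) > len(best):
                best = name
    if best is None:
        return "&", pos + 1  # no match: the '&' is just a literal '&'

    end = pos + 1 + len(best)
    next_ch = text[end:end + 1]  # empty string at end of input

    # The rule quoted above: inside an attribute, a match without a trailing
    # ';' followed by '=' or an ASCII letter or digit is "unconsumed".
    looks_like_url = next_ch == "=" or (next_ch.isascii() and next_ch.isalnum())
    if in_attribute and not best.endswith(";") and looks_like_url:
        return "&", pos + 1

    return ENTITIES[best], end


def decode(text, in_attribute):
    """Decode named references in text, applying the attribute rule if asked."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "&":
            replacement, i = consume_named_reference(text, i, in_attribute)
            out.append(replacement)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this sketch, decode("?a=1&copy=2", True) leaves the string alone
because the "=" after "copy" triggers the unconsume step, while the same
input with in_attribute=False decodes "&copy" to the copyright sign.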

Adam

Received on Tuesday, 29 June 2010 21:44:05 UTC