Re: Processing of &prod_id= in attributes from Julian Reschke on 2010-06-29 (public-html@w3.org from June 2010)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Tue, 29 Jun 2010 23:34:12 +0200
To: Adam Barth <w3c@adambarth.com>
CC: Maciej Stachowiak <mjs@apple.com>, Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
Message-ID: <4C2A66D4.7030206@gmx.de>

On 29.06.2010 22:39, Adam Barth wrote:
> On Tue, Jun 29, 2010 at 12:30 PM, Julian Reschke <julian.reschke@gmx.de> wrote:
>> On 29.06.2010 21:03, Maciej Stachowiak wrote:
>>> ...
>>> I believe the spec matches Minefield and the WebKit behavior is a bug.
>>>
>>> The algorithm you cited has this constraint:
>>>
>>> "If the character reference is being consumed as part of an attribute, and
>>> the last character matched is not a U+003B SEMICOLON character (;), and the
>>> next character is either a U+003D EQUALS SIGN character (=) or in the range
>>> U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER
>>> A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A to U+007A
>>> LATIN SMALL LETTER Z, then, for historical reasons, all the characters that
>>> were matched after the U+0026 AMPERSAND character (&) must be unconsumed,
>>> and nothing is returned."
>>> ...
>>
>> Yikes.
>>
>> Can somebody translate this into English for me? :-)
>
> Sure.
>
> It's an unfortunate accident of the world that (1) & is part of the
> escape sequence for HTML entities, (2) & is a common URL delimiter,
> and (3) HTML attributes decode HTML entities.  Consequently, many

Yes, nothing unclear about that.

> authors copy and paste & characters into HTML attributes as part of
> URLs and don't expect the parser to decode HTML entities in their
> URLs.  This algorithm in the spec catches those cases by not decoding
> HTML entities if the character after the entity looks like it's more
> likely to be part of a URL parameter name (or the parameter/value
> delimiter, "=").

Ok, so that's the rationale. That still doesn't tell me what the actual 
algorithm *is*.

In a separate mail, you wrote yourself:

> Ah, thanks.  I missed that subtlety.

where the "subtlety" is that the table contains both well-formed 
(";"-terminated) *and* truncated strings. It might be helpful to say 
that clearly, or even have two different tables, where the one with the 
truncated strings is clearly labeled as being for compatibility with 
broken content.

Anyway, back to the text:

"If the character reference is being consumed as part of an attribute, 
and the last character matched is not a U+003B SEMICOLON character (;),..."

...so if the matched string is from the "compatibility table"...

"...and the next character is either a U+003D EQUALS SIGN character (=) 
or in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 
LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN 
SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical 
reasons, all the characters that were matched after the U+0026 AMPERSAND 
character (&) must be unconsumed, and nothing is returned.

Otherwise, a character reference is parsed. If the last character 
matched is not a U+003B SEMICOLON character (;), there is a parse error."
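
Read as a procedure, the quoted text amounts to roughly the following. This is a minimal Python sketch under stated assumptions, not the spec's algorithm: the NAMED table is a hypothetical excerpt of a few names, the function name and return shape are invented for illustration, and numeric references and the no-match parse errors are left out.

```python
import string

# Hypothetical excerpt of the named character reference table. It mixes
# ";"-terminated names with truncated legacy names, as the real table does.
NAMED = {
    "amp;": "&", "amp": "&",
    "pound;": "\u00a3", "pound": "\u00a3",
    "copy;": "\u00a9", "copy": "\u00a9",
    "prod;": "\u220f",  # assumed to have no truncated "prod" entry
}

# ALPHANUM = DIGIT / ALPHA
ALPHANUM = set(string.digits + string.ascii_letters)


def consume_named_reference(text, pos, in_attribute):
    """Try to consume a named reference; text[pos] follows the '&'.

    Returns (replacement or None, new position, parse_error).
    """
    # Longest entry of the table that matches at pos.
    match = None
    for name in NAMED:
        if text.startswith(name, pos) and (match is None or len(name) > len(match)):
            match = name
    if match is None:
        return None, pos, False  # no-match handling omitted in this sketch

    end = pos + len(match)
    next_char = text[end:end + 1]  # "" at end of input
    truncated = not match.endswith(";")

    # The attribute exception: unconsume everything, return nothing.
    # The quoted text mentions no parse error on this branch.
    if in_attribute and truncated and (next_char == "=" or next_char in ALPHANUM):
        return None, pos, False

    # Otherwise a character reference is parsed; a missing ";" is a parse error.
    return NAMED[match], end, truncated
```

With this sketch, consume_named_reference("poundid=qux", 0, True) returns (None, 0, False), while the same input with in_attribute=False returns ("\u00a3", 5, True).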

I'll also point out that saying

  ALPHANUM = DIGIT / ALPHA

once, and then using that instead of

"U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL 
LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER 
A to U+007A LATIN SMALL LETTER Z"

will make this much more readable.

That being said, I'm now also confused about conformance requirements. 
Consider the @href:

   http://example.com/foo?bar=baz&poundid=qux

"pound" is in the compat table, the last character matched wasn't ";" 
and the next ("i") is alphanumeric; so it sounds like this is not a parse error?
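
For comparison, here is what happens to that href when the truncated name is decoded without looking at the next character. The sketch uses Python's html.unescape, which, as far as its behavior goes here, applies the longest-match decoding without the attribute exception; it only illustrates what the exception prevents and is not a model of any browser.

```python
import html

HREF = "http://example.com/foo?bar=baz&poundid=qux"

# html.unescape matches the truncated name "pound" and replaces it,
# regardless of the "i" that follows.
decoded = html.unescape(HREF)

# Under the attribute exception the "&poundid" run is left alone,
# so the attribute value would stay equal to HREF.
mangled = decoded != HREF

print(decoded)
print("URL changed by decoding:", mangled)
```

Here the query parameter "poundid" is lost: the decoded string contains U+00A3 POUND SIGN followed by "id=qux".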

Finally, the remark "for historical reasons" is confusing; it sounds 
like we are disallowing some incorrectly written character references 
for historical reasons, while, as far as I can tell, we are *accepting* 
some of those for historical reasons (because of broken content accepted 
by deployed UAs).


Best regards, Julian
Received on Tuesday, 29 June 2010 21:34:51 UTC