[whatwg] Entity parsing

On Sat, 16 Jun 2007 15:30:07 +0200, Anne van Kesteren <annevk at opera.com>  
wrote:

>> No, IE doesn't break them, and that's the point.
>>
>> Section 8.2.3.1. states "This definition is used when parsing entities  
>> in text and in attributes." - if I understand this correctly, this  
>> makes semicolon optional for entities in both attributes and text and  
>> "&region" in attribute would be interpreted as "®ion".
>> If that's the case, it is not compatible with IE, because it parses  
>> entities differently in attributes and text. In attributes semicolon  
>> (any non-alphanumeric character actually) is required, but in text it  
>> is not.
>>
>> In IE6 <a href="&region">&region</a> is equivalent to <a  
>> href="&amp;region">®ion</a>
>
> Awesome. Guess we have to reverse engineer that too then...

    http://simon.html5.org/test/html/parsing/entities/trailing-semicolon/

The tests aren't really digestible in their current state unless you know  
what they're doing, so I'll just summarize the results below. I might  
create proper test cases for this later, once it is specced.


Entity parsing works the same in different attributes (tested <img alt>  
and <a href>).

Any character that is not in the range [a-zA-Z0-9] ends an entity -- i.e.,  
the following are equivalent:

    <img alt="&AElig.">
    <img alt="&AElig;.">

...and the following are equivalent:

    <img alt="&AElig1">
    <img alt="&amp;AElig1">


This means that the semi-colon is not part of the entity name, and we need  
to revert to the old entity table and instead have a third column that  
says which entities always require a semi-colon.
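
A rough sketch of what such a three-column table could look like, in
Python. The `Entity` type, the field names, and the rows are made up for
illustration; in particular, which entities get the "semicolon required"
flag is an assumption here, not something the tests above established.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    name: str                 # entity name, without "&" and without ";"
    text: str                 # replacement text
    semicolon_required: bool  # hypothetical third column


# Illustrative rows only; the flags are assumptions, not spec data.
ENTITY_TABLE = [
    Entity("AElig", "\u00C6", False),
    Entity("amp", "&", False),
    Entity("reg", "\u00AE", False),
    Entity("rarr", "\u2192", True),  # assumed to always need ";"
]

# Lookup by bare name, since the semicolon is not part of the name.
BY_NAME = {entity.name: entity for entity in ENTITY_TABLE}
```

The point of keying on the bare name is that the semicolon becomes a
property of the entry rather than part of its name.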

You consume as many characters as possible that match the entity table,  
and for the longest match, check if the next character is in the  
above-mentioned range. If yes, emit the consumed characters unchanged as  
literal text; otherwise emit the entity, or something along those lines.
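
The attribute-value steps above could be sketched roughly like this in
Python. `decode_attribute`, `ENTITIES` and `ALNUM` are hypothetical names,
the table holds only three entries, and treating end of input like a
non-alphanumeric character is my assumption. The "semicolon always
required" column is left out to keep the sketch short.

```python
import string

# Tiny stand-in for the entity table: bare names, no trailing semicolon.
ENTITIES = {"AElig": "\u00C6", "amp": "&", "reg": "\u00AE"}

# The range [a-zA-Z0-9] mentioned above.
ALNUM = set(string.ascii_letters + string.digits)


def decode_attribute(value: str) -> str:
    """Decode entities in an attribute value, per the steps sketched above."""
    out = []
    i = 0
    while i < len(value):
        if value[i] != "&":
            out.append(value[i])
            i += 1
            continue

        # Find the longest entity name that starts right after the "&".
        match = None
        for name in ENTITIES:
            if value.startswith(name, i + 1):
                if match is None or len(name) > len(match):
                    match = name

        if match is None:
            out.append("&")  # not an entity; keep the ampersand as text
            i += 1
            continue

        end = i + 1 + len(match)
        following = value[end:end + 1]  # "" at end of input (assumption)
        if following in ALNUM:
            # Next character is alphanumeric: emit the consumed text as-is.
            out.append("&" + match)
            i = end
        else:
            # Anything else ends the entity; swallow a ";" if present.
            out.append(ENTITIES[match])
            i = end + 1 if following == ";" else end
    return "".join(out)
```

With this, `decode_attribute("&AElig.")` and `decode_attribute("&AElig;.")`
give the same string, while `"&AElig1"` and `"&region"` come back
unchanged, which matches the attribute results listed above.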

-- 
Simon Pieters

Received on Monday, 18 June 2007 03:47:57 UTC