[whatwg] Entity parsing

On 25 Jun 2007, at 8:28AM, Ian Hickson wrote:

> On Sun, 24 Jun 2007, Øistein E. Andersen wrote:
>> HTML5 currently follows IE7 much more closely than Safari,
>> Firefox and Opera do, which seems to suggest that some of the quirks
>> could be dispensed with.
> It's possible, though people kept pointing out problems, which is how we 
> ended up where we are now.

I have probably missed parts of this discussion, but most of the arguments
I have seen rely on the assumption that whatever IE does is more
compatible with the Web as it is. That is probably a good approximation,
but replicating every single detail is not necessarily the best thing to do.

> Calling SGML "sensible" is a slippery slope! :-)

Sure, I did not mean to imply that all aspects of SGML are sensible :-)

(Bad connotations aside, SGML's rules for optional semicolons
happen to be less contrived than IE's.)

>> [It might be a good idea to accept a missing semicolon at the end of words.]
> Well, we'd have to prove this somehow with real research.

Yes, research is really missing here.

Whatever we do, some pages will break, and it is not a priori impossible
that a compromise between the IE and SGML rules could be both less quirky
and more compatible with existing content.

I am unable to do a proper corpus study on this, but the following
examples suggest that following IE blindly may not be optimal.
All markup is extracted from real Web pages, and the author's intent
was quite obvious from the context. The numbers in parentheses indicate
the number of pages found using Google.

I] Should be expanded

    1) only SGML expands
                IE (incorrect): &mdash
                SGML (correct): —

    2) only IE expands
            fianc&eacutee (390), caf&eacutes (1,460), na&iumlve (716)
                IE (correct): fiancée, cafés, naïve
                SGML (incorrect): fianc&eacutee, caf&eacutes, na&iumlve

    3) neither expands
            &oeliguvre (719), c&oeligur (3,720)
                both (incorrect): &oeliguvre, c&oeligur
                intended: œuvre, cœur

II] Should not be expanded

    1) IE expands
            moral&ethics, roses&thorns
                IE (incorrect): moralðics, rosesþs
                SGML (correct): moral&ethics, roses&thorns

    2) SGML expands
            Alpha&Omega, once&forall
                IE (correct): Alpha&Omega, once&forall
                SGML (incorrect): AlphaΩ, once∀

    3) both expand
                both (incorrect): roseþ
                intended: rose&thorn
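
As a rough illustration of the two behaviours compared above, here is a
minimal Python sketch. It is only a simplified model inferred from the
examples in this message, not the actual algorithm of IE or of any SGML
parser. The function names, the tiny entity table and the LATIN1 set are
assumptions made for the sketch: the IE-like model accepts a missing
semicolon only for Latin-1 names and matches the longest such prefix,
while the SGML-like model reads the whole run of name characters and
expands it only if that full name is a known entity.

```python
import re

# Small illustrative entity table (assumption: a tiny subset of the real one).
ENTITIES = {
    "eacute": "\u00e9", "iuml": "\u00ef", "eth": "\u00f0", "thorn": "\u00fe",
    "mdash": "\u2014", "oelig": "\u0153", "Omega": "\u03a9", "forall": "\u2200",
}
# Assumption: names for which the IE-like model tolerates a missing semicolon.
LATIN1 = {"eacute", "iuml", "eth", "thorn"}

# "&", then a maximal run of letters/digits, then an optional semicolon.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def expand_sgml_like(text):
    """Expand a reference only if the full run of name characters is known."""
    def repl(match):
        name = match.group(1)
        if name in ENTITIES:
            return ENTITIES[name]      # the semicolon, if any, is consumed
        return match.group(0)          # unknown name: leave markup untouched
    return REFERENCE.sub(repl, text)


def expand_ie_like(text):
    """Expand known names with a semicolon; otherwise try a Latin-1 prefix."""
    def repl(match):
        name, semi = match.group(1), match.group(2)
        if semi and name in ENTITIES:
            return ENTITIES[name]
        # No usable semicolon: look for the longest Latin-1 name at the start.
        for end in range(len(name), 0, -1):
            if name[:end] in LATIN1:
                return ENTITIES[name[:end]] + name[end:] + semi
        return match.group(0)
    return REFERENCE.sub(repl, text)
```

Under these assumptions, expand_ie_like("fianc&eacutee") yields "fiancée"
and expand_ie_like("moral&ethics") yields "moralðics", whereas
expand_sgml_like leaves both untouched; conversely,
expand_sgml_like("Alpha&Omega") yields "AlphaΩ" while the IE-like model
leaves it alone. Neither model expands "c&oeligur".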

The examples I have found in category II] are all quite rare, but it is not unlikely
that more common ones exist.

Opera and Google both seem to err on the side of caution by only expanding
entities when both IE and SGML do, i.e., in case II.3) above.

It is also interesting to notice that reasonably common words belonging to class
I.2), which are handled by IE, are apparently no more frequent than words from I.3),
which no (popular) current browser handles correctly.

I am looking forward to seeing more extensive research on this.

Øistein E. Andersen

Received on Monday, 25 June 2007 17:50:39 UTC