Re: Semicolon after entities from Lachlan Hunt on 2007-04-28 (www-html@w3.org from April 2007)

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Sun, 29 Apr 2007 02:51:59 +1000
To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
CC: www-html@w3.org
Message-ID: <46337BAF.80702@lachy.id.au>
Jukka K. Korpela wrote:
> On Sat, 28 Apr 2007, Lachlan Hunt wrote:
>>> For some odd reason, Lynx displays "&par;" as "PP". It's not the only 
>>> browser that recognizes references for entities not defined in HTML 4.
>>
>> Which UAs and which entity references?
> 
> Does it matter?

Only if the relevant UAs are of any significance today and those 
particular entities are in use.

> I haven't kept a record, but there was a time when 
> browsers supported references like &emdash; and &brkbar;,

I tested IE, Firefox, Opera, Safari, OmniWeb and iCab.  Of those, only 
iCab supported them.

> and I wouldn't be surprised if some of them still did, for compatibility.

Considering that none of the major browsers support those, there 
probably isn't a significant amount of content in existence that relies 
on them.

> The issue was whether such behavior is a bug.
> 
>>> It's not a bug, because there is no mandatory error processing.
>>
>> Lack of defined error handling is one of the most serious issues with 
>> HTML4, and in reality, at least as far as interoperability is 
>> concerned, HTML4 is irrelevant.
> 
> Yet HTML 4 is the closest that we have to a useful "standard" on HTML. 
> Or would you rather use ISO HTML? :-)

A spec's official status shouldn't be given much weight in light of 
evidence that shows the spec is irrelevant in the real world.

> The issue was whether a particular behavior is a bug. To me, a bug is 
> program behavior that deviates from the requirements

Yes.

> Only after having defined what constitutes correct behavior can you 
> call something a bug. You can't just call something a bug because 
> you don't like it.

The sensible definition of correct behaviour is the behaviour that is 
required to be compatible with the web and interoperable with other 
browsers.

>> HTML5 is defining error handling for entity references, which is based 
>> upon the error handling used by the major browsers.
> 
> As I wrote earlier, mandatory error handling is _effectively_ part of 
> the language definition. If a specification says, for example, that an 
> entity reference must always be terminated by a semicolon and you 
> specify that browsers must yet treat an unterminated reference as if it 
> were terminated, you have for all practical purposes made the semicolon 
> optional (in certain conditions).

As I see it, there are 3 approaches to error handling that the spec 
could take:

1. Leave error handling undefined, like HTML4 and XHTML2.

That is clearly unacceptable, because it just leads to the situation we 
are in now, where browsers have spent years reverse engineering each other.

2. Draconian error handling, or at least handling that inflicts 
mysterious error messages upon unsuspecting users.

Draconian error handling, like XML, could not possibly be introduced for 
HTML at this stage.  It would render 93% [*] of the web useless. 
Displaying error messages to users, even if processing doesn't 
completely abort, isn't user friendly in an environment where most users 
wouldn't have a clue what they meant.  But, even if error messages are 
shown, the spec would still have to define the result.

[*] Figure is based on a study of 3 billion documents by Ian Hickson.

3. Graceful error handling, where exact processing is defined in a way 
that is compatible with the web and all UAs can implement it interoperably.

This is the ideal situation.  It achieves the goal that browsers have 
been striving for by reverse engineering each other in the past, retains 
compatibility with the web and results in a spec which, if implemented 
by any future UA, can be used to render existing pages.

>> It would be sensible for lynx to implement HTML parsing more 
>> interoperably with other UAs, and the best chance they have of doing 
>> that, is following HTML5.
> 
> This is not about parsing but about resolving entity references.

In HTML5, expansion of entity references occurs within the parsing 
algorithm.
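
To illustrate what it means for references to be resolved as part of 
parsing, here is a minimal sketch using Python's standard-library 
html.parser module.  This is only an assumption-laden stand-in: that 
module is a simple tokenizer, not a conforming implementation of the 
HTML5 parsing algorithm, and the TextCollector and parsed_text names are 
made up for this example.  With convert_charrefs enabled, the text 
handed to handle_data() has already had its references expanded, 
including a legacy reference that lacks its semicolon.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text nodes seen while parsing."""

    def __init__(self):
        # convert_charrefs=True makes the parser expand character
        # references itself, before text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parsed_text(markup):
    """Return the concatenated text content the parser reports."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    # A properly terminated reference.
    print(ascii(parsed_text("<p>x &mdash; y</p>")))
    # A legacy reference with the semicolon missing.
    print(ascii(parsed_text("<p>&copy 2007</p>")))
```

The caller never sees the raw "&mdash;" or "&copy" text; the expansion 
is not a separate pass applied after parsing.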

>>> When a browser sees, say, &emdash; or &MDASH;, it may - as far as 
>>> HTML 4 specifications are concerned - apply any error handling it likes,
>>> including implicit fix to &mdash;.
>>
>> What?  Are you saying that &emdash; and &MDASH; should be silently 
>> treated the same as &mdash;, or am I misunderstanding you?
> 
> The part you quoted says that they _could_ be treated that way. Other 
> options include not displaying the document at all, showing a blue 
> screen with blinking red text "WRONG", and starting to play Towers of 
> Hanoi.

>> In this case, however, the reality is that major browsers output 
>> unknown entity references literally, without trying to expand them.  
>> So &emdash; is treated equivalent to &amp;emdash;.  That is also how 
>> HTML5 defines error handling for it.
> 
> Is that useful?

It doesn't matter if it's the most theoretically useful output, it's 
what browsers do now, and changing such behaviour could potentially 
result in billions of pages breaking.
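
The literal-output behaviour can be tried out with a short sketch.  It 
uses html.unescape() from Python's standard library, which is documented 
as following the HTML5 rules for valid and invalid character references; 
it stands in for a browser here purely as an illustration, and the 
sample strings are my own.

```python
import html

# Names taken from this thread; none of them is a defined entity name.
unknown = ["&emdash;", "&MDASH;", "&brkbar;"]

for ref in unknown:
    # An unknown reference comes back unchanged, i.e. it is output
    # literally rather than expanded or "corrected".
    print(ref, "->", html.unescape(ref))

# A defined reference is expanded as usual.
print(ascii(html.unescape("&mdash;")))

# The unknown reference and its escaped spelling give the same text.
print(html.unescape("&emdash;") == html.unescape("&amp;emdash;"))
```

No guessing is involved: the name is either in the table of defined 
references or it is passed through as written.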

> The odds that the author wanted such a display are very small.

Perhaps in this one case, you could make that argument, but in the 
general case of &foo;, it's impossible to know what the author actually 
meant.

> Defining error handling that way effectively means that authors 
> can write the ampersand as unescaped if it is followed by a name that is 
> not one of the defined entity names. This isn't useful since it 
> encourages sloppy coding and isn't compatible with future extensions or 
> with browser-specific extensions.

I would argue that it is useful, but regardless of that, any other 
behaviour would not be compatible with the web, so there isn't really a 
choice in the matter.

>> Actually, in practice, when an author uses an undefined entity 
>> reference, it's usually because they forgot to encode & as &amp; and 
>> expect UAs to output it literally, exactly the way most browsers do.
> 
> That might be true for query parts of URLs, but what are the odds of 
> someone using "emdash" as a form field name,

Let's assume for a moment that that behavior was compatible with the 
web.  How would it even be possible to implement in the general case? 
Sure, UAs could easily hard code "emdash" and possibly a few other 
cases, but there are hundreds of entity references and even more ways of 
slightly mistyping them.  If such behaviour were to be implemented, the 
precise algorithm would need to be specced.  That would be unbelievably 
complex, if not impossible.
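
A rough sketch of why guessing is hard to specify, assuming a naive 
"nearest defined name by edit distance" rule.  That rule, the function 
names and the sample typo "dash" are all hypothetical; nothing like this 
is proposed in any spec.  The table of names is the HTML 4.01 entity set 
as shipped in Python's html.entities module.

```python
from html.entities import name2codepoint  # HTML 4.01 entity names


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def nearest_names(typo, max_distance=1):
    """Defined entity names within max_distance edits of the typo."""
    return sorted(name for name in name2codepoint
                  if edit_distance(typo, name) <= max_distance)


if __name__ == "__main__":
    # "emdash" is one edit away from "mdash"...
    print(nearest_names("emdash"))
    # ...but "dash" is one edit away from both "mdash" and "ndash",
    # so the rule cannot say which character the author meant.
    print(nearest_names("dash"))
```

Even this toy rule needs a distance threshold and a tie-breaker, and 
every UA would have to pick the same ones to be interoperable.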

>>> Besides, &par; or &emdash; isn't really an error by SGML rules, which 
>>> is what HTML 4 is nominally based on. They are just undefined. :-)
>>
>> SGML rules for HTML are irrelevant these days.
> 
> Rules are relevant if you call something a bug. As the Romans said, 
> nullum crimen sine lege - no crime without a law. If you refuse to 
> recognize HTML 4 as being closest to the _current definition_ of HTML, 
> how can you call _anything_ a bug? You can't just make up new rules, 
> treating some sketchy draft as if it were a standard, and call 
> violations of those rules bugs.

I didn't call anything a bug according to rules defined in HTML4, or any 
other spec for that matter.  I called it a bug based on the reality of 
the situation, which is that Lynx's behaviour in this case is 
incompatible with that of every other browser I tested.

> There are actually several specifications for HTML, including HTML 4.01, 
> XHTML 1.0, and XHTML 1.1 (none of which has been obsoleted) as well as 
> the ISO HTML standard. Do browsers claim to conform to some of them? If 
> not, you are not allowed to judge the browsers by them.

Every single HTML spec in existence from HTML 2.0 to HTML 4.01 and XHTML 
1.0, 1.1 and 2.0, regardless of their official status, either is, or is 
very close to being, irrelevant in the real world.

> Similarly, HTML 5, if it will ever be a complete draft and then approved 
> by some organization, would be just one player in the field.

Regardless of what you may think, and regardless of its official status, 
HTML5 is the only really relevant HTML spec in existence for 
implementers these days.

> There will be little interest in it by most authors, if the dominant 
> browser will not conform to it or make any serious attempt at conformance. 
> It might start the next round of browser wars, though.

The development of the HTML5 spec has the support of at least 4 major 
browser vendors (IE, Mozilla, Opera and Safari).  None of them are 
interested in another round of browser wars.

-- 
Lachlan Hunt
http://lachy.id.au/
Received on Saturday, 28 April 2007 16:52:10 UTC