Re: Semicolon after entities

On Sat, 28 Apr 2007, Lachlan Hunt wrote:

>> For some odd reason, Lynx displays "&parallel;" as "PP". It's not the only 
>> browser that recognizes references for entities not defined in HTML 4.
>
> Which UAs and which entity references?

Does it matter? I haven't kept a record, but there was a time when 
browsers supported references like &emdash; and &brkbar;, and I wouldn't 
be surprised if some of them still did, for compatibility. By the way, 
Lynx seems to have a large number of such entities; the document
http://lynx.isc.org/current/lynx2-8-7/src/chrtrans/entities.h
lists many of them.

(It treats &parallel; as denoting PARALLEL TO. The fallback rendering of "PP" 
surprises me.)

The issue was whether such behavior is a bug.

>> It's not a bug, because there is no mandatory error processing.
>
> Lack of defined error handling is one of the most serious issues with HTML4, 
> and in reality, at least as far as interoperability is concerned, HTML4 is 
> irrelevant.

Yet HTML 4 is the closest that we have to a useful "standard" on HTML. Or 
would you rather use ISO HTML? :-)

The issue was whether a particular behavior is a bug. To me, a bug is 
program behavior that deviates from the requirements (or, in a more narrow 
sense, an _unintentional_ deviation). Only after having defined what 
constitutes correct behavior can you call something a bug. You can't just 
call something a bug because you don't like it.

> HTML5 is defining error handling for entity references, which is based upon 
> the error handling used by the major browsers.

As I wrote earlier, mandatory error handling is _effectively_ part of the 
language definition. If a specification says, for example, that an entity 
reference must always be terminated by a semicolon and you specify that 
browsers must yet treat an unterminated reference as if it were 
terminated, you have for all practical purposes made the semicolon 
optional (in certain conditions).
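
A minimal sketch of that point in Python, assuming a hypothetical two-entry 
table KNOWN and a hypothetical function resolve_lenient (neither is taken 
from any specification or browser). Once the resolver must accept the 
unterminated form, the two spellings are indistinguishable in the output:

```python
import re

# Hypothetical two-entry table; a real browser would carry the full HTML set.
KNOWN = {"mdash": "\u2014", "amp": "&"}

# The trailing semicolon is optional in the pattern: that is the whole point.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")

def resolve_lenient(text):
    """Expand known references whether or not they end in a semicolon."""
    def expand(match):
        name = match.group(1)
        # Unknown names are left untouched in this sketch.
        return KNOWN.get(name, match.group(0))
    return REFERENCE.sub(expand, text)

# Both spellings produce the same text, so the semicolon is optional in effect.
terminated = resolve_lenient("1990&mdash;2007")
unterminated = resolve_lenient("1990&mdash 2007")
```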

> It would be sensible for lynx 
> to implement HTML parsing more interoperably with other UAs, and the best 
> chance they have of doing that, is following HTML5.

This is not about parsing but about resolving entity references.

>> When a browser sees, say, &emdash; or &mdash, it may - as far as HTML 4 
>> specifications are concerned - apply any error handling it likes,
>> including implicit fix to &mdash;.
>
> What?  Are you saying that &emdash; and &mdash should be silently treated 
> the same as &mdash;, or am I misunderstanding you?

The part you quoted says that they _could_ be treated that way. Other 
options include not displaying the document at all, showing a blue 
screen with blinking red text "WRONG", and starting to play Towers of 
Hanoi.

In the part of my message that you quoted later (see below), I wrote that such 
processing would generally be the best, with some reservations. It might 
be a good idea to display a small generic error indicator on the screen, 
so that clicking on it shows information about the error. But most users 
wouldn't care about the error indicator (especially since the majority of 
pages would have it) or couldn't make use of it, so it's almost as good to 
just silently do what the author meant.
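
As a rough illustration only, the silent fix could be sketched in Python 
like this. The alias table ALIASES and the function resolve_with_fix are 
hypothetical names made up for the example; html.unescape is the standard 
library function and is used here merely to look up the defined name:

```python
import html
import re

# Hypothetical alias table: a likely misspelling mapped to the defined name.
ALIASES = {"emdash": "mdash"}

# Only semicolon-terminated references are considered in this sketch.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def resolve_with_fix(text):
    """Expand references, silently correcting names found in ALIASES."""
    def expand(match):
        name = ALIASES.get(match.group(1), match.group(1))
        # html.unescape returns the reference unchanged if the name is unknown.
        return html.unescape("&" + name + ";")
    return REFERENCE.sub(expand, text)

fixed = resolve_with_fix("1990&emdash;2007")
```

A generic error indicator would need the resolver to report which names it 
corrected; the sketch omits that.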

> In this case, however, the reality is that major browsers output unknown 
> entity references literally, without trying to expand them.  So &emdash; is 
> treated equivalent to &amp;emdash;.  That is also how HTML5 defines error 
> handling for it.

Is that useful? The odds that the author wanted such a display are very 
small. Defining error handling that way effectively means that authors can 
write the ampersand unescaped if it is followed by a name that is not 
one of the defined entity names. This isn't useful since it encourages 
sloppy coding and isn't compatible with future extensions or with 
browser-specific extensions.
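
The equivalence can be observed with Python's standard html.unescape, which 
is documented as following the HTML 5 rules for character references. This 
is only a sketch of the behavior under discussion, not a statement about 
any particular browser:

```python
import html

# An undefined name is passed through literally ...
unknown = html.unescape("&emdash;")

# ... so it yields the same text as a properly escaped ampersand.
escaped = html.unescape("&amp;emdash;")

# A defined name is expanded as usual.
defined = html.unescape("&mdash;")
```

If a later revision of the entity set defined the name, the same markup 
would start to render differently, which is the compatibility problem.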

>> We might even argue that this is the _best_ error processing strategy in 
>> practice, since that's probably what the author meant, and if it
>> isn't, we have little odds of achieving anything better using some
>> other strategy.
>
> Actually, in practice, when an author uses an undefined entity reference, 
> it's usually because they forgot to encode & as &amp; and expect UAs to 
> output it literally, exactly the way most browsers do.

That might be true for query parts of URLs, but what are the odds of 
someone using "emdash" as a form field name, as opposed to someone 
making a fairly natural mistake of typing the entity reference for the em 
dash character as &emdash;?

>> Besides, &parallel; or &emdash; isn't really an error by SGML rules, which is 
>> what HTML 4 is nominally based on. They are just undefined. :-)
>
> SGML rules for HTML are irrelevant these days.

Rules are relevant if you call something a bug. As the Romans said, nullum 
crimen sine lege - no crime without a law. If you refuse to recognize HTML 
4 as being closest to the _current definition_ of HTML, how can you call 
_anything_ a bug? You can't just make up new rules, treating some sketchy 
draft as if it were a standard, and call violations of those rules bugs.

Strictly speaking, when a browser does not process a document according to 
an HTML (or CSS or whatever) specification, we can call this a bug only if 
the browser vendor _claims conformance_, i.e. declares that the browser 
conforms, or tries to conform, to the specification.

There are actually several specifications for HTML, including HTML 4.01, 
XHTML 1.0, and XHTML 1.1 (none of which has been obsoleted) as well as the 
ISO HTML standard. Do browsers claim to conform to some of them? If not, 
you are not allowed to judge the browsers by them.

Similarly, HTML 5, if it ever becomes a complete draft and is then approved 
by some organization, would be just one player in the field. There will be 
little interest in it among most authors if the dominant browser does not 
conform to it or make any serious attempt at conformance. It might start 
the next round of browser wars, though.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Saturday, 28 April 2007 13:22:34 UTC