- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Sat, 28 Apr 2007 16:22:25 +0300 (EEST)
- To: www-html@w3.org
On Sat, 28 Apr 2007, Lachlan Hunt wrote:

>> For some odd reason, Lynx displays "∥" as "PP". It's not the only
>> browser that recognizes references for entities not defined in HTML 4.
>
> Which UAs and which entity references?

Does it matter? I haven't kept a record, but there was a time when browsers supported references like &emdash; and &brkbar;, and I wouldn't be surprised if some of them still did, for compatibility. By the way, Lynx seems to have a large number of such entities; the document
http://lynx.isc.org/current/lynx2-8-7/src/chrtrans/entities.h
lists many of them. (It treats ∥ as denoting PARALLEL TO. The fallback rendering of "PP" surprises me.)

The issue was whether such behavior is a bug.

>> It's not a bug, because there is no mandatory error processing.
>
> Lack of defined error handling is one of the most serious issues with HTML4,
> and in reality, at least as far as interoperability is concerned, HTML4 is
> irrelevant.

Yet HTML 4 is the closest that we have to a useful "standard" on HTML. Or would you rather use ISO HTML? :-)

The issue was whether a particular behavior is a bug. To me, a bug is program behavior that deviates from the requirements (or, in a narrower sense, an _unintentional_ deviation). Only after having defined what constitutes correct behavior can you call something a bug. You can't just call something a bug because you don't like it.

> HTML5 is defining error handling for entity references, which is based upon
> the error handling used by the major browsers.

As I wrote earlier, mandatory error handling is _effectively_ part of the language definition. If a specification says, for example, that an entity reference must always be terminated by a semicolon, and you then specify that browsers must nevertheless treat an unterminated reference as if it were terminated, you have for all practical purposes made the semicolon optional (in certain conditions).
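[Illustration of the semicolon point above. This is a minimal Python sketch, not the HTML 4 or HTML5 algorithm: the entity table, the two patterns and the `expand` function are all hypothetical names made up for the example. It only shows that once the recovery rule is mandatory, input without the semicolon produces the same output as input with it.]

```python
import re

# Tiny illustrative entity table (an assumption; real tables are much larger).
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "mdash": "\u2014"}

# Rule as written in the hypothetical spec: a reference must end with ";".
STRICT = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Same rule plus mandated recovery: a missing ";" is treated as if present.
LENIENT = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def expand(text, pattern):
    """Replace known entity references matched by `pattern`; leave the rest alone."""
    def repl(match):
        name = match.group(1)
        # Unknown names are passed through unchanged in this sketch.
        return ENTITIES.get(name, match.group(0))
    return pattern.sub(repl, text)


with_semicolon = expand("a &mdash; b", LENIENT)
without_semicolon = expand("a &mdash b", LENIENT)
strict_result = expand("a &mdash b", STRICT)
```

[Under the lenient pattern both inputs expand identically, so for an author the semicolon has become optional; under the strict pattern the unterminated reference is left untouched.]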
> It would be sensible for lynx
> to implement HTML parsing more interoperably with other UAs, and the best
> chance they have of doing that, is following HTML5.

This is not about parsing but about resolving entity references.

>> When a browser sees, say, &emdash; or &MDASH;, it may - as far as HTML 4
>> specifications are concerned - apply any error handling it likes,
>> including implicit fix to &mdash;.
>
> What? Are you saying that &emdash; and &MDASH; should be silently treated
> the same as &mdash;, or am I misunderstanding you?

The part you quoted says that they _could_ be treated that way. Other options include not displaying the document at all, showing a blue screen with blinking red text "WRONG", and starting to play Towers of Hanoi. In the part of my message that you quoted later (see below), I wrote that such processing would generally be the best, with some reservations. It might be a good idea to display a small generic error indicator on the screen, so that clicking on it shows information about the error. But most users wouldn't care about the error indicator (especially since the majority of pages would have it) or couldn't make use of it, so it's almost as good to just silently do what the author meant.

> In this case, however, the reality is that major browsers output unknown
> entity references literally, without trying to expand them. So &emdash; is
> treated equivalent to &amp;emdash;. That is also how HTML5 defines error
> handling for it.

Is that useful? The odds that the author wanted such a display are very small. Defining error handling that way effectively means that authors can write the ampersand unescaped whenever it is followed by a name that is not one of the defined entity names. This isn't useful, since it encourages sloppy coding and isn't compatible with future extensions or with browser-specific extensions.
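[Illustration of the two error-handling policies discussed above. This is a rough Python sketch under stated assumptions: `KNOWN`, `ALIASES`, `resolve` and the policy names are hypothetical, and the alias table merely guesses at author intent. It does not reproduce any browser's or Lynx's actual behavior.]

```python
import re

# Defined entity names (a tiny illustrative subset).
KNOWN = {"amp": "&", "mdash": "\u2014"}

# Hypothetical alias table for the "guess" policy: undefined names that are
# assumed to be mistyped forms of a defined name.
ALIASES = {"emdash": "mdash", "MDASH": "mdash"}

REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def resolve(text, policy="literal"):
    """Expand entity references; `policy` selects the handling of unknown names.

    "literal": output the unknown reference as written, as if "&" were "&amp;".
    "guess":   silently map known misspellings to the intended entity.
    """
    def repl(match):
        name = match.group(1)
        if name in KNOWN:
            return KNOWN[name]
        if policy == "guess" and name in ALIASES:
            return KNOWN[ALIASES[name]]
        return match.group(0)
    return REF.sub(repl, text)


literal_out = resolve("wait &emdash; what?", policy="literal")
guess_out = resolve("wait &emdash; what?", policy="guess")
```

[With the "literal" policy the reader sees the raw text "&emdash;"; with the "guess" policy the reader sees an em dash. Neither is required or forbidden by HTML 4, which is the point under dispute.]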
>> We might even argue that this is the _best_ error processing strategy in
>> practice, since that's probably what the author meant, and if it
>> isn't, we have little odds of achieving anything better using some
>> including implicit fix to &mdash;.
>
> Actually, in practice, when an author uses an undefined entity reference,
> it's usually because they forgot to encode & as &amp; and expect UAs to
> output it literally, exactly the way most browsers do.

That might be true for query parts of URLs, but what are the odds of someone using "emdash" as a form field name, as opposed to someone making the fairly natural mistake of typing the entity reference for the em dash character as &emdash;?

>> Besides, ∥ or &emdash; isn't really an error by SGML rules, which is
>> what HTML 4 is nominally based on. They are just undefined. :-)
>
> SGML rules for HTML are irrelevant these days.

Rules are relevant if you call something a bug. As the Romans said, nullum crimen sine lege - no crime without a law. If you refuse to recognize HTML 4 as being closest to the _current definition_ of HTML, how can you call _anything_ a bug? You can't just make up new rules, treating some sketchy draft as if it were a standard, and call violations of those rules bugs.

Strictly speaking, when a browser does not process a document according to an HTML (or CSS or whatever) specification, we can call this a bug only if the browser vendor _claims conformance_, i.e. declares that the browser conforms, or tries to conform, to the specification. There are actually several specifications for HTML, including HTML 4.01, XHTML 1.0, and XHTML 1.1 (none of which has been obsoleted) as well as the ISO HTML standard. Do browsers claim to conform to some of them? If not, you are not allowed to judge the browsers by them.

Similarly, HTML 5, if it ever becomes a complete draft and is then approved by some organization, would be just one player in the field.
Most authors will have little interest in it if the dominant browser does not conform to it or make any serious attempt at conformance. It might start the next round of browser wars, though.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Saturday, 28 April 2007 13:22:34 UTC