Re: Semicolon after entities from Lachlan Hunt on 2007-04-30 (www-html@w3.org from April 2007)

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Mon, 30 Apr 2007 23:02:44 +1000
To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
CC: www-html@w3.org
Message-ID: <4635E8F4.4040507@lachy.id.au>

Jukka K. Korpela wrote:
> On Sun, 29 Apr 2007, Lachlan Hunt wrote:
>> Considering that none of the major browsers support those, there 
>> probably isn't a significant [amount] of content in existence that 
>> relies [on] them.
> 
> Since we have a notation like &emdash; which has no defined meaning by 
> existing HTML specs and has been used and described in the past as 
> denoting the em dash, and since this notation may spontaneously arise 
> from a typo or misremembering the name, is there really any reason
> a) not to display it as if &mdash; had been encountered
> b) to declare behavior a) as incorrect?

If that error handling were to be defined and implemented, it would 
require significant research to determine the cost and benefit of making 
the change.  Considering that none of the 4 major browsers implement it 
that way, it strongly suggests that they do not consider it important 
for compatibility with content on the web.  Consistent error handling 
among browsers is also more important than logical error handling.

>>> Yet HTML 4 is the closest that we have to a useful "standard" on 
>>> HTML. Or would you rather use ISO HTML? :-)
>>
>> A spec's official status shouldn't be given much weight in light of 
>> evidence that shows the spec is irrelevant in the real world.
> 
> So what _do_ you use as the current HTML specification? Do you construct 
> it from the observed behavior of web browsers?

That is effectively the process that was used to define the parsing 
requirements for HTML5.

> And what do you do when  browsers disagree? The market leader takes all?
> Your favorite browser takes all?

Research is done on a case by case basis to determine the most sensible 
approach to take.  Many factors are taken into consideration, such as 
the impact that a change would have on existing content, the complexity 
and the ability of browser vendors to implement it.

See, for instance, some of Hixie's articles discussing the way browsers 
handle various markup errors, which show just some of the extensive 
research that has been done to develop the spec over the years.

Tag Soup: How UAs handle <x> <y> </x> </y>
http://ln.hixie.ch/?start=1037910467&count=1

Tag Soup: Crazy parsing adventures
http://ln.hixie.ch/?start=1137740632&count=1

Tag Soup: Blocks-in-inlines
http://ln.hixie.ch/?start=1138169545&count=1

Tag Soup: appendChild() of a script that calls document.write()
http://ln.hixie.ch/?start=1155195074&count=1

Tag Soup: innerHTML interoperability (or lack thereof)
http://ln.hixie.ch/?start=1158969783&count=1

>> The sensible definition for correct behaviour, is the behaviour that 
>> is required to be compatible with the web and interoperable with other 
>> browsers.
> 
> This seems to be a contrived way of saying that in your opinion, correct 
> behavior is what most browsers do. Finally a way to make most browsers 
> comply! :-) Or maybe not, since they differ too much.

In theory, it would be really nice if browsers could implement the 
existing specs exactly as they are written.  However, in reality, that 
simply is not possible and there's no point trying to be idealistic in 
the face of reality.

> "Compatible with the web" seems to be your pet phrase. It sounds 
> impressive, but what does it mean?

It means being able to parse and render the existing content on the web 
in a way that meets the expectations of both users and authors, which 
are based upon their experiences with legacy browsers.

> "Interoperability with other browsers" is not a reasonable criterion. 
> First, browsers don't really interoperate.

That is a serious problem that HTML5 is trying to address.

> Second, what you really mean is working the same way as other browsers.

Yes, that's what interoperability means.

> What's the point of having different browsers if they are all required
> to work the same way [...]?

To improve competition in the browser market, which in turn promotes 
choice and innovation.

Think about this from a user's perspective.  You should be able to 
choose what browser you wish to use, for whatever reason you like.  It 
could be easy to use, have useful extensions, good looking themes, or 
whatever.  Just take a look at how many Gecko based browsers there are, 
all of which offer different features.  The rendering engine isn't the 
most important feature in the eyes of typical users.

Users just expect their browser to display pages well and they shouldn't 
have to use different browsers for accessing different sites.

The reality of the situation is that a significant portion of the web is 
built using extremely broken HTML, yet most users simply do not know or 
care about that.  Browsers have a responsibility to handle errors in 
pages because it's what users implicitly expect.

When different browsers handle errors differently, it increases the 
chances of pages working in some browsers, while being incredibly broken 
in others.  That limits, or even prevents, the user's ability to choose 
the browser they want to use.

> This will get horrendously complex. Presumably you don't think that 
> Firefox and Opera and IE 7 are grossly incorrect since they differ in 
> _many_ ways from IE 6, which is still the most common browser.

All browsers have bugs, but we need to determine the most reasonable 
behaviour based on the way existing browsers work.

>> As I see it, there are 3 approaches to error handling that the spec 
>> could take:
>>
>> 1. Leave error handling undefined, like HTML4 and XHTML2.
>>
>> That is clearly unacceptable, because it just leads to the situation 
>> we are in now, where browsers have spent years reverse engineering 
>> each other.
> 
> No, the fact that browsers have imitated each other (most importantly, 
> other browsers have imitated IE) is _not_ a result of undefined error 
> handling.

Yes, it is a direct result of undefined error handling.  Authors build 
pages that rely on bugs in whatever browser they use, so in the eyes of 
the author, the buggy browser behaviour is correct.  That has been 
happening since the early days of the web.  When a user tries to view 
such pages in a browser that doesn't have the same bugs that are 
relied upon in the author's browser, the page breaks.  That is exactly 
why browser vendors have been reverse engineering each other.

>  I can't see your logic here.

If you think I'm wrong, then please explain why you think browsers have 
reverse engineered each other.

>> 3. Graceful error handling, where exact processing is defined in a way 
>> that is compatible with the web and all UAs can implement it 
>> interoperably.
> 
> If you define exact processing and make it mandatory, how will the 
> situation differ from defining the errors as language features? Except 
> in wording, that is.

In some cases, we did exactly that.  For example, using the XML empty 
element syntax like <br/> was initially considered an error in HTML5 and 
had completely different processing in HTML4.  But the practice is 
widespread, many authoring tools output that syntax and could be costly 
to fix and upgrade, its use is harmless in reality, and some authors 
actually like using it, so it was decided to make the syntax 
conforming.  Authors can now use either <br> or <br/> (and similarly 
for other empty elements).
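
As a rough illustration of the two syntaxes being processed the same way, 
here is a small sketch using Python's standard html.parser module.  It is 
not an HTML5-conforming parser, and the StartTagCollector class and 
start_tags() helper are names made up for this example; it only shows a 
tolerant parser reporting the same element for both forms.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser's default handling of <br/> also ends up here,
        # so both spellings are recorded the same way.
        self.tags.append(tag)


def start_tags(markup):
    """Return the start tag names found in markup, in document order."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

Calling start_tags("a<br>b") and start_tags("a<br/>b") should give the 
same list of tag names.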

Another example is that characters like '/' can now be used in unquoted 
attribute values, e.g. <a href=http://example.com/>...</a>.  In HTML4, 
that was an error, but no browser handled it according to SGML rules. 
There is no benefit in disallowing such a widely supported and used 
syntax, so it too was made conforming.
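
A minimal sketch of that tolerance, again assuming Python's standard 
html.parser as a stand-in for a browser's tokenizer (it is not one, and 
the AttributeCollector class and attributes_of() helper are hypothetical 
names for this example):

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records each start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as parsed from the tag.
        self.seen.append((tag, dict(attrs)))


def attributes_of(markup):
    """Return (tag, attributes) pairs for every start tag in markup."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen
```

With this, attributes_of('<a href=http://example.com/index.html>x</a>') 
and the same markup with the value in double quotes should report the 
same href, slashes included.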

Yet there are things that are clearly errors.  For example, omitting an 
end tag from an element that doesn't allow it.  For inline elements, 
compatibility restrictions require that unclosed elements get reopened 
after their parent closes.  Although such errors occur very often, the 
required error handling isn't particularly sensible and it's often not 
what the author intended (though, sometimes it is, as in the case of 
<b><i>foo</b></i>).
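
To make the "reopened after their parent closes" behaviour concrete, here 
is a deliberately simplified sketch.  It is not the algorithm in the 
HTML5 draft; it only handles attribute-less tags and plain text, and the 
renest() function and its internals are invented for this example.

```python
import re

# Matches a simple start tag, a simple end tag, or a run of text.
TOKEN = re.compile(r"<(/?)([a-zA-Z]+)>|([^<]+)")


def renest(markup):
    """Return markup with misnested inline tags closed and reopened."""
    open_tags = []  # elements currently open, outermost first
    pending = []    # elements closed early that should be reopened
    out = []

    def reopen():
        # Reopen elements that were force-closed when an ancestor ended.
        while pending:
            name = pending.pop(0)
            open_tags.append(name)
            out.append(f"<{name}>")

    for closing, name, text in TOKEN.findall(markup):
        name = name.lower()
        if text:
            reopen()
            out.append(text)
        elif not closing:
            reopen()
            open_tags.append(name)
            out.append(f"<{name}>")
        elif name in open_tags:
            # Close everything above the matching element, remembering
            # what was closed early so it can be reopened later.
            closed_early = []
            while True:
                top = open_tags.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
                closed_early.append(top)
            pending[:0] = reversed(closed_early)
        elif name in pending:
            # The end tag arrived before the element was reopened.
            pending.remove(name)
        # Any other end tag is stray and is ignored.

    for name in reversed(open_tags):
        out.append(f"</{name}>")
    return "".join(out)
```

For <b><i>foo</b></i> this yields <b><i>foo</i></b>, and for 
<b><i>foo</b>bar</i> the <i> is reopened around "bar" once the <b> has 
closed.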

>>>> In this case, however, the reality is that major browsers output 
>>>> unknown entity references literally, without trying to expand them. 
>>>> So &emdash; is treated equivalent to &amp;emdash;.  That is also how 
>>>> HTML5 defines error handling for it.
>>>
>>> Is that useful?
>>
>> It doesn't matter if it's the most theoretically useful output,
> 
> "Useful" is a practical concept.

Usually, yes.  But the error handling you're advocating for &emdash; is 
probably not practical in the real world (though, as I said above, it 
would require research to know for sure).
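
For what it's worth, the literal-output behaviour is easy to observe 
outside a browser too.  Python's html.unescape() is documented as 
following the HTML 5 rules and its list of named character references, 
so it can serve as a small sketch of the handling described above (the 
samples list and decoded dict are just names for this example):

```python
import html

# One known entity, one unknown entity, and the escaped form that the
# unknown entity is treated as equivalent to.
samples = ["&mdash;", "&emdash;", "&amp;emdash;"]

# Decode each sample according to the HTML 5 character reference rules.
decoded = {sample: html.unescape(sample) for sample in samples}

for sample, result in decoded.items():
    print(f"{sample!r} -> {result!r}")
```

The unknown &emdash; comes back unchanged, which is the same text that 
&amp;emdash; decodes to, while &mdash; becomes U+2014.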

>> it's what browsers do now,
> 
> Mostly, yes.
> 
>> and changing such behaviour could potentially result in billions of 
>> pages breaking.
> 
> No, there aren't billions of pages with &emdash; in them. Those that 
> have it _mean_ the em dash, so most browsers display the page as broken, 
> i.e. as contrary to what the author surely meant. Some browsers do 
> otherwise, and HTML5 wants to prohibit that.

At least by defining the correct behaviour, the error handling will be 
consistent in all browsers, even if the result isn't exactly what the 
author intended.  If it's not, the author should fix it.

>> If such behaviour were to be implemented, the precise algorithm would 
>> need to be specced.
> 
> No, definitely not. Error handling is an area where different strategies 
> can be applied.

Wrong!  Error handling is one of the most important things to do 
interoperably.  Consider CSS, which does define precisely how to handle 
errors gracefully.  That is one case where the spec got it right and 
many browsers do handle syntax errors interoperably (even though some 
browsers haven't quite got there yet).
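
A minimal sketch of the kind of rule CSS defines: a declaration that is 
malformed or uses an unknown property is ignored, and the rest of the 
block is still applied.  The KNOWN_PROPERTIES set is a hypothetical 
stand-in for what a browser supports, and splitting on ";" is a 
simplification that ignores strings, comments and nested blocks.

```python
# Hypothetical subset of properties this "browser" understands.
KNOWN_PROPERTIES = {"color", "width", "margin"}


def parse_declarations(block):
    """Parse 'name: value; ...' and drop declarations that are invalid."""
    result = {}
    for declaration in block.split(";"):
        name, colon, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if not colon or not value or name not in KNOWN_PROPERTIES:
            # Malformed or unknown: ignore only this declaration and
            # carry on with the next one.
            continue
        result[name] = value
    return result
```

Given "color: red; colr: blue; width 10px; margin: 0", only the color and 
margin declarations survive, and every implementation following the same 
rule would agree on that result.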

>> Every single HTML spec in existence from HTML 2.0 to HTML 4.01 and 
>> XHTML 1.0, 1.1 and 2.0, regardless of their official status, either 
>> is, or is very close to being, irrelevant in the real world.
> 
> That's nonsense and you know it. Just because they contain unimplemented 
> features doesn't make them irrelevant.

Those specs contain unimplemented features because they cannot be 
implemented.  Some things are left undefined, others are defined in ways 
that aren't compatible with the content on the web.  As far as 
implementers are concerned, they cannot implement any one of those specs 
exactly as written and expect to be usable in the real world.

>> Regardless of what you may think, and regardless of its official 
>> status, HTML5 is the only really relevant HTML spec in existence for 
>> implementers these days.
> 
> Despite not existing? It's not even close to a draft specification, just 
> a discussion document.

It is a draft specification, despite not being published on w3.org yet. 
Browsers are much closer to implementing it than they are to 
implementing HTML4 properly.

>>> There will be little interest in it by most authors, if the dominant 
>>> browser will not conform to it or make any serious attempt at 
>>> conformance. It might start the next round of browser wars, though.
>>
>> The development of the HTML5 spec has the support of at least 4 major 
>> browser vendors (IE, Mozilla, Opera and Safari).  None of them are 
>> interested in another round of browser wars.
> 
> Really? Where can I read Microsoft's commitment to HTML5?

Microsoft have several representatives in the HTMLWG, including Chris 
Wilson who is a co-chair, so they are clearly in support of the 
development of HTML.  Chris indicated that he has no problem with using 
the WHATWG's work as the basis for the HTMLWG.  Even though he doesn't 
agree with everything in the spec as is, he said it would be a 
disservice to not make use of it.

http://lists.w3.org/Archives/Public/public-html/2007Apr/1240.html

-- 
Lachlan Hunt
http://lachy.id.au/
Received on Monday, 30 April 2007 13:02:57 UTC