[whatwg] Thoughts on HTML 5 from Benjamin Hawkes-Lewis on 2008-12-18 (public-whatwg-archive@w3.org from December 2008)

From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
Date: Thu, 18 Dec 2008 15:47:23 +0000
Message-ID: <494A708B.4010004@googlemail.com>
Giovanni Campagna wrote:
> 2008/12/17 Ian Hickson <ian at hixie.ch>
> 
> 
>     This doesn't cost any time in HTML either, since the tokeniser doesn't
>     need to worry about what tags have end tags, the tree construction side
>     just drops unexpected end tags on the floor.
> 
> I don't think authors expect tags to disappear.

Perhaps (got any actual evidence about author expectations in this 
case?), but that's not a problem for tokenizer performance. You're 
"shifting the goalposts".

Anyway, if we're talking authorial expectations, ordinary authors don't 
expect

<a href="http://example.com?foobar&baz">

to be an unrecoverable error, but it is in XHTML.
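
A minimal sketch of that difference, using only Python's standard library. Note the assumptions: xml.etree.ElementTree stands in for an XHTML (XML) parser and html.parser for a tolerant text/html tokenizer; neither is a browser or a conforming HTML5 parser, and the helper names (xml_accepts, html_hrefs, HrefCollector) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The unescaped "&" is what makes this ill-formed as XML.
MARKUP = '<a href="http://example.com?foobar&baz">link</a>'


def xml_accepts(markup):
    """Return True if an XML parser accepts the markup, False on a fatal error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class HrefCollector(HTMLParser):
    """Collects href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.hrefs.append(value)


def html_hrefs(markup):
    """Return the hrefs a tolerant HTML tokenizer recovers from the markup."""
    collector = HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


print(xml_accepts(MARKUP))   # the XML parser rejects the whole document
print(html_hrefs(MARKUP))    # the HTML tokenizer still yields the link
```

Escaping the ampersand as &amp; makes the XML parser accept the same markup.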

It's not like either of these syntaxes makes sense to ordinary people or 
were even intended to do so. The original authoring model for HTML was 
supposed to be "paragraph" and "anchor", mediated by some sort of 
vaguely WYSIWYG type editor, not angle-bracketed tags.

>      > don't check for insertion modes
> 
>     Having an insertion mode isn't particularly a performance cost. (It
>     affects code footprint, but that's about it.)
> 
> 1) it needs more code (one x insertion mode): more code is always less 
> performance, even if it is just to load a bigger executable
> 2) it needs code to  select the insertion mode for the next element 
> (when the spec says  to reset the insertion mode): in the worst case it 
> has to compare nodeName 18 times
> 
>  > That's the same as HTML.
> No it is not. HTML defines special behaviour for the following elements: 
>  address, area, article, aside, base, basefont, bgsound, blockquote, 
> body, br, center, col, colgroup, command, datagrid, dd, details, dialog, 
> dir, div, dl, dt, embed, eventsource, fieldset, figure, footer, form, 
> frame, frameset, h1, h2, h3, h4, h5, h6, head, header, hr, iframe, img, 
> input, isindex, li, link, listing, menu, meta, nav, noembed, noframes, 
> noscript, ol, p, param, plaintext, pre, script, section, select, spacer, 
> style, tbody, textarea, tfoot, thead, title, tr, ul, and wbr.
> I think that is far too many to say that it is like XML.
> 
>  > There are a number of HTML5 parser implementations, and data suggests 
> that
>  > there is no particular performance gain.
> There are no actual HTML5 parser implementations, only HTML4 compatible 
> with new syntax.

Ahem, there are several:

http://www.google.com/search?q=html5+parser

>  > There's no guessing in HTML either; all input streams have very specific
>  > and required results.
> Actually, there's nothing that really says that <div><p>some 
> text</p><p>some more text</p></div> is more correct than <div><p>some 
> text<p>some more text</p></p></div>
> 
> Just when writing the specification you guess that the first possibility 
> is what the author intended. You are guessing, not the browser.

A conforming browser will interpret the markup as specified by the 
specification, so there is no difference.
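
To illustrate "no guessing", here is a toy tree builder in Python. It is a sketch under loud assumptions, not the HTML5 tree construction algorithm: it implements only two fixed rules (a new <p> closes a <p> that is still open, and an end tag with no matching open element is ignored), and the names ToyTreeBuilder and build are invented for this example. The real algorithm has many more rules, including special cases for certain stray end tags (a stray </p> is one of them), so a conforming parser's tree for the second input need not match this toy's. The point is only that fixed rules map every input to exactly one tree.

```python
from html.parser import HTMLParser


class ToyTreeBuilder(HTMLParser):
    """Toy tree construction with two fixed rules; NOT the HTML5 algorithm."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]  # stack of open elements, root at the bottom

    def _is_open(self, tag):
        return any(name == tag for name, _ in self.stack)

    def _close(self, tag):
        # Pop open elements up to and including the nearest one named `tag`.
        while self.stack[-1][0] != tag:
            self.stack.pop()
        self.stack.pop()

    def handle_starttag(self, tag, attrs):
        # Rule 1: a new <p> implicitly closes a <p> that is still open.
        if tag == "p" and self._is_open("p"):
            self._close("p")
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Rule 2: an end tag with no matching open element is ignored.
        if self._is_open(tag):
            self._close(tag)

    def handle_data(self, data):
        self.stack[-1][1].append(data)


def build(markup):
    """Return the toy tree for `markup` as nested (name, children) tuples."""
    builder = ToyTreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


first = "<div><p>some text</p><p>some more text</p></div>"
second = "<div><p>some text<p>some more text</p></p></div>"
print(build(first))
print(build(second))
```

Under these two toy rules both inputs produce the same tree, and they do so every time: the choice was made once, when the rules were written, not by the parser at run time.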

> Every input, even from the most 
> trustworthy source, must be parsed for errors and then checked after 
> publishing.

In practice, people find this very hard for XML and most web publishing 
systems (WordPress etc.) don't work like this even if they should.

Also, much of the web is ad-supported. The ads ecosystem is based around 
including markup from trusted sources. Those including the markup are 
generally not able to exert much control over the included markup, even 
when they are some of the biggest publishers on the web. Getting ads to 
have user-friendly HTML (e.g. alt attributes for image links) is nigh 
impossible; trying to get conforming HTML is a wet dream; and trying to 
get ads in valid XML is likely to be a complete non-starter. Why would 
an ad creator bother, when they could choose a different partner and use 
their old text/html ads?

> And if an end user finds an error, he probably will report it to the 
> owner of the web site, who in turn will report it (quite angrily) to the 
> web designer. Something like: "What on earth are you doing in front of the 
> coffee machine? I don't pay you to rest! Fix that website immediately!"

"Probably" - got any empirical evidence for that? I don't usually report 
errors in websites I visit (even _I_ usually have other things to do 
with my time).

In any case, avoiding angry customers complaining because XML threw a 
fatal error that would have been handled gracefully in HTML is an 
infinitely stronger incentive for developers to keep using text/html 
than anything the spec might say on the matter, so this isn't a 
persuasive argument for switching to application/xhtml+xml.

>  > Well, they've ignored it for the past 7 years, so why would they change?
> Nobody told the user that he was browsing a deprecated web site. If 
> something like the IE7 information bar (i.e. a non-modal bar that can be 
> disabled and does not annoy the user, but is immediately visible) could 
> appear on a web site sent as text/html, I think companies wouldn't like 
> their site tagged as "deprecated" and would port it to 
> application/xhtml+xml in no time (can you imagine what "deprecated" could 
> mean on a news web site?)

Indeed, they would be upset. And they might even try porting it.

However, there's little incentive for browser makers to throw 
information bars over the majority of the existing web just to assuage 
your desire for people to switch to XML.

In fact, there are disincentives for browser vendors to include such an 
information bar since:

1. Users will complain about error messages about sites that have always 
worked just fine. ("I'm switching back to IE8.")

2. Users will be trained to ignore error messages since sites work just 
fine even with a finger-wagging information bar slapped across the top, 
which is a security risk.

Even persuading browser vendors to include an indication of whether a 
website is valid or not has been a non-starter for every browser except 
iCab - and even iCab has dropped that indication in the latest version.


>  > Anyway, it isn't clear that we would _want_ to deprecate HTML, even if we
>  > had any real choice in the matter.
> 
> I'm not sure if I understood your sentence (sorry, English is not my 
> mother language). Anyway, you just have to put an "authoring 
> requirement" for text/html

Ian was effectively asking: "Why deprecate text/html?" You appear to be 
trying to answer: "How would we deprecate text/html?" which is a 
different question (and I've indicated some problems with your 
suggestion above).

> Gradually, no. 3 will disappear, because there's no actual need for HTML.

Except on the ad-supported web?

--
Benjamin Hawkes-Lewis
Received on Thursday, 18 December 2008 07:47:23 UTC