[whatwg] Thoughts on HTML 5 from Benjamin Hawkes-Lewis on 2008-12-18 (public-whatwg-archive@w3.org from December 2008)

From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
Date: Thu, 18 Dec 2008 18:21:18 +0000
Message-ID: <494A949E.3050705@googlemail.com>
Giovanni Campagna wrote:
> 2008/12/18 Benjamin Hawkes-Lewis <bhawkeslewis at googlemail.com>
> 
> 
>     Perhaps (got any actual evidence about author expectations in this
>     case?), but that's not a problem for tokenizer performance. You're
>     "shifting the goalposts".
> 
>  
> My comment about tokenizer performance was later. By the way, authors 
> should not expect invalid markup to work in any particular way (in the 
> past they did, and wrote specific markup for specific implementations)

Depends on the context of that expectation. Authors should expect 
HTML5-conforming parsers to handle invalid markup precisely as specified 
in the HTML5 specification. They should not expect legacy browsers to do 
the same, but probably will in practice.

>     Anyway, if we're talking authorial expectations, ordinary authors
>     don't expect
> 
>     <a href="http://example.com?foobar&baz">
> 
>     to be an unrecoverable error, but it is in XHTML.
> 
>  
> authors didn't expect that example.com?foobar&section=1 became 
> example.com?foobar§ion=1, but this happened in Netscape and IE quite 
> long ago
> if they got an error, at least they knew that it was not a correct 
> syntax and should have been avoided, since it could lead to different 
> results in different browsers
> (it is not valid HTML, btw)

If the author ever saw the error, which would be dependent on a culture 
of bug reporting.
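
For reference, the two ampersand cases quoted above can be reproduced with 
Python's standard library. This is only a rough sketch: the helper names 
(xml_accepts, HrefCollector, html_hrefs, legacy_text_decode) are made up for 
illustration, and html.unescape applies text-content decoding rules with no 
attribute-specific exception, so it only approximates the legacy browser 
behaviour described.

```python
import html
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The anchor quoted above, with an unescaped ampersand in the href.
MARKUP = '<a href="http://example.com?foobar&baz">link</a>'


def xml_accepts(markup):
    """Return True if an XML parser accepts the markup as well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class HrefCollector(HTMLParser):
    """Tolerant HTML parser that records every href attribute it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        self.hrefs.extend(value for name, value in attrs if name == "href")


def html_hrefs(markup):
    """Parse markup with the tolerant parser and return the collected hrefs."""
    collector = HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


def legacy_text_decode(text):
    """Decode character references the way text content is decoded,
    where "&sect" counts as a reference even without a semicolon."""
    return html.unescape(text)
```

With these definitions, xml_accepts(MARKUP) is False (the XML parser stops at 
the bare "&"), html_hrefs(MARKUP) still recovers the URL unchanged, and 
legacy_text_decode("example.com?foobar&section=1") turns "&sect" into the 
section sign.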

>     It's not like either of these syntaxes make sense to ordinary people
>     or were even intended to do so. The original authoring model for
>     HTML was supposed to be "paragraph" and "anchor", mediated by some
>     sort of vaguely WYSIWYG type editor, not angle-bracketed tags.
> 
> If you don't like less-than and greater-than signs (they are not Unicode 
> angle brackets, actually), publish your work in PDF or DOC. 

Those have even worse syntaxes - and as with most HTML, they aren't 
typically written by hand.

> HTML stands for 
> HyperText Markup Language. Markup (i.e. tags) can't be removed.

Not sure what you're arguing here.

You're the one suggesting people move from one serialization (text/html) 
to another (XML) on the basis that one is easier to understand. I'm 
basically saying that XML is still so hard for ordinary people to 
understand that it's not obviously worth the pain of migration on 
ease-of-use grounds and that, in this context, its draconian 
error-handling acts as an incentive to drive people back to text/html. 
They don't understand that either, but at least it works (or at least, 
appears to).

>     A conforming browser will interpret the markup as specified by the
>     specification, so there is no difference.
> 
> Yes, the fact is that the specification itself "guesses" what an average 
> author means when writing HTML

Correct. Likewise for XHTML (guessing that the author meant paragraph 
when a "p" element is supplied and quotation when a "blockquote" element 
is supplied, for example).

>     In practice, people find this very hard for XML and most web
>     publishing systems (WordPress etc.) don't work like this even if
>     they should.
> 
> Why do SQL injections or buffer overrun attacks happen? Because 
> applications don't check for input. The same for XML: you check, you're 
> sure nobody will try to take your site down. You don't check, that's 
> your fault.

Absolutely true; I'm personally a big fan of aggressive input validation 
for text/html.

Input validation is worth advocating on its own grounds, whether for 
HTML or XHTML.

It doesn't automatically follow from this that making end-users suffer 
because of input validation failures where error-recovery could be 
graceful is a good idea.
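
To make the distinction concrete, here is a minimal sketch of validating at 
input time while still recovering gracefully at output time. It assumes 
Python's standard library; accept_fragment is a hypothetical helper, and it 
checks XML well-formedness only (it is not a sanitizer, and it will reject 
HTML-only named entities such as &nbsp;).

```python
import html
import xml.etree.ElementTree as ET
from typing import Tuple


def accept_fragment(fragment: str) -> Tuple[str, bool]:
    """Validate a user-submitted markup fragment when it is submitted.

    Returns (markup_to_store, was_well_formed). A well-formed fragment is
    stored unchanged; anything else is stored as escaped plain text, so
    readers still get a usable page instead of a parse error.
    """
    try:
        # Wrap in a dummy root so fragments with several top-level
        # nodes can still be checked.
        ET.fromstring("<div>" + fragment + "</div>")
    except ET.ParseError:
        # Recover gracefully: keep the content, drop the broken markup.
        return html.escape(fragment), False
    return fragment, True
```

The submitter can be told about the failure (was_well_formed is False) 
without the people reading the page ever seeing an error.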

>     Also, much of the web is ad-supported. The ads ecosystem is based
>     around including markup from trusted sources. Those including the
>     markup are generally not able to exert much control over the
>     included markup, even when they are some of the biggest publishers
>     on the web. Getting ads to have user-friendly HTML (e.g. alt
>     attributes for image links) is nigh impossible; trying to get
>     conforming HTML is a wet dream; and trying to get ads in valid XML
>     is likely to be a complete non-starter. Why would an ad creator
>     bother, when they could choose a different partner and use their old
>     text/html ads?
> 
> If the ad buyer refuses to buy a non-valid-XML ad, the ad creator will 
> probably rewrite it.

You're ignoring the fact that there is a competitive advantage in 
supporting text/html ads, because the most popular browser doesn't 
support application/xhtml+xml and because text/html has non-draconian 
error handling so end-users are more likely to see the ad.

Your best hope would be to identify a competitive advantage for a 
person buying advertising to show an application/xhtml+xml ad that would 
justify a higher risk of users not seeing the ad. Good luck with that. :)

>     "Probably" - got any empirical evidence for that? I don't usually
>     report errors in websites I visit (even _I_ usually have other
>     things to do with my time).
> 
> If any error prevents someone from correctly browsing that page, he 
> first reports that to the site owner, then to the browser creator.

Does he? Again, I don't normally report such problems at all; I 
typically just go to another (competitor's) site. I suspect most users 
do the same. And I'm sure ad-supported or commercial websites aren't 
interested in converting potential ad revenue or customers into bug 
reports or their competitors' gain.

> If a user complains about a warning (not error) indication, he can 
> disable it (but not security errors).

If they know how and care. (Actually, you can typically disable a lot of 
security warnings too.)

> On the other hand, some users will 
> complain to the site creator instead of to the browser creator.

What's the incentive for browser creators? What's the incentive for 
end-users?

>     Ian was effectively asking: "Why deprecate text/html?" You appear to
>     be trying to answer: "How would we deprecate text/html?" which is a
>     different question (and I've indicated some problems with your
>     suggestion above).
> 
> Sorry, I didn't understand (it looked like "we want to deprecate html 
> but we don't have instruments", but it didn't make much sense).
> 
>     Except on the ad-supported web?
> 
> 1) use <iframe>
> 2) use <object>
> 3) use <embed>

Resulting in additional HTTP requests and still requiring use of 
text/html in the including content. Not always an option and doesn't 
actually allow you to deprecate text/html.

> 4) use <img>

Using images for text poses accessibility problems, reduces performance, 
and doesn't help if you want an animated ad or an ad with interactive 
elements. Not an option.

> 5) use well-formed XHTML

Your competitors will support bad HTML, so you will fail to persuade ad 
buyers to use well-formed XHTML. Not an option.

> 6) use JS + DOM

Ads already do use JS + DOM. They especially like using document.write 
to inject strings of text/html.

> Do you think it is enough?

No, I think it's hopelessly unrealistic to be honest.

If the web had begun with draconian error-handling and systems like 
WordPress and ads designed to survive such error-handling, it might have 
been realistic.

Now that we have a _commercial_ ecosystem built on tolerance for broken 
input and broken output and therefore _commercial_ advantages for being 
interoperable with that ecosystem, I think you'd need to fix that 
ecosystem before moving everybody to a serialization with draconian 
error-handling. That means arguing for input validation and conforming 
markup on their own grounds (e.g. security, reliability, accessibility), 
not hitched to the XHTML bandwagon.

--
Benjamin Hawkes-Lewis
Received on Thursday, 18 December 2008 10:21:18 UTC