- From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
- Date: Thu, 18 Dec 2008 18:21:18 +0000
Giovanni Campagna wrote:

> 2008/12/18 Benjamin Hawkes-Lewis <bhawkeslewis at googlemail.com>
>
>> Perhaps (got any actual evidence about author expectations in this case?), but that's not a problem for tokenizer performance. You're "shifting the goalposts".
>
> My comment about tokenizer performance was later. By the way, author should not expect that invalid markup work in any particular way (in the past they did and wrote specific markup for specific implementation)

Depends on the context of that expectation. Authors should expect HTML5-conforming parsers to handle invalid markup precisely as specified in the HTML5 specification. They should not expect legacy browsers to do the same, but probably will in practice.

>> Anyway, if we're talking authorial expectations, ordinary authors don't expect
>>
>> <a href="http://example.com?foobar&baz">
>>
>> to be an unrecoverable error, but it is in XHTML.
>
> authors didn't expect that example.com?foobar&section=1 became example.com?foobar§ion=1 but this happened in Netscape and IE quite long ago
> if they got an error, at least they knew that it was not a correct syntax and should have been avoided, since it could lead to different results in different browsers
> (it is not valid HTML, btw)

If the author ever saw the error, which would be dependent on a culture of bug reporting.

>> It's not like either of these syntaxes make sense to ordinary people or were even intended to do so. The original authoring model for HTML was supposed to be "paragraph" and "anchor", mediated by some sort of vaguely WYSIWYG type editor, not angle-bracketed tags.
>
> If you don't like like less-than and greater-than (it is not Unicode angle bracket actually), publish your work in PDF or DOC.

Those have even worse syntaxes - and as with most HTML, they aren't typically written by hand.
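
The two behaviours discussed above can be sketched with Python's standard library. This is only an illustration, not something from the thread: the names `MARKUP`, `HrefCollector`, `xml_accepts`, `html_hrefs` and `legacy_decode` are hypothetical, and `html.parser` is a lenient parser rather than a full HTML5-conforming one. An XML parser rejects the unescaped ampersand outright, while a lenient HTML parser recovers the attribute value. `html.unescape` applies the text-content decoding rules, so it reproduces the `&sect` surprise; the HTML5 rules for attribute values treat this case differently.

```python
import html
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The invalid link from the discussion: "&" is not escaped as "&amp;".
MARKUP = '<a href="http://example.com?foobar&baz">link</a>'


class HrefCollector(HTMLParser):
    """Collects href values while recovering from invalid markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)


def xml_accepts(markup):
    """Return True if an XML parser accepts the markup as well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Draconian handling: one bad ampersand rejects the whole document.
        return False


def html_hrefs(markup):
    """Return the href values a lenient HTML parser extracts."""
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


def legacy_decode(text):
    """Decode character references, including legacy ones without ';'."""
    return html.unescape(text)


if __name__ == "__main__":
    print("XML accepts markup:", xml_accepts(MARKUP))
    print("HTML parser hrefs:", html_hrefs(MARKUP))
    print("Decoded query:", legacy_decode("example.com?foobar&section=1"))
```

Escaping the ampersand as `&amp;` makes the same link acceptable to both parsers, which is the input-validation point made later in the thread.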

> HTML stays for HyperText Markup Language. Markup (i.e. tags) can't be removed.

Not sure what you're arguing here. You're the one suggesting people move from one serialization (text/html) to another (XML) on the basis that one is easier to understand. I'm basically saying that XML is still so hard for ordinary people to understand that it's not obviously worth the pain of migration on ease-of-use grounds and that, in this context, its draconian error-handling acts as an incentive to drive people back to text/html. They don't understand that either, but at least it works (or at least, appears to).

>> A conforming browser will interpret the markup as specified by the specification, so there is no difference.
>
> Yes, the fact is that the specification itself "guesses" what an average author thinks when it writes HTML

Correct. Likewise for XHTML (guessing that the author meant paragraph when a "p" element is supplied and quotation when a "blockquote" element is supplied, for example).

>> In practice, people find this very hard for XML and most web publishing systems (WordPress etc.) don't work like this even if they should.
>
> Why do SQL injections or buffer overrun attacks happen? Because applications don't check for input. The same for XML: you check, you're sure nobody will try to take your site down. You don't check, that's your fault.

Absolutely true; I'm personally a big fan of aggressive input validation for text/html. Input validation is worth advocating on its own grounds, whether for HTML or XHTML. It doesn't automatically follow from this that making end-users suffer because of input validation failures where error-recovery could be graceful is a good idea.

>> Also, much of the web is ad-supported. The ads ecosystem is based around including markup from trusted sources. Those including the markup are generally not able to exert much control over the included markup, even when they are some of the biggest publishers on the web.
>> Getting ads to have user-friendly HTML (e.g. alt attributes for image links) is nigh impossible; trying to get conforming HTML is a wet dream; and trying to get ads in valid XML is a likely to be a complete non-starter. Why would an ad creator bother, when they could choose a different partner and use their old text/html ads?
>
> If ad buyer refuses to buy a non-valid-XML ad, probably the ad creator will rewrite them.

You're ignoring the fact that there is a competitive advantage in supporting text/html ads, because the most popular browser doesn't support application/xhtml+xml and because text/html has non-draconian error handling so end-users are more likely to see the ad. Your best hope would be to identify a competitive advantage for a person buying advertising to show an application/xhtml+xml ad that would justify a higher risk of users not seeing the ad. Good luck with that. :)

>> "Probably" - got any empirical evidence for that? I don't usually report errors in websites I visit (even _I_ usually have other things to do with my time).
>
> If any error prevents someone from correctly browsing that page, he first reports that to web owner, then to browser creator.

Does he? Again, I don't normally report such problems at all; I typically just go to another (competitor's) site. I suspect most users do the same. And I'm sure ad-supported or commercial websites aren't interested in converting potential ad revenue or customers into bug reports or their competitors' gain.

> If an user complains about a warning (not error) indication, he can disable it (but not security errrors).

If they know how and care. (Actually, you can typically disable a lot of security warnings too.)

> On the other hand, some user will complain with the site creator, instead of with the browser creator.

What's the incentive for browser creators? What's the incentive for end-users?

>> Ian was effectively asking: "Why deprecate text/html?"
>> You appear to be trying to answer: "How would we deprecate text/html?" which is a different question (and I've indicated some problems with your suggestion above).
>
> Sorry, I didn't understand (it looked like "we want to deprecate html but we don't have instruments", but it didn't make much sense).
>
>> Except on the ad-supported web?
>
> 1) use <iframe>
> 2) use <object>
> 3) use <embed>

Resulting in additional HTTP requests and still requiring use of text/html in the including content. Not always an option and doesn't actually allow you to deprecate text/html.

> 4) use <img>

Using images for text poses accessibility problems, reduces performance, and doesn't help if you want an animated ad or an ad with interactive elements. Not an option.

> 5) use well-formed XHTML

Your competitors will support bad HTML, so you will fail to persuade ad buyers to use well-formed XHTML. Not an option.

> 6) use JS + DOM

Ads already do use JS + DOM. They especially like using document.write to inject strings of text/html.

> Do you think it is enough?

No, I think it's hopelessly unrealistic to be honest. If the web had begun with draconian error-handling and systems like WordPress and ads designed to survive such error-handling, it might have been realistic. Now that we have a _commercial_ ecosystem built on tolerance for broken input and broken output and therefore _commercial_ advantages for being interoperable with that ecosystem, I think you'd need to fix that ecosystem before moving everybody to a serialization with draconian error-handling. That means arguing for input validation and conforming markup on their own grounds (e.g. security, reliability, accessibility), not hitched to the XHTML bandwagon.

--
Benjamin Hawkes-Lewis
Received on Thursday, 18 December 2008 10:21:18 UTC