- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 27 Oct 2011 13:25:08 +0300
- To: Stéphane Corlosquet <scorlosquet@gmail.com>
- Cc: Gregg Kellogg <gregg@kellogg-assoc.com>, Guha <guha@google.com>, Manu Sporny <msporny@digitalbazaar.com>, HTML Data Task Force WG <public-html-data-tf@w3.org>
On Tue, Oct 25, 2011 at 7:47 PM, Stéphane Corlosquet <scorlosquet@gmail.com> wrote:
> On Tue, Oct 25, 2011 at 4:00 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
>> Note that HTML5 does not try to be backwards-compatible with the HTML
>> 4.01 spec. It tries to be compatible with existing content. That is,
>> it tries to be compatible with content that's actually on the Web--not
>> with content that one could construct based on the HTML 4.01 spec.
>
> Thanks for raising this point, Henri. You bring an interesting perspective.
> I'm curious to know how similar decisions are made in the context of HTML5.
> How does a public working group such as the WHATWG (which afaik does not
> have the resources to index the whole web) go about deciding what feature
> or markup pattern can be dropped from a spec? Are there representative
> samples that you use? Or is it merely based on what feedback you get from
> people who "show up" and give feedback to the working group? Do browser
> vendors such as Mozilla have any ability to help here? What do you do about
> deep pages hidden behind a password or a noindex courtesy? Extrapolate the
> findings from the public web? The RDFa WG is seeking ways to assess what
> patterns are used or not used in the wild (tangible numbers or % tend to
> carry a lot of weight) so any hint would help.

First of all, there aren't agreed-upon standards of evidence. The first step really boils down to convincing the people who can effect change in browser code to dare to change the code. As a second step, if a change to a browser is made, the change needs to stick around for release (i.e. no one who can reverse the change freaks out too much over early reports of breakage).

First, how do you convince people who can effect change to try to make a change? The first heuristic is using existing browser behavior as a proxy indicator of what browsers need to be like in order to successfully process existing content. If Gecko, WebKit, Trident and Presto all do the same thing, we generally tend to accept this as evidence that either that behavior was necessary in order to successfully render Web content or, if it wasn't initially, it is by now, because Web authors do all kinds of things, let stuff that "works" stick, and this particular thing now "works" in all four engines.

For example, the kind of craziness discussed at https://plus.google.com/u/0/107429617152575897589/posts/TY8zGybVos4 arises from content depending on a bug in ancient Netscape versions. OTOH, the standards mode behavior for <p><table> discussed at http://hsivonen.iki.fi/last-html-quirk/ wasn't required by existing content (if you exclude Acid2 from existing content) when it got universally implemented, but by now there's probably enough standards-mode content that depends on it that it's no longer worthwhile to rock the boat and try to make the standards mode behave like the quirks mode.

So when the four engines behave the same, that's usually it and we don't change stuff without a *really* good reason. But to give a counter-example, even universal behavior has gotten changed on security and efficiency grounds: Before HTML5, HTML parsers in browsers rewound the input and reparsed it in a different mode if they hit the end of file inside a script element or inside a comment. (And style, title and textarea, though less consistently, at least in the case of title.)
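(To make the old behavior concrete, here is a minimal sketch, in Python, of the kind of crude scan one could run over a locally stored crawl to count pages that hit the end of file inside a script element or a comment. The directory layout and the exact regular expressions are assumptions made for illustration; this is not the script used in the experiments mentioned below.)

# A crude scan over a locally stored crawl: count pages where the source
# ends while still inside a script element or inside a comment. The
# directory layout ("*.html" files under a crawl directory) and the exact
# regular expressions are assumptions made for this sketch.

import re
import sys
from pathlib import Path

# A <script ...> start tag with no </script> anywhere before end of file.
UNCLOSED_SCRIPT = re.compile(
    r"<script\b[^>]*>(?:(?!</script).)*\Z",
    re.IGNORECASE | re.DOTALL,
)
# A <!-- with no --> anywhere before end of file.
UNCLOSED_COMMENT = re.compile(r"<!--(?:(?!-->).)*\Z", re.DOTALL)

def main(crawl_dir: str) -> None:
    total = affected = 0
    for path in Path(crawl_dir).rglob("*.html"):
        text = path.read_text(encoding="utf-8", errors="replace")
        total += 1
        if UNCLOSED_SCRIPT.search(text) or UNCLOSED_COMMENT.search(text):
            affected += 1
    print(f"{affected} of {total} pages hit EOF inside a script element or comment")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")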
In HTML5, we changed this in order to implement a defense-in-depth feature against attacks based on the attacker forcing a premature end of file (which would make different parts of the page source be interpreted as script) and in order to make the parser less crazy. The obvious de-crazying of the area failed. It broke JS Beautifier, for example. We had a high-level idea of what an alternative design would have to look like. In this case, ideas were first tested by running regular expressions over data obtained by crawling pages linked to by the Open Directory Project. A solution I suggested failed miserably. The winning solution was suggested by Simon Pieters of Opera. From testing with pages linked to from the ODP, we knew that the solution would break a few pages out there. I was convinced that the breakage was small enough that it was worthwhile to try it in Firefox. I was also in a position where I was able to make the in-browser experiment happen. During the Firefox 4 beta period, I received only *one* report of in-the-wild breakage. It was on a bank site, which usually makes people freak out. However, since a low level of breakage was expected, we left the code in, and since then there haven't been any other reports of the issue against Firefox. IIRC, when Chrome implemented the same thing, they found one Google property triggering a problem and had the Google property fixed.

When all four engines don't already agree, Firefox doing something or Safari doing something can be taken as evidence that the behavior is safe and other browsers could adopt it, too. If Opera alone does something, it's generally not convincing enough, because Opera has rather low market share, so Opera behaving a particular way isn't *alone* convincing evidence that the behavior is successful on the Web. If IE alone does something, it's fairly convincing, but by now there's so much code that browser-sniffs IE vs. everyone else that it could be that IE's behavior is only successful on IE-specific code paths. However, we've used "IE and Opera do something" as convincing enough evidence to change Gecko and WebKit.

We've used the "Safari does it and seems to get away with it" heuristic multiple times. For example, unifying the namespace of HTML elements in documents parsed from text/html and documents parsed from application/xhtml+xml was something that I dared to try in Firefox, because Safari did it and was getting away with it. Making the change in Firefox broke Facebook, because Facebook browser-sniffed and served different code to all four engines. However, Facebook was phenomenally responsive and fixed their code quickly.

And this brings us to the topic of what you can break (i.e. what makes people freak out in the second step, when a change is being tested). On one hand, breaking major sites seems really scary and like an obvious thing not to do. However, Facebook and Google in particular push the limits of the platform much more than others and also have engineers continuously working on their stuff. So if a change only breaks a particular line of code on Facebook or a particular line of code on a Google property, it may be possible to change a browser anyway and get Facebook or Google to change their code a little bit. Breaking even a couple of major sites that aren't as big as Facebook or Google and whose daily activities don't revolve around pushing the limits of the Web platform is generally not OK. OTOH, breaking a couple of long-tail sites might be OK.
But breaking a large number of long-tail sites is not OK. In particular, breaking output from one authoring tool when the output has been spread all over the Web is generally not OK even if none of it was on major sites. (But it may happen without anyone noticing early enough, and then the breakage sticks.)

Some vocabulary design decisions and early parsing algorithm design decisions were based on experiments performed at Google using the data Google has from crawling the Web. Also, some decisions (I gave one example above) were informed by running experiments on dotbot (http://www.dotnetdotcom.org/) data or on data downloaded by taking URLs from the Open Directory Project. Using dotbot or ODP data isn't generally a good idea if you are investigating something that's so rare on the Web at the time of doing the research that you see almost none of it in small-scale general-purpose crawls. For example, the decision not to support prefixed SVG elements in text/html was informed by downloading and parsing all the SVG files in Wikimedia Commons, because it seemed likely that if an SVG authoring tool was popular, some content authored with it would have found its way into Wikimedia Commons.

So how would this apply to RDFa? Most of the above doesn't apply, except that it's not particularly productive to try to come up with rules of evidence with percentage occurrence cut-offs in advance. To be realistic, RDFa has much less legacy than HTML (what an understatement), so it might not be particularly worthwhile to put a lot of effort into saving the RDFa legacy, because RDFa doesn't yet have a vast body of interoperable content being consumed by different major consumers. For example, OGP data is consumed mainly by Facebook's code and the v vocabulary is consumed mainly by Google's Rich Snippets.

If the RDFa community doesn't want to throw away that legacy, it might make sense to see how Facebook consumes OGP data (hard-wired prefix; xmlns:foo ignored) and spec that for OGP consumption, and to see how Rich Snippets consumes the v vocabulary and spec that for processing v data. (Or, if it feels wrong to grandfather corporation-specific rules for Facebook and Google stuff, stop pretending that those are part of RDFa. Already neither Facebook nor Google implements RDFa as specced, so that stuff never really was RDFa anyway.) And then do a crawl analogous to the Wikimedia Commons crawl for SVG to discover what RDFa not made for Facebook or Google looks like, and generalize about that as if it were a separate format from OGP and the v vocabulary. (I don't know what would be to long-tail RDFa what Wikimedia Commons is to SVG, though, as a way of locating stuff without having to do a Google-scale crawl.)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Thursday, 27 October 2011 10:25:39 UTC