- From: Sam Ruby <rubys@intertwingly.net>
- Date: Mon, 12 Jul 2010 12:03:43 -0400
On Mon, Jul 12, 2010 at 11:41 AM, Julian Reschke <julian.reschke at gmx.de> wrote:
> On 12.07.2010 16:43, Mike Wilcox wrote:
>>
>> On Jul 12, 2010, at 8:39 AM, Nils Dagsson Moskopp wrote:
>>>
>>>> That's a little different. Google purposely uses unstandardized,
>>>> incorrect HTML in ways that still render in a browser in order to
>>>> make it more difficult for screen scrapers. They also "break it" in a
>>>> different way every week.
>>>
>>> Assuming this is true (which I find difficult to believe), wouldn't a
>>> screen scraper based on the HTML5 parsing algorithm defeat this
>>> purpose ?
>>
>> Honestly, I don't know. But W3 defaulted to an HTML5 validator:
>>
>> http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fsearch%3Fsource%3Dig%26hl%3Den%26rlz%3D%26%3D%26q%3Dhtml5%26aq%3Df%26aqi%3D%26aql%3D%26oq%3D%26gs_rfai%3D&charset=%28detect+automatically%29&doctype=Inline&group=0
>
> True, but a parser conforming to the spec (*) would handle those errors, so
> in this case obfuscation wouldn't work. Essentially, any code using that
> parser would see the same information as an off-the-shelf web browser.
>
>> ...
>> Besides the protecting of their API, Google also will scratch and claw
>> to save every byte. They are the gold standard of a high performance
>
> Understood. There's an ongoing controversy whether it makes sense to make
> things like these invalid (just stating, not offering an opinion).
>
>> website. While this may or may not explain the things that don't
>> validate, what it does say is that nothing coming from google.com
>> <http://google.com> is accidental.
>> ...
>
> I believe some time ago a certain Google employee actually *did* state that
> some of the conformance problems were unintentional. (yes, I did spend a few
> minutes finding that statement but wasn't successful).

http://lists.w3.org/Archives/Public/public-html/2010Mar/0555.html

> Best regards, Julian
>
> (*) Implementing error recovery, which IMHO isn't required.

- Sam Ruby
Received on Monday, 12 July 2010 09:03:43 UTC