- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Thu, 12 Nov 2009 00:33:17 -0500
On 11/11/09 11:57 PM, Aryeh Gregor wrote:
> A number of popular web apps output mostly well-formed XML, as far as
> I know: vBulletin, WordPress, etc.

I assume you meant "mostly" as in "most of the pages are well-formed", not "pages are mostly well-formed", since the latter is useless, right?

I did a brief survey of obvious sites fitting those descriptions that I had in my browser history at the moment.

These were not-well-formed:

http://www.dria.org/wordpress/archives/2009/11/10/1043/
http://bisdaktech.wordpress.com/
http://weekinthenee.wordpress.com/2009/11/11/sitting-in-a-park-in-paris-france/
http://terrytao.wordpress.com/2009/10/29/displaying-mathematics-on-the-web/
http://ehren.wordpress.com/2009/10/24/a-gcc-hack-my-0-1-release/
http://www.nvnews.net/vbulletin/showthread.php?t=104201
http://www.nvnews.net/vbulletin/showthread.php?t=132449

These are:

http://boomswaggerboom.wordpress.com/
http://fiber-space.de/wordpress/?p=1016
http://dafizilla.wordpress.com/2009/11/08/karmic-koala-hides-firefox-context-menuitems-icons/

So either you're looking at a totally different dataset or "mostly" is a bit of a stretch....

> Not even close to most websites, of course, but a significant number, I'd think.

Sure. 0.01% of all websites is a "significant number". I just think it's broken often enough, and easy enough to break by accident, that relying on it working for screen scraping is not likely to be happening on a wide scale....

>> Yes, but browsers would have to add explicit support for it.
>
> That mostly defeats the point -- they could equally add explicit
> support for non-XML responseXML first.

Yep.

> This makes it sound like if Wikipedia switches to HTML5 and isn't
> willing to break all screen-scrapers on principle, we'll have to use
> an obsolete but conforming doctype.

Or stop using HTML named entities, yes.

-Boris
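
For readers who want to repeat this kind of well-formedness survey, here is a minimal sketch in Python. It is not the method used for the survey above, which the message does not describe. The helper name `is_well_formed` and the sample markup strings are made up for illustration, and the check relies only on the standard library's `xml.etree.ElementTree` parser rejecting input that is not well-formed XML. It also shows why HTML named entities matter: without a DTD that declares them, an XML parser treats something like `&nbsp;` as an undefined entity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if `markup` parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Raised for mismatched tags, undefined entities, stray "&", and so on.
        return False
    return True


# Illustrative inputs, not taken from any of the pages listed above.
samples = {
    "closed empty element": "<p>a<br/>b</p>",
    "unclosed <br>": "<p>a<br>b</p>",
    "HTML named entity, no DTD": "<p>a&nbsp;b</p>",
    "numeric character reference": "<p>a&#160;b</p>",
}

for label, markup in samples.items():
    print(label, "->", is_well_formed(markup))
```

To check a live page, the response body would be passed to the same function in place of the sample strings. A single unclosed tag or undeclared entity anywhere in the page is enough to make the whole document fail, which is the fragility being discussed in the thread.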
Received on Wednesday, 11 November 2009 21:33:17 UTC