- From: Aryeh Gregor <Simetrical+w3c@gmail.com>
- Date: Thu, 12 Nov 2009 10:14:29 -0500
On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> I assume you meant "mostly" as in "most of the pages are well-formed", not
> "pages are mostly well-formed", since the latter is useless, right?
>
> I did a brief survey of obvious sites fitting those descriptions that I had
> in my browser history at the moment. . . .
>
> So either you're looking at a totally different dataset or "mostly" is a bit
> of a stretch....

I admit I didn't look closely. At a guess, maybe the default WordPress skin(s) are valid XHTML, but custom skins are very popular for WordPress and those mostly aren't valid XHTML? MediaWiki is unreasonably difficult to reskin, so that's not much of a problem for us . . .

> Sure. 0.01% of all websites is a "significant number". I just think it's
> broken often enough, and easy enough to break by accident, that relying on
> it working for screen scraping is not likely to be happening on a wide
> scale....

You're probably right.

> Or stop using HTML named entities, yes.

That's not really a very good option, given the size of MediaWiki's code base and the size of Wikipedia's database, and the ugliness of trying to remember what &#160; is when reading the HTML source. It sounds like we're stuck with a legacy doctype if we don't want to break screen-scrapers.
Received on Thursday, 12 November 2009 07:14:29 UTC