Re: ISSUE-4 - versioning/DOCTYPEs

On Mon, May 17, 2010 at 10:55 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> Do you know if people actually process your pages (as opposed to your feeds) using off-the-shelf XML parsers without any prior arrangement with you?

As I've noted before, this does actually happen to Wikipedia.  We get
immediate complaints when we serve a significant number of
non-well-formed pages.  People write tools to help them edit and to
automate common tasks, including bots that run completely
autonomously.  For our part, we prefer that people use the
machine-readable API instead of screen-scraping, because bots that
rely on parsing the actual web pages tend to break whenever we change
minor things; people screen-scrape anyway, though.
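
To illustrate the difference, here's a minimal sketch in modern
Python 3 (the URLs, page title and response field below are just
illustrative placeholders, and error handling is simplified).  A bot
that feeds the raw page to an off-the-shelf XML parser falls over as
soon as the markup isn't well-formed; the API hands back structured
data that doesn't depend on how the page markup is serialized.

    import json
    import urllib.request
    import xml.etree.ElementTree as ET

    PAGE_URL = "https://en.wikipedia.org/wiki/Example"    # placeholder
    API_URL = ("https://en.wikipedia.org/w/api.php"
               "?action=parse&page=Example&format=json")  # placeholder

    # Screen-scraping: an off-the-shelf XML parser raises ParseError
    # the moment the served page is not well-formed.
    html = urllib.request.urlopen(PAGE_URL).read()
    try:
        doc = ET.fromstring(html)
    except ET.ParseError as err:
        print("page is not well-formed XML:", err)

    # Machine-readable API: structured JSON that doesn't depend on
    # how the page happens to be marked up.
    data = json.load(urllib.request.urlopen(API_URL))
    print(data["parse"]["title"])   # field name is illustrative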

It would be nice if people could use HTML5 parsers instead, but
that's simply not possible yet.  In one case, a tool that broke was
actually using AJAX to scrape the page -- and browsers don't yet
support text/html parsing for AJAX, so XMLHttpRequest only hands back
a parsed document (responseXML) when the response is actually XML.
More time will be needed before this use case (relying on XML
processing of our pages) is actually obsolete, unfortunately.
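
On the server side, people can already get an HTML5 parse by pulling
in a library; the sketch below assumes the third-party html5lib
package for Python (the URL is again a placeholder).  It builds a
tree from arbitrary text/html markup the way a browser would, so it
doesn't care whether the page is well-formed.  There's no equivalent
for a scraper running inside the browser over XMLHttpRequest, though,
which is why the AJAX case above is stuck for now.

    import urllib.request
    import html5lib  # third-party HTML5 parser (assumed installed)

    html = urllib.request.urlopen(
        "https://en.wikipedia.org/wiki/Example").read()  # placeholder

    # The HTML5 parsing algorithm produces a usable tree even from
    # non-well-formed markup; by default html5lib returns an
    # ElementTree document.
    doc = html5lib.parse(html)
    print(doc.tag)   # "{http://www.w3.org/1999/xhtml}html"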

Received on Monday, 17 May 2010 19:34:05 UTC