- From: Aryeh Gregor <Simetrical+w3c@gmail.com>
- Date: Mon, 17 May 2010 15:27:15 -0400
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: Sam Ruby <rubys@intertwingly.net>, public-html@w3.org, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Boris Zbarsky <bzbarsky@mit.edu>, Daniel Glazman <daniel.glazman@disruptive-innovations.com>
On Mon, May 17, 2010 at 10:55 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> Do you know if people actually process your pages (as opposed to your feeds) using off-the-shelf XML parsers without any prior arrangement with you?

As I've noted before, this does actually happen to Wikipedia. We get immediate complaints when we serve a significant number of non-well-formed pages. People write tools to help them edit, automating common tasks, including bots that run completely autonomously.

For our part, we prefer people use the machine-readable API instead of screen-scraping, because bots that rely on parsing the actual web pages tend to break when we change minor things, but people screen-scrape anyway. It would be nice if people could use HTML5 parsers instead, but that's simply not possible yet. In one case, a tool that broke was actually using AJAX to scrape the page -- and browsers don't yet support text/html parsing for AJAX. More time will be needed before this use case is actually obsolete, unfortunately.
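To make the constraint concrete, here is a minimal sketch (not any real tool's code) of the two routes described above. It is written with today's fetch API for brevity; the tools of the time used XMLHttpRequest, but the parsing constraint is the same: without an HTML parser exposed to scripts, an in-browser scraper can only get a parsed DOM by treating the page as XML, so a single non-well-formed page breaks it, while the api.php route returns a stable machine-readable format. The article title, selector, and query parameters are illustrative assumptions.

```typescript
// Sketch only: why AJAX-based scraping depends on well-formed markup,
// and what the API-based alternative looks like.

// Screen-scraping route: fetch the rendered page and parse it as XML.
// There is no text/html parsing mode available to the script, so one
// well-formedness error in the served page yields a parse failure and
// the tool breaks.
async function scrapeHeading(title: string): Promise<string | null> {
  const res = await fetch(
    `https://en.wikipedia.org/wiki/${encodeURIComponent(title)}`
  );
  const doc = new DOMParser().parseFromString(await res.text(), "application/xml");
  // XML parse errors surface as a <parsererror> element (details vary by browser).
  if (doc.getElementsByTagName("parsererror").length > 0) {
    return null; // page was not well-formed XML -- the scraper breaks here
  }
  return doc.getElementsByTagName("h1")[0]?.textContent ?? null;
}

// API route: ask api.php for machine-readable output instead. Minor changes
// to the page markup do not affect this at all.
async function fetchViaApi(title: string): Promise<unknown> {
  const url =
    "https://en.wikipedia.org/w/api.php?action=query&prop=info&format=json" +
    "&origin=*" + // anonymous CORS, so the call works from a browser
    "&titles=" + encodeURIComponent(title);
  const res = await fetch(url);
  return res.json();
}
```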
Received on Monday, 17 May 2010 19:34:05 UTC