
Re: ISSUE-4 - versioning/DOCTYPEs

From: Aryeh Gregor <Simetrical+w3c@gmail.com>
Date: Mon, 17 May 2010 15:27:15 -0400
Message-ID: <AANLkTinNGU0-LmSX2a0K6Bk8d4XU22vcwukNi7bzo2T0@mail.gmail.com>
To: Henri Sivonen <hsivonen@iki.fi>
Cc: Sam Ruby <rubys@intertwingly.net>, public-html@w3.org, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Boris Zbarsky <bzbarsky@mit.edu>, Daniel Glazman <daniel.glazman@disruptive-innovations.com>
On Mon, May 17, 2010 at 10:55 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> Do you know if people actually process your pages (as opposed to your feeds) using off-the-shelf XML parsers without any prior arrangement with you?

As I've noted before, this does actually happen to Wikipedia.  We get
immediate complaints when we serve a significant number of
non-well-formed pages.  People write tools to help them edit and to
automate common tasks, including bots that run completely
autonomously.  For our part, we prefer that people use the
machine-readable API instead of screen-scraping, because bots that
rely on parsing the actual web pages tend to break when we change
minor things, but people screen-scrape anyway.
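
(For illustration only: a minimal sketch of the API route, written in
TypeScript against today's fetch API rather than anything an actual
bot runs.  The api.php parameters -- action=query, prop=revisions,
rvprop=content, format=json -- are the documented MediaWiki API; the
surrounding code is assumed.)

  // Fetch the latest wikitext of a page from the MediaWiki API,
  // instead of scraping the rendered HTML.
  async function fetchWikitext(title: string): Promise<string> {
    const params = new URLSearchParams({
      action: "query",
      prop: "revisions",
      rvprop: "content",
      titles: title,
      format: "json",
    });
    const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
    const data = await res.json();
    // The result is keyed by page ID; take the single page we asked for.
    const page = data.query.pages[Object.keys(data.query.pages)[0]];
    return page.revisions[0]["*"];  // "*" holds the revision's wikitext
  }

A bot built on this keeps working across skin and markup changes,
which is exactly why we point people at the API.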

It would be nice if people could use HTML5 parsers instead, but that's
simply not possible yet.  In one case, a tool that broke was actually
using AJAX to scrape the page -- and browsers don't yet support
text/html parsing for AJAX.  Unfortunately, more time will be needed
before this use case is actually obsolete.
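
(Again a sketch, to show the failure mode rather than any particular
tool's code.  With XMLHttpRequest, responseXML is only populated when
the response parses as XML, so a scraper that depends on it breaks
the moment we emit markup that is not well-formed; browsers only
later gained responseType = "document", which can parse text/html.)

  // Force XML parsing of the response; responseXML stays null unless
  // the page is well-formed XML.
  const xhr = new XMLHttpRequest();
  xhr.open("GET", "https://en.wikipedia.org/wiki/Main_Page");
  xhr.overrideMimeType("application/xhtml+xml");
  xhr.onload = () => {
    const doc: Document | null = xhr.responseXML;
    if (doc === null) {
      // One stray unescaped ampersand anywhere on the page lands here.
      console.error("not well-formed XML; the scraper breaks");
    }
  };
  xhr.send();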
Received on Monday, 17 May 2010 19:34:05 UTC
