- From: Aryeh Gregor <Simetrical+w3c@gmail.com>
- Date: Mon, 17 May 2010 15:27:15 -0400
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: Sam Ruby <rubys@intertwingly.net>, public-html@w3.org, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Boris Zbarsky <bzbarsky@mit.edu>, Daniel Glazman <daniel.glazman@disruptive-innovations.com>
On Mon, May 17, 2010 at 10:55 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> Do you know if people actually process your pages (as opposed to your feeds) using off-the-shelf XML parsers without any prior arrangement with you?

As I've noted before, this does actually happen to Wikipedia. We get immediate complaints when we serve a significant number of non-well-formed pages. People write tools to help them edit, automating common tasks, including bots that run completely autonomously.

For our part, we prefer people use the machine-readable API instead of screen-scraping, because bots that rely on parsing the actual web pages tend to break when we change minor things, but people screen-scrape anyway. It would be nice if people could use HTML5 parsers instead, but that's simply not possible yet. In one case, a tool that broke was actually using AJAX to scrape the page -- and browsers don't yet support text/html parsing for AJAX. More time will be needed before this use case is actually obsolete, unfortunately.
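To make the constraint concrete, here is a minimal sketch (not any real tool's code) of the two routes described above. It is written with today's fetch API for brevity; the tools of the time used XMLHttpRequest, but the parsing constraint is the same: without an HTML parser exposed to scripts, an in-browser scraper can only get a parsed DOM by treating the page as XML, so a single non-well-formed page breaks it, while the api.php route returns a stable machine-readable format. The article title, selector, and query parameters are illustrative assumptions.

```typescript
// Sketch only: why AJAX-based scraping depends on well-formed markup,
// and what the API-based alternative looks like.

// Screen-scraping route: fetch the rendered page and parse it as XML.
// There is no text/html parsing mode available to the script, so one
// well-formedness error in the served page yields a parse failure and
// the tool breaks.
async function scrapeHeading(title: string): Promise<string | null> {
  const res = await fetch(
    `https://en.wikipedia.org/wiki/${encodeURIComponent(title)}`
  );
  const doc = new DOMParser().parseFromString(await res.text(), "application/xml");
  // XML parse errors surface as a <parsererror> element (details vary by browser).
  if (doc.getElementsByTagName("parsererror").length > 0) {
    return null; // page was not well-formed XML -- the scraper breaks here
  }
  return doc.getElementsByTagName("h1")[0]?.textContent ?? null;
}

// API route: ask api.php for machine-readable output instead. Minor changes
// to the page markup do not affect this at all.
async function fetchViaApi(title: string): Promise<unknown> {
  const url =
    "https://en.wikipedia.org/w/api.php?action=query&prop=info&format=json" +
    "&origin=*" + // anonymous CORS, so the call works from a browser
    "&titles=" + encodeURIComponent(title);
  const res = await fetch(url);
  return res.json();
}
```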
Received on Monday, 17 May 2010 19:34:05 UTC