- From: Dan Connolly <connolly@w3.org>
- Date: Wed, 13 Dec 2006 10:59:10 -0500
- To: Fabien Gandon <Fabien.Gandon@sophia.inria.fr>
- Cc: public-grddl-wg <public-grddl-wg@w3.org>
On Dec 13, 2006, at 9:49 AM, Fabien Gandon wrote:

> Harry Halpin wrote:
>> 4. GRDDL and (non-XML) HTML
>>    + ACTION: Fabien to add a tidy/tag-soup use case/paragraph,
>>      with caveats
> I committed a first draft:
> http://jigedit.w3.org/fgandon/WWW/2001/sw/grddl-wg/doc43/scenario-gallery.htm#html_tidy_use_case

aka
http://www.w3.org/2001/sw/grddl-wg/doc43/scenario-gallery.htm#html_tidy_use_case

I think this does a really nice job without blurring what GRDDL is:

  "Because most of these web pages are HTML and not XHTML and
  because most of the time they are not even valid HTML, the script
  first checks if each page is a well-formed XML document. If the
  page is indeed a well-formed XML document the script just calls a
  GRDDL agent on this page to extract metadata it may contain. If the
  page is not a well-formed XML document the script proceeds with
  calling an HTML-tidying tool that retrieves the page, cleans it the
  best it can and provides an XHTML version. The script saves these
  XHTML versions locally making sure that the base URL of each local
  copy is specified and if not the script sets it to the URL of the
  initial HTML page. Finally the script calls a GRDDL agent on each
  local copy to extract the metadata they may contain."

I'm not sure I like having "scraping" in the section heading...

  Use case #8 - Scraping the web: Steffen wants to build a directory
  of the people he works with.

But I guess this _is_ scraping... hmm...

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
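
For readers of the archive, the procedure in the quoted draft can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script from the use case: `fetch`, `tidy` and `grddl_agent` are hypothetical callables standing in for an HTTP client, an HTML-tidying tool and a GRDDL agent, and a plain dict stands in for the locally saved copies. Unlike the draft, where the tidying tool retrieves the page itself, `tidy` here is handed the already fetched text.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def ensure_base(xhtml, page_url):
    """Return xhtml with a base URL set, defaulting to page_url.

    Note: re-serializing with ElementTree drops any DOCTYPE declaration.
    """
    root = ET.fromstring(xhtml)
    # Reuse the root's namespace (if any) when looking up head/base.
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    head = root.find(ns + "head")
    if head is None:
        head = ET.Element(ns + "head")
        root.insert(0, head)
    base = head.find(ns + "base")
    if base is None:
        base = ET.Element(ns + "base")
        head.insert(0, base)
    if not base.get("href"):
        base.set("href", page_url)
    return ET.tostring(root, encoding="unicode")


def extract_metadata(page_url, fetch, tidy, grddl_agent, local_copies):
    """Run the (hypothetical) GRDDL agent on a page, tidying it first if needed."""
    source = fetch(page_url)
    if is_well_formed(source):
        # Well-formed XML: hand the page to the GRDDL agent as it is.
        return grddl_agent(source, page_url)
    # Tag soup: tidy to XHTML, pin the base URL, keep a local copy.
    xhtml = ensure_base(tidy(source), page_url)
    local_copies[page_url] = xhtml
    return grddl_agent(xhtml, page_url)
```

A strict XML parser such as this one may also reject pages that rely on named entities like `&nbsp;`, in which case they simply take the tidy path; and if the tidying tool's own output is not well-formed, `ensure_base` raises `ET.ParseError` rather than guessing.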
Received on Wednesday, 13 December 2006 15:59:17 UTC