GRDDL and non-XML HTML [was: Agenda ...] from Dan Connolly on 2006-12-13 (public-grddl-wg@w3.org from December 2006)

From: Dan Connolly <connolly@w3.org>
Date: Wed, 13 Dec 2006 10:59:10 -0500
To: Fabien Gandon <Fabien.Gandon@sophia.inria.fr>
Cc: public-grddl-wg <public-grddl-wg@w3.org>
Message-Id: <f8c3120f4d99fbe0a84d8f98a08b843b@w3.org>

On Dec 13, 2006, at 9:49 AM, Fabien Gandon wrote:
> Harry Halpin wrote:
>>     4. GRDDL and (non-XML) HTML
>>           + ACTION:Fabien to add a tidy/tag-soup use case/paragraph,
>>             with caveats
> I committed a first draft:
> http://jigedit.w3.org/fgandon/WWW/2001/sw/grddl-wg/doc43/scenario-gallery.htm#html_tidy_use_case

aka
http://www.w3.org/2001/sw/grddl-wg/doc43/scenario-gallery.htm#html_tidy_use_case

I think this does a really nice job without blurring what GRDDL is:

"Because most of these web pages are HTML and not XHTML and because  
most of the time they are not even valid HTML, the script first checks  
if each page is a well-formed XML document. If the page is indeed a  
well-formed XML document the script just calls a GRDDL agent on this  
page to extract metadata it may contain.

If the page is not a well-formed XML document the script proceeds with  
calling an HTML-tidying tool that retrieves the page, cleans it the  
best it can and provides an XHTML version. The script saves these XHTML  
versions locally making sure that the base URL of each local copy is  
specified and if not the script sets it to the URL of the initial HTML  
page. Finally the script calls a GRDDL agent on each local copy to  
extract the metadata they may contain."
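
To make the two branches concrete, the flow in the quoted draft might be sketched in Python roughly as follows. This is only an illustration under assumptions: `tidy` and `grddl_agent` are hypothetical callables standing in for whatever HTML-tidying tool and GRDDL agent the script uses, the tidied output is assumed to be in the XHTML namespace, and writing the local copy to disk is left out. All function names are made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def ensure_base(xhtml, page_url):
    """Give the document a <base href> of page_url unless it already has a base."""
    # Serialize XHTML elements without a prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)
    head = root.find("{%s}head" % XHTML_NS)
    if head is None:
        head = ET.Element("{%s}head" % XHTML_NS)
        root.insert(0, head)
    if head.find("{%s}base" % XHTML_NS) is None:
        base = ET.Element("{%s}base" % XHTML_NS, {"href": page_url})
        head.insert(0, base)
    return ET.tostring(root, encoding="unicode")


def extract(page_url, text, tidy, grddl_agent):
    """Run the GRDDL agent directly on well-formed XML, else tidy first.

    tidy(text) -> XHTML string and grddl_agent(document, url) -> result
    are hypothetical placeholders supplied by the caller.
    """
    if is_well_formed(text):
        return grddl_agent(text, page_url)
    xhtml = ensure_base(tidy(text), page_url)
    return grddl_agent(xhtml, page_url)
```

Note that the GRDDL agent itself only ever sees well-formed XML in either branch; the tidying happens in the script, before the agent is called.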

I'm not sure I like having "scraping" in the section heading...

  Use case #8 - Scraping the web: Steffen wants to build a directory of  
the people he works with.

But I guess this _is_ scraping... hmm...

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/

Received on Wednesday, 13 December 2006 15:59:17 UTC