RE: Parsing local html files

From: Tim Serong <tim.serong@conceiva.com>
Date: Tue, 5 Aug 2003 09:47:59 +1000
Message-ID: <8501919721C3DE4C81BCA22846B08721BCAF@lazarus.conceiva.com>
To: "Subramanyam Mallela" <mallela@parc.com>, <www-lib@w3.org>

Hi,

The simplest thing to do is supply a file URL (something like
file:///foo or file:///c:/foo.txt on Windows) for the Request, rather
than an HTTP URL.  libwww should then read the file from disk.
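If you only have a local path, the file URL can be assembled by hand. The following is a minimal sketch, not libwww code: path_to_file_url is a hypothetical helper name, it assumes an absolute path, and it does no percent-encoding of special characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libwww): build a file URL from an
 * absolute local path. Backslashes become forward slashes, so
 * "c:\foo.txt" gives "file:///c:/foo.txt" and "/foo" gives "file:///foo".
 * No percent-encoding is attempted.
 * Returns 0 on success, -1 if the result does not fit in out. */
static int path_to_file_url(const char *path, char *out, size_t outlen)
{
    const char *prefix;
    int n;
    char *p;

    assert(path != NULL && out != NULL);

    /* POSIX absolute paths already start with '/'; drive paths do not. */
    prefix = (path[0] == '/') ? "file://" : "file:///";

    n = snprintf(out, outlen, "%s%s", prefix, path);
    if (n < 0 || (size_t)n >= outlen)
        return -1;

    /* Normalise Windows separators. */
    for (p = out; *p != '\0'; ++p)
        if (*p == '\\')
            *p = '/';
    return 0;
}
```

The resulting string is what you would hand to the Request in place of an HTTP URL.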

Alternatively, you can hack up something like this, which feeds the file
to the HTML parser stream yourself (please excuse the C++ style):

    // Declare HTStream, so you can write to it directly
  typedef struct _HTStream
  {
    HTStreamClass * isa;
  } HTStream;

  ...

  HText_registerLinkCallback(myFoundLink);
    // register any other required callbacks here
  HTRequest * r = HTRequest_new();
    // this base URL will be used for resolving links
    // in the file being parsed
  HTRequest_setAnchor(r, HTAnchor_findAddress("http://baseurl/"));
  HTStream * parseStream = HTMLPresent(r, 0, WWW_HTML, WWW_PRESENT, 0);
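  FILE * fp = fopen("the file", "rb");
  char buf[4096];
  size_t bytes;
    // fread returns 0 at end of file or on error, so test its result
    // rather than looping on feof(); also skip the loop if fopen failed
  while (fp && (bytes = fread(buf, 1, sizeof buf, fp)) > 0)
  {
    (*parseStream->isa->put_block)(parseStream, buf, (int) bytes);
  }
  if (fp) fclose(fp);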
  (*parseStream->isa->_free)(parseStream);
  HTRequest_delete(r);

With this approach you should not even need to initialize the library
itself. If you skip initialization, though, you'll have to free some
things manually at the end, at the very least:

  HTAnchor_deleteAll(0);
  HTAtom_deleteAll();

If you want to parse another file with a different base URL, but without
creating a new request object each time, free the stream, change the
anchor, create the stream again, then write data to it via put_block:

  (*parseStream->isa->_free)(parseStream);
  HTRequest_setAnchor(r, HTAnchor_findAddress("http://somethingelse/"));
  parseStream = HTMLPresent(r, 0, WWW_HTML, WWW_PRESENT, 0);
  (*parseStream->isa->put_block)(...);

Using the above method on Windows, I only had to link against wwwcore,
wwwdll, wwwhtml and wwwutils, rather than all the libraries.  There may
of course be more elegant solutions...
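
If you parse several files this way, the read loop can be factored into one routine. Below is a rough, libwww-independent sketch of that idea. The names chunk_sink, feed_file, feed_stats and count_sink are made up for illustration, and the counting sink only stands in for a real consumer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sink type: receives one chunk, returns 0 to continue. */
typedef int (*chunk_sink)(void *ctx, const char *buf, size_t len);

/* Read fp to the end in fixed-size chunks and hand each chunk to sink.
 * Returns 0 on success, -1 on a read error or if the sink asks to stop. */
static int feed_file(FILE *fp, chunk_sink sink, void *ctx)
{
    char buf[4096];
    size_t n;

    assert(fp != NULL && sink != NULL);

    /* fread returns 0 both at end of file and on error;
     * ferror() below tells the two cases apart. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        if (sink(ctx, buf, n) != 0)
            return -1;
    }
    return ferror(fp) ? -1 : 0;
}

/* Example sink for illustration only: counts chunks and bytes. */
struct feed_stats { size_t chunks; size_t bytes; };

static int count_sink(void *ctx, const char *buf, size_t len)
{
    struct feed_stats *s = (struct feed_stats *)ctx;
    (void)buf;  /* contents are ignored by this sink */
    s->chunks++;
    s->bytes += len;
    return 0;
}
```

In the libwww case the sink would be a small wrapper that receives the stream as ctx and calls (*parseStream->isa->put_block)(parseStream, buf, (int) len).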

Regards,

Tim Serong
-- 
tim.serong@conceiva.com
http://www.conceiva.com


> -----Original Message-----
> From: Subramanyam Mallela [mailto:mallela@parc.com]
> Sent: Tuesday, 5 August 2003 07:30
> To: www-lib@w3.org
> Subject: Parsing local html files
> 
> 
> 
> 
> Hi
>     how can I use the libwww HTML parser for 
>     parsing local files on the disk. 
>     I don't need to download and use rest of the 
>     code ?
>     Is there any example code for this.
> 
>     Thanks for any help
>     Manyam
> 
> 
Received on Monday, 4 August 2003 19:44:23 GMT
