Re: Stream stacks.

Pim Lemmens writes:

> Currently I am trying to add robot exclusion support to our fish search
> program. In order to find out what objects are not accessible to robots
> at a given server, the program should load the 'robots.txt' file from
> the server root directory and interpret it. I have incorporated the
> interpreter in an (unstructured) stream that is supposed to receive its
> input in "text/plain" format. A robot exclusion stream object is created
> by
> HTStream* RobExStream_new(char* robot, Queue* NoGoList);
> that reads an ASCII stream and produces a list of forbidden addresses. It
> all has been tested OK, but I can't manage to make it work in an actual
> application using the www library. What is wrong with the following:
>     .....
>     HTRequest_setAnchor(request, HTAnchor_findAddress(URL));
>     HTRequest_setOutputFormat(request, WWW_PLAINTEXT);
>     HTRequest_setOutputStream(request, RobExStream_new("fish-search",
>                                                        &((*SE)->NoGoList)));
>     .....
>     dest = (HTAnchor*)HTRequest_anchor(request);
>     .....
>     if (!HTLoadAnchor(dest, request)) /* request not accepted */
>       status = ExceptionHandler(request);
>     else
>       status = HT_OK;
>     .....
> Stream tracing produces the following:
> ...........
> HTNet_new... starting request 7bd30 (retry=2) with net object 7c460
> StreamStack. Constructing stream stack for text/html to text/plain
> HTNetDelete. Object and call callback functions
> HTNetDelete. 7c528 not registered!
> HTNet_delete Remove net object 7c528
> HTNet_delete closing 5
> HTNetDelete. Object and call callback functions
> HTNet_delete Remove net object 7c460
> ...........
> Any idea where that unregistered object comes from?

There are several ways of handling this. First a question from my side - does the server use a special content type when it hands off the robots.txt file, or do you have to guess? The reason for asking is that it affects the way you would set it up.

If there is a special content type, say "application/x-www-robots", then you can register your stream as a converter (see the definition in HTFormat.html) with the capability of "converting" the contents of the robots.txt file to an internal data structure. If the server does not provide this information, then there are three alternatives:

1) Still use your stream as a converter, but instead of converting to "application/x-www-robots", use it to convert to "text/plain". Your converter will then get called every time you get a plain text object, and you can look for magic words to see whether it is indeed a robots file.

2) You can register the HTGuess stream as the plain text converter and then add any magic words to this stream. This works exactly like 1), except that your stream does not have to do the "guessing" itself. The guess stream is defined in HTGuess.c

3) You can use your stream as a generic stream (which is what you have now). However, by adding it to the request object directly you make two assumptions, both of which must hold:

	a) You assume that this request will indeed give you a file containing robot
	information. In fact you do this by looking at the URL, which may not be a good
	idea. Having the server label the robots.txt file with a Content-Type header
	is a much more generic solution, as it keeps the URL opaque.

	b) As you explicitly ask the library to put the data into your stream, you
	must also ask it to bypass the stream stack algorithm (which you are using
	in 1) and 2)). You do this by telling the library to give you the _source_,
	which is the original content of the object:

		HTRequest_setOutputFormat(request, WWW_SOURCE);

Another thing is how you actually use the information in the robots.txt file.
I suggest that you register a BEFORE callback function in the HTNet manager to handle the robots information. That way, the function will get called every time a request is set up, but before we actually do any interaction with the remote host. In your before callback you can then check whether the URL is allowed and return HT_OK or HT_ERROR as the result. In case of HT_ERROR, the request stops immediately and the remote host is never contacted.


Henrik Frystyk Nielsen, <frystyk@w3.org>
World-Wide Web Consortium, MIT/LCS NE43-356
545 Technology Square, Cambridge MA 02139, USA