Some initial questions from Dennis Gallagher on 1996-08-20 (www-lib@w3.org from July to September 1996)

From: Dennis Gallagher <galron@seanet.com>
Date: Tue, 20 Aug 1996 11:40:33 -0700
To: "'WWW Library discussion group'" <www-lib@w3.org>
Message-ID: <01BB8E8C.6A32A450@galron.seanet.com>

Hello to all.  I've just joined this discussion group and I have some questions I hope someone might answer.  

I'm writing a Win32 program in C using VC++ 4.x which will download HTML pages into local buffers where I will parse data out of them.  As part of this, I've read the HTTP 1.0 document from end to end.

My first effort was successful.  I was able to connect a TCP/IP port to a remote host and then to use the HEAD and GET methods to determine the web page's size and download it.  My next effort was to try to access a web page behind a Basic Authentication barrier.  Here I've run into problems that I don't understand.
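
As an illustration of the "determine the page's size" step, here is a minimal sketch of pulling the status code and the Content-Length value out of a response header. It assumes the whole header block has already been read into a NUL-terminated buffer. The names parse_status and parse_content_length are made up for this example and are not part of libwww or any other library.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse the status code from a line such as "HTTP/1.0 200 OK".
   Returns the code, or -1 if the text does not start with a status line. */
static int parse_status(const char *head)
{
    int major, minor, code;

    assert(head != NULL);
    if (sscanf(head, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
        return -1;
    return code;
}

/* Look for a header line starting with "Content-Length:" (any letter case)
   and return its value. Returns -1 if no such line appears before the
   blank line that ends the header block. */
static long parse_content_length(const char *head)
{
    static const char name[] = "content-length:";
    const char *line = head;

    assert(head != NULL);
    while (line != NULL && *line != '\0') {
        size_t i = 0;

        /* A blank line ends the headers; do not scan into the body. */
        if (line[0] == '\n' || (line[0] == '\r' && line[1] == '\n'))
            break;

        while (name[i] != '\0' &&
               tolower((unsigned char)line[i]) == name[i])
            i++;
        if (name[i] == '\0')
            return strtol(line + i, NULL, 10);  /* skips leading spaces */

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}
```

Both functions describe only the response they are given: the length returned is whatever that particular response's header states.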

I send a request with the HEAD method and the remote responds that I am not authorized, as expected.  I then quiz my user for the ID and Password, encode them with the Base64 method and resubmit my request as a GET method with credentials attached.  The remote responds with 200 (OK) and I read the page in.  The problem is I always get only part of the page.  On the original response, the server tells me how big the page will be as well as telling me I'm not authorized.  It is this page size that is always short.  Sometimes it is only 200 to 300 bytes, other times it is 1600 or so, but it is always only part of what I'm expecting.
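
For reference, the credential step described above (Base64-encoding "user:password" and attaching it to the request) might look roughly like the sketch below. The function names and the fixed buffer sizes are assumptions made for this example, and snprintf is standard C99, so an older compiler may need a different call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Encode len bytes from src as Base64 into dst, NUL-terminated.
   Returns 0 on success, -1 if dst is too small. */
static int base64_encode(const unsigned char *src, size_t len,
                         char *dst, size_t dstsize)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t need = 4 * ((len + 2) / 3);
    size_t i, o = 0;

    assert(src != NULL && dst != NULL);
    if (dstsize < need + 1)
        return -1;

    /* Full 3-byte groups become 4 output characters each. */
    for (i = 0; i + 2 < len; i += 3) {
        dst[o++] = tbl[src[i] >> 2];
        dst[o++] = tbl[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
        dst[o++] = tbl[((src[i + 1] & 0x0f) << 2) | (src[i + 2] >> 6)];
        dst[o++] = tbl[src[i + 2] & 0x3f];
    }
    /* One or two leftover bytes are padded with '='. */
    if (i < len) {
        dst[o++] = tbl[src[i] >> 2];
        if (i + 1 < len) {
            dst[o++] = tbl[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
            dst[o++] = tbl[(src[i + 1] & 0x0f) << 2];
        } else {
            dst[o++] = tbl[(src[i] & 0x03) << 4];
            dst[o++] = '=';
        }
        dst[o++] = '=';
    }
    dst[o] = '\0';
    return 0;
}

/* Build the line "Authorization: Basic <base64 of user:pass>\r\n" in out.
   Returns 0 on success, -1 if any buffer is too small. */
static int make_basic_auth_header(const char *user, const char *pass,
                                  char *out, size_t outsize)
{
    char plain[256];   /* "user:pass"; size chosen for this sketch */
    char b64[352];     /* large enough for 255 encoded bytes */
    int n;

    n = snprintf(plain, sizeof plain, "%s:%s", user, pass);
    if (n < 0 || (size_t)n >= sizeof plain)
        return -1;
    if (base64_encode((const unsigned char *)plain, (size_t)n,
                      b64, sizeof b64) != 0)
        return -1;
    n = snprintf(out, outsize, "Authorization: Basic %s\r\n", b64);
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    return 0;
}
```

The resulting line would be placed among the request headers, before the blank line that ends the request.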

Many times, after I've just received this partial page, I request the same page using NS or IE so I can look at the source and compare what I got vs what they got.  Sometimes, it looks like I have exactly what they got but short.  Other times, it looks like I have a page which is similar but not identical to what I expected.

Needless to say, all of this is quite baffling.  The site I'm trying to access provides real time stock quotes and the page I'm trying to download is:

http://mw.dbc.com/cgi-bin/htx.exe/mw/main.html

Is there something in this URL that might be a clue as to why I'm having problems?

For a while, I thought maybe my problem was that I was not escaping unacceptable characters in the URL, but I'm doing it now and it has made no difference.
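
A rough sketch of the kind of escaping meant here is given below. The set of characters passed through unchanged (letters, digits, and "/", "-", "_", ".") is an assumption made for this example, and escape_path is a made-up name. The function escapes every other byte as %XX, so it is meant for a raw path only and would double-escape a path that already contains %XX sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy path into out, replacing every byte outside the pass-through set
   with %XX (uppercase hex). Returns 0 on success, -1 if out is too small. */
static int escape_path(const char *path, char *out, size_t outsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    assert(path != NULL && out != NULL);
    for (; *path != '\0'; path++) {
        unsigned char c = (unsigned char)*path;
        int keep = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                   (c >= '0' && c <= '9') || strchr("/-_.", c) != NULL;

        if (keep) {
            if (o + 1 >= outsize)      /* room for c and the final NUL */
                return -1;
            out[o++] = (char)c;
        } else {
            if (o + 3 >= outsize)      /* room for %XX and the final NUL */
                return -1;
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0f];
        }
    }
    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return 0;
}
```

With this pass-through set, the path /cgi-bin/htx.exe/mw/main.html comes out unchanged, since it contains nothing that needs escaping.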

This is getting long so I'll wrap up.  My method is basically:

connect to server
escape unacceptable chars in the page path
form HEAD request
send HEAD request
alloc small buffer for response
read response
check status code for OK or unauthorized
if (unauthorized)
	get ID & password
endif
get page size from response
free small buffer
reconnect to server
form GET request (with credentials if nec)
send request
alloc buffer for incoming page
do first read
check status code for OK
if (bufferfull)
	exit
endif
loop
	read more page in (advancing pointers)
	if (charsread=0)
		exit
	endif
endloop
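
The read loop at the end of this outline can be sketched in C as follows. This is only a sketch under stated assumptions, not the program described above: the reader is passed in as a callback with recv()-like return values so the loop can be exercised without a network, the buffer is grown as data arrives instead of being sized from an earlier response, and reading continues until the source reports end of stream. read_all, read_fn, mem_source and mem_read are names invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reader callback with recv()-like results: the number of bytes stored
   in buf, 0 once the peer has closed, or a negative value on error. */
typedef long (*read_fn)(void *ctx, char *buf, size_t cap);

/* Read until end of stream, doubling the buffer whenever it fills.
   On success returns a malloc'd buffer (caller frees) and stores the
   byte count in *out_len. Returns NULL on read or allocation failure. */
static char *read_all(read_fn rd, void *ctx, size_t *out_len)
{
    size_t cap = 1024, len = 0;
    char *buf;

    assert(rd != NULL && out_len != NULL);
    buf = (char *)malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        long n;

        if (len == cap) {                       /* buffer full: grow it */
            char *tmp = (char *)realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        n = rd(ctx, buf + len, cap - len);      /* append after old data */
        if (n < 0) {
            free(buf);
            return NULL;
        }
        if (n == 0)                             /* peer closed: done */
            break;
        len += (size_t)n;
    }
    *out_len = len;
    return buf;
}

/* Hypothetical in-memory source that hands out data in small chunks,
   standing in for a socket so the loop can be tried offline. */
struct mem_source {
    const char *data;
    size_t len, pos, chunk;
};

static long mem_read(void *ctx, char *buf, size_t cap)
{
    struct mem_source *s = (struct mem_source *)ctx;
    size_t n = s->len - s->pos;

    if (n > s->chunk) n = s->chunk;
    if (n > cap) n = cap;
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return (long)n;
}
```

With a real socket, the callback would wrap recv() on the connected socket and pass its return value through.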

I realize that libwww is supposed to do much of this low-level stuff for me, but I had begun this project before I discovered it.  I may switch over (I have some questions I'll post in a different message) but I'd like to know first why this isn't working.

Thanks,

Dennis Gallagher
galron@seanet.com

Received on Tuesday, 20 August 1996 14:41:47 UTC