- From: Matteo Pampolini <Matteo.Pampolini@italtel.it>
- Date: Fri, 03 Dec 1999 14:56:02 +0100
- To: "http-wg@cuckoo.hpl.hp.com" <http-wg@cuckoo.hpl.hp.com>
- Message-ID: <3847CBF1.8FDCE942@medius.settimo.italtel.it>
Hi everybody, this is Matteo Pampolini again from Italy. A few months ago I wrote to this mailing list intending to propose a method for tracing broken links. Thank you very much to everyone who replied to my e-mail; unfortunately I was then involved in another project that kept me very busy until a few days ago. I have now prepared a very rough draft document with the basis of my idea: I ask anyone interested to start a discussion, and of course feel free to send me any comments and remarks.

Good work to everyone,

--
Matteo Pampolini (matteo.pampolini@italtel.it)
Strategic Planning and Innovation - Advanced Research
Italtel S.p.A.
20019 Settimo Milanese (Mi) Italy
Phone +39.02.4388.9375
Fax +39.02.4388.9120

...but gravity always wins (Radiohead)
M. Pampolini
Advanced Research Laboratories
Italtel S.p.A.
December 1999

A proposed method for tracing broken links

Preface

This is intentionally a draft and informal document, whose aim is to focus on a problem that, in my personal opinion, is growing faster and faster with the expansion of the Internet, and whose effects can be experienced every day when surfing the Net. That is why I decided to think about it and to define a very first proposal that tries to resolve this problem, at least partially, conscious that there is a great deal of room for improvement (especially on the efficiency side, and I really hope there will be improvements) and that many issues are still open. Unfortunately, though I try to keep in touch, as far as I can, with the evolution of the World Wide Web in terms of architecture and new protocols, I am not currently involved in any IETF working group: that is why I would like to have the widest possible feedback on this topic from the members of the HTTP mailing list interested in it.

The problem

Every day, when surfing the Net, we follow hyperlinks. As new sites are added daily to the worldwide database that is the Internet, others are deleted or moved. This leads to great dynamism in information management, but also to HTML documents that refer to non-existent URLs, producing the well-known "404 - Not Found" HTTP error. This phenomenon was quite rare only a few months ago, but now the error appears more and more frequently on my (and, I suppose, everyone's) browser.

A possible solution

Let me first say that the proposed solution requires modifications on both the browser side and the HTTP server side: this is probably its major drawback, but it is possible that, with further contributions, these modifications could be avoided on at least one side. For now we will focus on static HTML pages, not created, for example, through CGI scripts or similar mechanisms, but located in the usual tree structure of personal pages in the server space.

So, imagine you have just downloaded from Web server S1 an HTML page, say P1, that contains a hyperlink to page P2 on server S2: you click on it and error 404 is reported. At first glance this could mean that page P2 no longer exists in the space of server S2. Currently, at this point you cannot do anything but follow another link, and another surfer on the other side of the world experiences the same behaviour. Things would go better if your browser could inform Web server S1 that the URL referring to page P2 is invalid or "broken": this could be done through a special HTTP action. Server S1 would then keep track of this URL and, if possible, of the page that referenced it. Of course, you cannot tell whether page P2 really no longer exists or whether some other problem occurred, so server S1 does not immediately decide on any action. But if, for example, a reference counter is associated with this broken URL and other browsers, at times reasonably distant from each other, say within a one-week period, inform server S1 that the URL of page P2 is unreachable, then there is a good chance that page P2 has been cut off.
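To make the bookkeeping on server S1 concrete, here is a minimal sketch in Python. It is purely illustrative: the proposal leaves the exact HTTP action open (the report could arrive, for instance, as a POST to some reporting resource on S1), and the one-week window, the threshold of three reports and the name BrokenLinkRegistry are assumptions of mine, not part of the idea itself.

    # Illustrative bookkeeping for broken-link reports on server S1.
    # Window length and threshold are assumed values, chosen only to
    # make the "reference counter" idea concrete.
    import time
    from collections import defaultdict

    WINDOW_SECONDS = 7 * 24 * 3600   # "say within a one-week period"
    REPORT_THRESHOLD = 3             # reports needed before acting

    class BrokenLinkRegistry:
        def __init__(self):
            # (referring page, broken URL) -> timestamps of the reports
            self._reports = defaultdict(list)

        def report(self, referring_page, broken_url, now=None):
            """Record one browser's report that broken_url answered 404."""
            now = time.time() if now is None else now
            key = (referring_page, broken_url)
            # Keep only reports that still fall inside the time window.
            self._reports[key] = [t for t in self._reports[key]
                                  if now - t <= WINDOW_SECONDS]
            self._reports[key].append(now)

        def probably_gone(self, referring_page, broken_url):
            """True once enough reports arrived within the window."""
            return len(self._reports[(referring_page, broken_url)]) >= REPORT_THRESHOLD

    # Example: three reports about the same link within a week.
    registry = BrokenLinkRegistry()
    for _ in range(3):
        registry.report("http://S1/P1.html", "http://S2/P2.html")
    assert registry.probably_gone("http://S1/P1.html", "http://S2/P2.html")

Once such a condition holds, S1 could trigger one of the actions listed below.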
Now server S1 could do several things:

* Automatically send an e-mail to the webmaster informing him of the broken link; the webmaster could in turn possibly ask the author of page P1 to remove the link.

* Find a way to inform the next surfer who loads page P1 that the referenced page P2 is unavailable. This could be done by parsing page P1 before sending it and adding a new tag, say <BROKEN>, after the anchor (<A HREF=...>) tag in page P1.

The browser in turn parses the page and, when it finds a broken link, displays it in a different fashion, meaning "unavailable, don't click on it", or does not display it at all. Having the Web server parse the page could lead to efficiency problems, so it could instead simply send, perhaps formatted as an HTML document, the list (if one exists) of links referenced by a particular page that appear to be unavailable. The browser receives such a list and behaves as before. (A rough sketch of this list-based variant is appended after the conclusion.)

Conclusion

In this document I have tried to give the basic idea of a mechanism for avoiding links that are no longer available. Of course many other problems arise from this technique, and not every case is addressed. However, this is only meant to be a starting point for a constructive discussion that, in my optimistic prediction, could lead to a more sophisticated solution to the problem.
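As a closing, purely illustrative sketch, this is how the list-based variant mentioned above might look on server S1's side. The exact format of the list is deliberately left open by the proposal; the function name and the choice of a flat HTML <ul> below are my own assumptions.

    # Illustrative formatting of the "broken links" list for a page,
    # sent to the browser instead of a rewritten copy of the page itself.
    from html import escape

    def broken_links_document(page_links, broken_urls):
        """Return a small HTML document listing the links of a page
        that server S1 currently believes to be broken."""
        broken = [url for url in page_links if url in broken_urls]
        items = "".join("<li>%s</li>" % escape(url) for url in broken)
        return "<html><body><ul>%s</ul></body></html>" % items

    # Example: S1 has flagged P2, but not P3, as broken for page P1.
    print(broken_links_document(
        ["http://S2/P2.html", "http://S3/P3.html"],
        {"http://S2/P2.html"}))

    # The browser would fetch such a list together with P1 and render the
    # matching anchors in a different fashion ("unavailable, don't click
    # on it"), or not render them at all.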