Please feedback: a proposed method for tracing broken links from Matteo Pampolini on 1999-12-03 (ietf-http-wg@w3.org from October to December 1999)

From: Matteo Pampolini <Matteo.Pampolini@italtel.it>
Date: Fri, 03 Dec 1999 14:56:02 +0100
To: "http-wg@cuckoo.hpl.hp.com" <http-wg@cuckoo.hpl.hp.com>
Message-ID: <3847CBF1.8FDCE942@medius.settimo.italtel.it>

Hi everybody,

this is Matteo Pampolini again from Italy. A few months ago I wrote
to this mailing list with the intention of proposing a method for
tracing broken links.

Thank you very much to everyone who replied to my e-mail, but
unfortunately I was then involved in another project that kept me
very busy until a few days ago.

I have now prepared a very rough draft document with the basis of my
idea: I ask anyone interested in it to start a discussion, and of
course to feel free to send me any comments and remarks.

Good work to everyone,

--
Matteo Pampolini (matteo.pampolini@italtel.it)

Strategic Planning and Innovation - Advanced Research

Italtel S.p.A. 20019 Settimo Milanese (Mi) Italy

Phone  +39.02.4388.9375  Fax  +39.02.4388.9120

*****************************************************
*                                                   *
* ...but gravity always wins (Radiohead)            *
*                                                   *
*****************************************************

M. Pampolini
Advanced Research Laboratories
Italtel S.p.A.
December 1999


A proposed method for tracing broken links

Preface

This is intentionally a draft and informal document, with the aim of
focusing on a problem that, in my personal opinion, is growing faster and
faster with the expansion of the Internet, and whose effects can be
experienced every day when surfing the Net.

That is why I decided to think about it and to define a very first
proposal to try to resolve this problem, at least partially. I am aware
that there is a great deal of room for improvement, especially on the
efficiency side (and I really hope improvements will come), and that many
issues are still open.

Unfortunately, though I try to keep in touch, as far as I can, with the
evolution of the World Wide Web in terms of architecture and new
protocols, I'm not actually involved in any IETF working group: that's
why I'd like to have as much feedback as possible on this topic from the
interested members of the HTTP mailing list.

The problem

Every day, when surfing the Net, we follow hyperlinks. However, as new
sites are added daily to the worldwide database that is the Internet,
others are deleted or moved. This makes information management very
dynamic, but it also leaves HTML documents that refer to non-existent
URLs, producing the well-known "404 - Not found" HTTP error.

I have to say that this phenomenon was quite rare only a few months ago,
but now this error appears more frequently on my (and I suppose
everyone's) browser.

A possible solution

Let's first say that the proposed solution requires modifications both
on the browser side and on the HTTP server side: this is probably the
major drawback, but it's possible that, with different contributions,
these modifications could be avoided on at least one side.

For now, we will focus on static HTML pages: pages that are not created,
for example, through CGI scripts or similar, but are located in the usual
tree structure of personal pages in the server space.

So, imagine you have just downloaded from Web server S1 an HTML page, say
P1, that contains a hyperlink to page P2 on server S2: you click on it
and error 404 is reported. At first sight this could mean that page P2
no longer exists in the space of server S2. Currently, at this point
you can't do anything but follow another link, and another surfer on the
other side of the world experiences the same behaviour.

Things would go better if your browser could inform the Web server S1
that the URL referring to page P2 is invalid or "broken": this could be
done through a special HTTP action.
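
The proposal does not say what this special HTTP action looks like. As a
rough sketch only, one could assume an ordinary POST to a reporting path
on S1. The path /broken-link-report, the form field names (page, link,
status) and the function name are all hypothetical and not part of the
proposal or of any existing standard.

```python
from urllib.parse import urlencode, urlsplit

# Hypothetical endpoint: the proposal does not define how the report is sent.
REPORT_PATH = "/broken-link-report"


def build_broken_link_report(referring_page: str, broken_url: str,
                             status: int = 404) -> str:
    """Return the raw HTTP/1.1 request a browser could send to server S1."""
    # The report goes to the server that served the referring page (S1).
    host = urlsplit(referring_page).netloc
    # Assumed field names: the page holding the link, the link itself,
    # and the status code the browser received when following it.
    body = urlencode({"page": referring_page, "link": broken_url,
                      "status": status})
    lines = [
        f"POST {REPORT_PATH} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body.encode('ascii'))}",
        "",
        body,
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_broken_link_report("http://s1.example/p1.html",
                                   "http://s2.example/p2.html"))
```

Reusing POST is only one option; a new method or a request header would
serve the same purpose, at the price of changes on both sides.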

Server S1 then keeps track of this URL and, if possible, of the page that
referenced it. In fact, one cannot tell whether page P2 really no longer
exists or some other problem occurred, so at this stage server S1 does
not decide on any action to perform.

But if, for example, a reference counter is associated with this broken
URL and other browsers, at times reasonably distant from each other, say
over a period of a week, inform server S1 that the URL of page P2 is
unreachable, then there's a good chance that page P2 has been removed.
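
The counting rule above could be sketched as follows. This is only an
illustration: the class name, the minimum spacing between counted reports
(six hours) and the threshold (three reports) are assumptions, since the
text fixes only the one-week period.

```python
from dataclasses import dataclass, field

WEEK = 7 * 24 * 3600  # observation window in seconds ("a week period")


@dataclass
class BrokenLinkTracker:
    min_gap: float = 6 * 3600  # assumed minimum spacing between counted reports
    threshold: int = 3         # assumed number of spaced reports required
    window: float = WEEK
    _times: dict = field(default_factory=dict)  # url -> counted timestamps
    _pages: dict = field(default_factory=dict)  # url -> referring pages seen

    def report(self, url: str, page: str, timestamp: float) -> None:
        """Record that a browser could not reach `url`, linked from `page`."""
        self._pages.setdefault(url, set()).add(page)
        times = self._times.setdefault(url, [])
        # Forget counted reports that fell out of the observation window.
        times[:] = [t for t in times if timestamp - t <= self.window]
        # Count this report only if it is far enough from the last counted one.
        if not times or timestamp - times[-1] >= self.min_gap:
            times.append(timestamp)

    def probably_broken(self, url: str) -> bool:
        """True once enough well-spaced reports have been counted for `url`."""
        return len(self._times.get(url, [])) >= self.threshold

    def referring_pages(self, url: str) -> set:
        """Pages on this server known to link to `url`."""
        return set(self._pages.get(url, set()))
```

Old reports are dropped only when a new report arrives, so a link that
stops being reported keeps its last verdict; a real server would probably
also expire entries on a timer.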

Now server S1 could do many things:


* Automatically send an e-mail to the Webmaster reporting the broken
link: the Webmaster could in turn ask the author of page P1 to remove
the link.

* Find a way to inform the next surfer who loads page P1 that the
referenced page P2 is unavailable. This could be done by parsing page P1
before sending it and adding a new tag, say <BROKEN>, after the
<A HREF=...> tag of the link in page P1. The browser in turn parses the
page and, when it finds a broken link, displays it in a different
fashion, meaning "unavailable, don't click on it", or doesn't display it
at all.
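
As a minimal sketch of the server-side marking step, assuming HREF is
read as an attribute of the <A> tag and that <BROKEN> is the hypothetical
new tag named above, a regular expression is enough for simple static
pages. The function name is made up for illustration, and a real server
would want a proper HTML parser.

```python
import re

# Matches an opening <A ...> tag and captures its HREF value (quoted or not).
ANCHOR = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']?([^"\'\s>]+)[^>]*>',
    re.IGNORECASE,
)


def mark_broken_links(html: str, broken_urls: set) -> str:
    """Insert the hypothetical <BROKEN> tag after each link to a broken URL."""
    def tag(match):
        # Leave the link untouched unless its target is known to be broken.
        if match.group(1) in broken_urls:
            return match.group(0) + "<BROKEN>"
        return match.group(0)

    return ANCHOR.sub(tag, html)


if __name__ == "__main__":
    page = '<P>See <A HREF="http://s2.example/p2.html">P2</A>.</P>'
    print(mark_broken_links(page, {"http://s2.example/p2.html"}))
```

The comparison is an exact string match on the HREF value, so relative
links would first have to be resolved against the URL of page P1.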


The process of parsing the page on the Web server could lead to
efficiency problems, so the server could instead simply send the list, if
there is one, of the links referenced by a particular page that appear to
be unavailable, perhaps formatted as an HTML document. The browser
receives this list and behaves as before.
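
One possible shape for such a list, purely as an assumption, is a small
HTML document with one list item per unavailable link. The function name
and the layout of the document are illustrative only.

```python
from html import escape


def render_broken_list(page_url: str, broken_urls: list) -> str:
    """Return an HTML document listing the unavailable links of `page_url`.

    Returns an empty string when there is nothing to report.
    """
    if not broken_urls:
        return ""
    # Escape every URL so that characters such as & stay valid in HTML.
    items = "\n".join(f"<LI>{escape(url)}</LI>" for url in broken_urls)
    title = f"Unavailable links in {escape(page_url)}"
    return (
        "<HTML>\n"
        f"<HEAD><TITLE>{title}</TITLE></HEAD>\n"
        "<BODY>\n"
        f"<UL>\n{items}\n</UL>\n"
        "</BODY>\n"
        "</HTML>\n"
    )


if __name__ == "__main__":
    print(render_broken_list("http://s1.example/p1.html",
                             ["http://s2.example/p2.html"]))
```

Sending the list separately keeps page P1 itself untouched, so the server
never has to parse it.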

Conclusion

In this document I have tried to give the basic idea of a mechanism for
avoiding links that aren't available anymore. Of course, many other
problems arise from this technique, and not all cases are covered.
However, this is just meant to be the starting point for a constructive
discussion that, in my optimistic expectation, could lead to a more
sophisticated solution to the problem.

Received on Friday, 3 December 1999 06:19:54 UTC