Issues On Outdated documents

Michele Bassan (michele@pdigi3.igi.pd.cnr.it)
Wed, 13 Dec 1995 10:47:19 +0100


Message-Id: <v01510102acf45031ca02@[150.178.3.72]>
Date: Wed, 13 Dec 1995 10:47:19 +0100
To: www-html@w3.org
From: michele@pdigi3.igi.pd.cnr.it (Michele Bassan)
Subject: Issues On Outdated documents

Dear colleagues,
I would like to contribute some ideas to the HTML community. I just
subscribed to this list, following a suggestion from an eminent contributor
to HTML development, and I hope I'm posting my comments in the correct
environment.
If not, please let me know.

In the following please find a list of problems I identified and the solutions
I'm proposing.

Problems identified:

A Often a document's content is outdated, and this could have been known at
  the very moment the document was produced.
B Often a document is moved to another location, and the only way to learn of
  the move is to read a warning message that the mover was kind enough to
  leave, with luck including a link to the new location; this does happen,
  but not very often.


Consequences:

A 1 bandwidth is lost to transfer the outdated document
  2 human time is lost retrieving the document and discovering, often after
    reading through part of it, that it is no longer valid/useful
  3 the databases of documents will only grow, adding to old information
    instead of replacing it with new data
  4 the database replies to the searches will be (they are already)
    increasingly unmanageable, despite any restrictive condition one can
    imagine to apply
  5 (second level consequence) the validity of the net for conveniently
    retrieving reliable information can be questioned.

B 1 valuable human time is lost to manually follow broken links, jumping around
    the net
  2 database replies will keep giving incorrect document locations for a
    long time

With some thinking, more bad effects can be added to this list, but I have
probably made the idea clear already.

Maybe (I'm not a Web techie, I write this just based on common sense)
the document aging problem has already been addressed by the various Web
crawlers, which repeatedly check for what has been thrown away and for what
is new, but I still think that such an approach is not the right solution.

Proposed solutions:

Preface
All the aging and location information I'm writing about is information
about the document itself. Therefore the correct place for it is within the
META elements.
I am personally interested, as a provider of Web pages, in fixing these
problems, or in finding out if and how they have already been fixed.

A 1 The document has a single definite expiry date; its contents are useless
    on any later date. I'm proposing to use the following syntax:
    <META NAME="EXPIRY" CONTENT="DD MMM YYYY">
    With this information document databases have the possibility to
    perform the following actions:
    - replying to a database search before the expiry date, they can display its
      expiry date together with the relevant document info
    - trash any document info after the expiry date
    - avoid adding to the database any newly discovered document already
      expired
    Web browsers can highlight this information for the user (together
    with the title?)
A 2 The document has some information with finite lifetime, but will likely be
    updated at some later time, (e.g. the program of a theatre).
    I'm proposing to use the following syntax:
    <META NAME="NEXT_UPDATE" CONTENT="DD MMM YYYY">
    With this information document databases have the possibility to
    perform the following actions:
    - replying to a database search before the update date, they can display
      the next update date together with the relevant document info
    - retrieve again the document info after the update date (e.g the index
      words might have changed, because also Puccini has been added to the
      theatre program)
    Web browsers can highlight this information for the user (together
    with the title?)

Of course the writer is not forced to provide this info, but by using these
META tags he can be sure that the document info will always be updated
regularly and right on time, with no indefinite delays.
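
As a rough illustration of how a document database might act on these two
tags, here is a minimal Python sketch. It is only one possible reading of the
proposal: the function names (parse_date, read_meta, index_action) and the
'drop' / 'refetch' / 'keep' labels are hypothetical, and it assumes English
three-letter month names in the "DD MMM YYYY" form.

```python
from datetime import date
from html.parser import HTMLParser

# Assumption: MMM is an English three-letter month abbreviation.
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def parse_date(text):
    """Parse the proposed 'DD MMM YYYY' form, e.g. '31 Dec 1995'."""
    day, month, year = text.split()
    return date(int(year), MONTHS[month[:3].upper()], int(day))


class MetaReader(HTMLParser):
    """Collect NAME/CONTENT pairs from META elements."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") and attrs.get("content") is not None:
                self.meta[attrs["name"].upper()] = attrs["content"]


def read_meta(html):
    reader = MetaReader()
    reader.feed(html)
    return reader.meta


def index_action(html, today):
    """Return 'drop', 'refetch', or 'keep' for one indexed document."""
    meta = read_meta(html)
    # "Trash any document info after the expiry date."
    if "EXPIRY" in meta and today > parse_date(meta["EXPIRY"]):
        return "drop"
    # "Retrieve again the document info after the update date."
    if "NEXT_UPDATE" in meta and today > parse_date(meta["NEXT_UPDATE"]):
        return "refetch"
    return "keep"
```

A crawler could call index_action(page_text, date.today()) on each stored
document; documents without either tag are simply kept, since the writer is
not forced to provide this info.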

B 1 A new small 'placeholder' document shall replace the old one, clearly with
    the same name. I'm proposing to use the following syntax:
    <META NAME="MOVED" CONTENT="http:etc etc">
    Instead of "http:...", "ftp:..." or whatever else is applicable can be
    used.
    With this information document databases have the possibility to
    perform the following action:
    - update the document links
    Web browsers can automatically follow the new link (highlighting the
    event to the user?), and if the old link was also a bookmark, update it.

If a document will presumably be moved at some time in the future, combining
an UPDATE attribute in the original document with a MOVED attribute in the
'placeholder' document will provide for an immediate update of database
pointers after the date indicated.
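
The MOVED handling could be sketched along the following lines, again in
Python and only as an assumption about how a browser or database might behave.
The names moved_target and resolve are hypothetical, fetch stands for any
function that returns the text of a document given its location, and the limit
on the number of hops is my own addition to guard against placeholders that
point at each other.

```python
from html.parser import HTMLParser


class MovedFinder(HTMLParser):
    """Remember the CONTENT of a <META NAME="MOVED" ...> element, if any."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").upper() == "MOVED":
            self.target = attrs.get("content")


def moved_target(html):
    """Return the new location named by a placeholder, or None."""
    finder = MovedFinder()
    finder.feed(html)
    return finder.target


def resolve(url, fetch, max_hops=5):
    """Follow MOVED placeholders starting at url; return the last location."""
    seen = {url}
    for _ in range(max_hops):
        target = moved_target(fetch(url))
        if not target or target in seen:  # no placeholder, or a loop
            break
        seen.add(target)
        url = target
    return url
```

A database would store the result of resolve in place of the old pointer, and
a browser could do the same for a bookmark.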

Final remarks

I believe that the http servers should also provide two tables for the
crawlers, keeping track on a daily schedule of what is appearing on and
disappearing from the site (do they do that already?). This will at least
filter out some noise related to changes in old documents or in documents not
following this recommendation. Still, these tables will not be able to
suggest the removal of outdated information when a file remains 'forgotten'
on the site, and they will not guarantee a 'right on time' update of the
databases when the scheduled document update takes place.
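
To make the idea of the two tables concrete, here is a small Python sketch
that derives them from two daily listings of the documents on a site. The
function name daily_tables and the example paths are hypothetical; how a real
server would obtain the listings is left open.

```python
def daily_tables(yesterday, today):
    """Return (appeared, disappeared) as sorted lists of document paths."""
    yesterday, today = set(yesterday), set(today)
    appeared = sorted(today - yesterday)
    disappeared = sorted(yesterday - today)
    return appeared, disappeared


# Example with made-up paths.
appeared, disappeared = daily_tables(
    ["/index.html", "/programme.html", "/old.html"],
    ["/index.html", "/programme.html", "/news.html"],
)
```

Note that /programme.html appears in neither table even if its contents
changed or expired in the meantime, which is exactly the limitation described
above.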

What if the update does not take place as scheduled?
Well, I can generate more ideas to trap oddities and codify the behaviour, but
I think I've already written too much. If this thread goes on, everything
will eventually be ironed out.

Please consider that, while these problems and their consequences are now
annoying but still manageable, the number of Web documents is growing
'exponentially', so we are going to face an incredible amount of junk
information.

Thanks for your attention, yours faithfully,
Michele Bassan

Via XXIV maggio, 10
35010 Vigonza - Padova - Italy
michele@pdigi3.igi.pd.cnr.it
i3@intercity.shiny.it
http://intercity.shiny.it/i3