Re: Document Indexing -- How to index Dynamic Content?

| As we all know, many HTML documents are generated dynamically. To 
| indicate this fact, most HTTP servers omit the Last-modified: HTTP 
| response header field when returning dynamically generated content. 
| This is reasonable, but also very crude, as often the bulk of a 
| 'dynamic' document does not vary, and actually has a well defined
| last-modification date, with only a small portion being varied
| on a regular basis. For example, my organization uses parsed 
| HTML (*.shtml) for many home pages -- the parsing simply introduces 
| a few lines of 'news of the day' text, with the majority of the 
| document being invariant directory-like information for the site. 
| We would like these pages to be indexed by web robots, as they
| represent useful indices for the site.

This is an error, or an oversight, on the part of the script authors.
When I finish writing my scripts I very much hope they will be able to
determine an "approximate last modified" time and sometimes even an
"expires" time...
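
As a rough sketch of how a script might work out those two times (the
function name, its parameters, and the idea of a known refresh interval
are illustrative assumptions, not part of any server's API): take the
newest modification time among the page's inputs as the approximate
last-modified time, and the next scheduled regeneration as the expiry.

```python
from email.utils import formatdate


def freshness_headers(source_mtimes, now, refresh_seconds):
    """Return (name, value) HTTP header pairs for a generated page.

    source_mtimes   -- Unix timestamps of every input the page is built from
                       (static template, news-of-the-day file, ...)
    now             -- current Unix time
    refresh_seconds -- assumed interval at which the dynamic part changes
    """
    # The page is at least as new as its most recently changed input,
    # but never newer than the moment it is served.
    last_modified = min(max(source_mtimes), now)
    # Assume the page goes stale when the next regeneration is due.
    expires = now + refresh_seconds
    return [
        ("Last-Modified", formatdate(last_modified, usegmt=True)),
        ("Expires", formatdate(expires, usegmt=True)),
    ]
```

A CGI script would print these pairs ahead of the blank line that ends
its header block. For a page whose "news" portion changes daily,
refresh_seconds would be 86400.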

| This means an HTML-based mechanism for indicating the status of 
| blocks of the document. In the following I describe two possible 
| mechanisms for doing this.  One requires no changes to HTML -- 
| just an agreed upon semantic for an attribute value. The second 
| requires a simple change to HTML, with the benefit of providing 
| somewhat greater information content.  In both cases, I assume 
| that the default behavior of an indexing tool is to index the 
| document content, provided the document is delivered with an 
| 'appropriate' HTTP last-modified: response header field.

Flat out, most robots should not index anything that changes more often
than they can sample it... but how do they know that?  I have been very
displeased with the search services I have seen, from start to finish:
they index pages that change too often, they miss other pages that
change only moderately, they forget to remove pages that no longer
exist, etc.
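
One way a robot could "know that" is to compare the response headers
against its own revisit interval. The following is only a sketch under
stated assumptions: should_index and its parameters are hypothetical
names, headers are assumed to arrive as a dict with lower-case keys, and
a missing Last-Modified is treated as "dynamic, skip it", which is the
default behavior described in the quoted text rather than any standard.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def should_index(headers, fetched_at, revisit_seconds):
    """Decide whether a robot should index a fetched page.

    headers         -- dict of lower-cased response header names to values
    fetched_at      -- Unix time at which the page was fetched
    revisit_seconds -- how long until this robot samples the page again
    """
    # No Last-Modified: assume the content is dynamic and skip it.
    if "last-modified" not in headers:
        return False

    expires = headers.get("expires")
    if expires is None:
        return True  # nothing says the page will change before we return

    try:
        when = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        return False  # unparseable Expires: treat as already stale
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # HTTP dates are GMT

    # Skip pages that will expire before the robot can sample them again.
    return when.timestamp() >= fetched_at + revisit_seconds
```

The same Expires comparison, run against the stored index on each pass,
would also flag entries that are overdue for a re-fetch or for removal.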

Received on Wednesday, 6 November 1996 23:48:43 UTC