Re: Document Indexing -- How to index Dynamic Content?

Ian Graham wrote:
++ 
++ As we all know, many HTML documents are generated dynamically. To 
++ indicate this fact, most HTTP servers omit the Last-modified: HTTP 
++ response header field when returning dynamically generated content. 
++ This is reasonable, but also very crude, as often the bulk of a 
++ 'dynamic' document does not vary, and actually has a well defined
++ last-modification date, with only a small portion being varied
++ on a regular basis. For example, my organization uses parsed 

Well, then any 'Last-modified' header would have to give the current
time anyway. Last-modified means last modified, not last large
modification.

++ HTML (*.shtml) for many home pages -- the parsing simply introduces 
++ a few lines of 'news of the day' text, with the majority of the 
++ document being invariant directory-like information for the site. 
++ We would like this page to be indexed by web robots, as they
++ represent useful indices for the site.
++ 
++ Of course, they are not, since they are served without a defined
++ last-modified date.  It would be nice if robots could instead
++ index the non-varying content, and ignore the portion corresponding 
++ to the 'news of the day'.
++ 
++ This means an HTML-based mechanism for indicating the status of 
++ blocks of the document. In the following I describe two possible 
++ mechanisms for doing this.  One requires no changes to HTML -- 
++ just an agreed upon semantic for an attribute value. The second 
++ requires a simple change to HTML, with the benefit of providing 
++ somewhat greater information content.  In both cases, I assume 
++ that the default behavior of an indexing tool is to index the 
++ document content, provided the document is delivered with an 
++ 'appropriate' HTTP last-modified: response header field.
++ 
++ 1. Indicate Blocks That Should Not Be Indexed
++ 
++ 2. Indicating Text Block Expiry Date
++ 
++ Any thoughts/suggestions/flames? 


There is a third alternative: give the robots different documents;
documents that contain only those parts you want to see indexed.  Of
course, this only works for robots you know (you can use the User-Agent
request header field to find out who's a robot). To exclude the other
robots, use your /robots.txt.
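
As a rough illustration (the robot names and file names here are made
up), a CGI script along these lines could do the serving; anything that
does not identify itself as a robot you recognise gets the normal
parsed page:

  #!/usr/bin/env python
  # Hypothetical sketch only: hand known robots a static, index-only copy
  # of the page, and hand everyone else the normal page with the daily
  # news in it.
  import os
  import sys

  # Example User-Agent substrings of robots you have decided to recognise.
  KNOWN_ROBOTS = ("Scooter", "Lycos", "ArchitextSpider")

  def is_known_robot(user_agent):
      return any(name.lower() in user_agent.lower() for name in KNOWN_ROBOTS)

  if is_known_robot(os.environ.get("HTTP_USER_AGENT", "")):
      page = "index-static.html"   # invariant, directory-like content only
  else:
      page = "index-full.html"     # full page, including 'news of the day'

  sys.stdout.write("Content-Type: text/html\r\n\r\n")
  with open(page) as f:
      sys.stdout.write(f.read())

The check is deliberately simple; the point is just that the decision
is made per request, on the server side.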

The main advantage is that it doesn't need any cooperation from the
robots (except for the robots-exclusion protocol, which most robots
seem to honour).


A fourth method would of course be not to put static and dynamic
content in one document.
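
A rough sketch of that split, with made-up file names: whatever job
produces the 'news of the day' writes it to its own document, and the
directory page is a plain static file that merely links to it, so it
keeps an honest Last-modified date.

  #!/usr/bin/env python
  # Hypothetical sketch only: regenerate just the news document (say,
  # daily from cron); index.html is an ordinary static file that links
  # to it and is never rewritten, so it keeps a meaningful Last-modified.
  import time

  NEWS_PAGE = "news-of-the-day.html"   # made-up file name

  def write_news(items):
      with open(NEWS_PAGE, "w") as f:
          f.write("<html><head><title>News of the day</title></head><body>\n")
          f.write("<p>%s</p>\n<ul>\n" % time.strftime("%d %b %Y"))
          for item in items:
              f.write("<li>%s</li>\n" % item)
          f.write("</ul></body></html>\n")

  if __name__ == "__main__":
      write_news(["Example news item"])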


Abigail

Received on Wednesday, 6 November 1996 19:33:49 UTC