Document Indexing -- How to index Dynamic Content?

As we all know, many HTML documents are generated dynamically. To 
indicate this fact, most HTTP servers omit the Last-modified: HTTP 
response header field when returning dynamically generated content. 
This is reasonable, but also very crude, as often the bulk of a 
'dynamic' document does not vary, and actually has a well defined
last-modification date, with only a small portion being varied
on a regular basis. For example, my organization uses parsed 
HTML (*.shtml) for many home pages -- the parsing simply introduces 
a few lines of 'news of the day' text, with the majority of the 
document being invariant directory-like information for the site. 
We would like these pages to be indexed by web robots, as they
represent useful indices for the site.

Of course, they are not, since they are served without a defined
last-modified date.  It would be nice if robots could instead
index the non-varying content, and ignore the portion corresponding 
to the 'news of the day'.

This calls for an HTML-based mechanism for indicating the status of 
blocks of the document. In the following I describe two possible 
mechanisms for doing this.  One requires no changes to HTML -- 
just an agreed upon semantic for an attribute value. The second 
requires a simple change to HTML, with the benefit of providing 
somewhat greater information content.  In both cases, I assume 
that the default behavior of an indexing tool is to index the 
document content, provided the document is delivered with an 
'appropriate' HTTP last-modified: response header field.

1. Indicate Blocks That Should Not Be Indexed

   Blocks of the document that should not be indexed can be
   simply marked using a name token value "noindex" within a 
   CLASS attribute. For example:
 
   .... Some interesting document content...
   <P CLASS="noindex"> This block should not be indexed, as 
   it is generated dynamically and is updated every hour. </P>

   Implications for Browsers: -- None,  as far as I can tell. I 
   have assumed that CLASS can take multiple name tokens (to allow 
   for CSS-related values in addition to this special value), and 
   that CSS will simply ignore tokens for which there are no style 
   instructions. Is this true?

   Implications for Indexers: -- Robots that don't understand this
   would incorrectly index content that should not be indexed.  
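
On the robot side, honouring the CLASS convention above should take
very little code. What follows is only a minimal sketch in Python, not
a description of any existing robot: the names NoIndexFilter and
indexable_text are hypothetical, and it assumes that every marked
element carries an explicit end tag.

```python
from html.parser import HTMLParser


class NoIndexFilter(HTMLParser):
    """Collect text, skipping elements whose CLASS contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.chunks = []       # text an indexer would keep
        self.skip_tag = None   # name of the element that opened the skipped block
        self.skip_depth = 0    # nesting count of same-named elements inside it

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            # Already inside a skipped block: only track same-named nesting.
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        # CLASS may hold several name tokens; look for the special one.
        classes = (dict(attrs).get("class") or "").lower().split()
        if "noindex" in classes:
            self.skip_tag = tag
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.skip_tag is not None and tag == self.skip_tag:
            self.skip_depth -= 1
            if self.skip_depth == 0:
                self.skip_tag = None

    def handle_data(self, data):
        if self.skip_tag is None:
            self.chunks.append(data)


def indexable_text(html):
    """Return the whitespace-normalised text outside any 'noindex' block."""
    parser = NoIndexFilter()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Feeding it a page such as
'Site directory. <P CLASS="noindex news">News of the day.</P> More links.'
would leave only the directory text for the index. An unclosed
element marked "noindex" would swallow the rest of the document, so a
real robot would need to know which end tags are optional.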
 
2. Indicating Text Block Expiry Date

   Unfortunately, times and dates cannot be placed in CLASS, as they
   cannot be expressed as name tokens. Thus including this 
   information requires a new attribute. An example might be:

   <DIV EXPIRES="Wed, 06 Nov 1996 22:29:28 GMT">
      <P> This block should not be indexed, as it is generated
      dynamically and is updated every hour.
   </DIV>

   Where the syntax for the time/date field must be one of those 
   supported by the HTTP protocol. 

   Implications for Browsers: -- None, in principle, provided they are
   smart enough to discard attributes they do not understand. This
   would mean a revision to HTML, and another general-purpose
   attribute (ID, LANG, CLASS, EXPIRES).  Messy.

   Implications for Indexers: -- Robots that don't understand this
   would incorrectly index content that should not be indexed.  
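
How a robot should act on an EXPIRES value is left open above. One
possible policy, sketched here in Python purely as an assumption, is
to index a block only if it will remain valid for some minimum time.
The function block_is_indexable and its min_lifetime parameter are
hypothetical; the date parsing relies on the standard library's mail
date parser, which accepts the usual HTTP date form.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def block_is_indexable(expires_value, min_lifetime, now=None):
    """Return True if the block stays valid for at least min_lifetime.

    expires_value is the raw EXPIRES attribute string; min_lifetime is a
    timedelta chosen by the robot (an assumed policy, not part of HTML).
    """
    now = now or datetime.now(timezone.utc)
    try:
        expires = parsedate_to_datetime(expires_value)
    except (TypeError, ValueError):
        # Unparseable date: be conservative and treat the block as expired.
        return False
    if expires.tzinfo is None:
        # No usable zone information: assume GMT, as HTTP dates are in GMT.
        expires = expires.replace(tzinfo=timezone.utc)
    return expires - now >= min_lifetime
```

With the example date above and a robot that revisits daily
(min_lifetime of one day), a check made a few hours before the expiry
time would reject the block, while a robot content with one hour of
validity would accept it.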

Note that, in both cases, the server has to provide appropriate 
last-modified: headers for parsed documents or CGI program output. 
The former is straightforward for NCSA SSI-style includes -- 
just use the last-modified date for the base document containing 
the SSI instructions. At the same time, the document would need 
to be rewritten to include the correct "noindex" or other attributes 
around the included blocks. In the latter (CGI) case, this is 
simply a matter of rewriting the CGI programs to return both 
the correct response headers and HTML attributes.
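
For the SSI case, deriving the header from the base document's
modification time might look like the Python sketch below. The
function names http_date and last_modified_for are made up for
illustration; a real server would do this inside its own response
handling rather than in a separate script.

```python
import os
from email.utils import formatdate


def http_date(timestamp):
    """Format a Unix timestamp as an HTTP date string in GMT."""
    return formatdate(timestamp, usegmt=True)


def last_modified_for(path):
    """Build a Last-Modified header line from the file's modification time."""
    return "Last-Modified: " + http_date(os.path.getmtime(path))
```

A CGI program could print the same kind of line ahead of its output,
using whatever timestamp it considers the last change to the
non-varying part of the page.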

Any thoughts/suggestions/flames? 

Ian
--
Ian Graham ................................ ian.graham@utoronto.ca
Information Commons                              Tel: 416-978-4548
University of Toronto                            Fax: 416-978-0440

Received on Wednesday, 6 November 1996 18:44:34 UTC