- From: Ian Graham <ianweb@smaug.java.utoronto.ca>
- Date: Wed, 6 Nov 1996 18:44:34 -0500 (EST)
- To: www-html@w3.org
As we all know, many HTML documents are generated dynamically. To indicate this fact, most HTTP servers omit the Last-Modified: HTTP response header field when returning dynamically generated content. This is reasonable, but also very crude, as often the bulk of a 'dynamic' document does not vary, and actually has a well-defined last-modification date, with only a small portion being varied on a regular basis.

For example, my organization uses parsed HTML (*.shtml) for many home pages -- the parsing simply introduces a few lines of 'news of the day' text, with the majority of the document being invariant directory-like information for the site. We would like these pages to be indexed by web robots, as they represent useful indices for the site. Of course, they are not, since they are served without a defined last-modified date.

It would be nice if robots could instead index the non-varying content, and ignore the portion corresponding to the 'news of the day'. This requires an HTML-based mechanism for indicating the status of blocks of the document.

In the following I describe two possible mechanisms for doing this. One requires no changes to HTML -- just agreed-upon semantics for an attribute value. The second requires a simple change to HTML, with the benefit of providing somewhat greater information content. In both cases, I assume that the default behavior of an indexing tool is to index the document content, provided the document is delivered with an 'appropriate' HTTP Last-Modified: response header field.

1. Indicate Blocks That Should Not Be Indexed

Blocks of the document that should not be indexed can be marked simply by using the name token value "noindex" within a CLASS attribute. For example:

   .... Some interesting document content...
   <P CLASS="noindex">
   This block should not be indexed, as it is generated
   dynamically and is updated every hour.
   </P>

Implications for Browsers:

-- None, as far as I can tell.
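To make the indexer side of this first mechanism concrete, here is a rough sketch in Python of how a robot might drop "noindex" blocks before indexing. Nothing here is an existing robot's behavior: the class name IndexableTextExtractor and the function indexable_text are hypothetical, and the sketch assumes that CLASS holds whitespace-separated name tokens and that every non-empty element inside a "noindex" block has an explicit end tag.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class IndexableTextExtractor(HTMLParser):
    """Collect document text, skipping elements whose CLASS contains "noindex"."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a "noindex" element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a skipped block: just track the nesting.
            self.skip_depth += 1
            return
        # Assumption: CLASS is a whitespace-separated list of name tokens.
        tokens = (dict(attrs).get("class") or "").lower().split()
        if "noindex" in tokens:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def indexable_text(html):
    """Return the text a robot would index, with whitespace normalized."""
    parser = IndexableTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

A robot would then feed indexable_text(page) to its indexer instead of the full page text. Note that an unclosed <P> inside a "noindex" block would confuse the depth counter, so a real robot would need a more forgiving parser.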
I have assumed that CLASS can take multiple name tokens (to allow for CSS-related values in addition to this special value), and that CSS will simply ignore tokens for which there are no style instructions. Is this true?

Implications for Indexers:

-- Robots that don't understand this would incorrectly index content that should not be indexed.

2. Indicating Text Block Expiry Date

Unfortunately, times and dates cannot be placed in CLASS, as they cannot be expressed as name tokens. Thus including this information requires a new attribute. An example might be:

   <DIV EXPIRES="Wed, 06 Nov 1996 22:29:28 GMT">
   <P> This block should not be indexed, as it is generated
   dynamically and is updated every hour.
   </DIV>

where the syntax for the time/date field must be one of those supported by the HTTP protocol.

Implications for Browsers:

-- None, in principle, provided they are smart enough to discard attributes they do not understand. This would mean a revision to HTML, and another general-purpose attribute (ID, LANG, CLASS, EXPIRES). Messy.

Implications for Indexers:

-- Robots that don't understand this would incorrectly index content that should not be indexed.

Note that, in both cases, the server has to provide appropriate Last-Modified: headers for parsed documents or CGI program output. The former is straightforward for NCSA SSI-style includes -- just use the last-modified date of the base document containing the SSI instructions. At the same time, the document would need to be rewritten to include the correct "noindex" or other attributes around the included blocks. In the latter (CGI) case, this is simply a matter of rewriting the CGI programs to return both the correct response headers and HTML attributes.

Any thoughts/suggestions/flames?

Ian
--
Ian Graham ................................ ian.graham@utoronto.ca
Information Commons                          Tel: 416-978-4548
University of Toronto                        Fax: 416-978-0440
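A postscript on the date handling that both ends would need: the sketch below, in Python, shows one way a server-side script might format a Last-Modified value and one way a robot might test an EXPIRES value. The function names http_date and block_has_expired are hypothetical, and treating an unparseable date as already expired is my own assumption, chosen so that a doubtful block is simply not indexed.

```python
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime


def http_date(epoch_seconds):
    """Format a Unix timestamp (e.g. a file's mtime) as an HTTP date in GMT."""
    return formatdate(epoch_seconds, usegmt=True)


def block_has_expired(expires_value, now=None):
    """Return True if the EXPIRES value lies at or before `now`.

    Assumption: an unparseable value counts as expired, so the block
    is left out of the index rather than indexed by mistake.
    """
    now = now or datetime.now(timezone.utc)
    try:
        expires = parsedate_to_datetime(expires_value)
    except (TypeError, ValueError):
        return True
    if expires.tzinfo is None:
        # No usable zone information: assume GMT, as HTTP dates require.
        expires = expires.replace(tzinfo=timezone.utc)
    return expires <= now
```

For the SSI case, http_date(os.path.getmtime(path)) applied to the base document would give the header value; a CGI program would pass whatever modification time it considers authoritative.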
Received on Wednesday, 6 November 1996 18:44:34 UTC