- From: Ian Graham <ianweb@smaug.java.utoronto.ca>
- Date: Wed, 6 Nov 1996 18:44:34 -0500 (EST)
- To: www-html@w3.org
As we all know, many HTML documents are generated dynamically. To
indicate this fact, most HTTP servers omit the Last-modified: HTTP
response header field when returning dynamically generated content.
This is reasonable, but also very crude, as often the bulk of a
'dynamic' document does not vary, and actually has a well defined
last-modification date, with only a small portion being varied
on a regular basis. For example, my organization uses parsed
HTML (*.shtml) for many home pages -- the parsing simply introduces
a few lines of 'news of the day' text, with the majority of the
document being invariant directory-like information for the site.
We would like these pages to be indexed by web robots, as they
represent useful indices for the site.
Of course, they are not, since they are served without a defined
last-modified date. It would be nice if robots could instead
index the non-varying content, and ignore the portion corresponding
to the 'news of the day'.
This requires an HTML-based mechanism for indicating the status of
blocks of the document. In the following I describe two possible
mechanisms for doing this. One requires no changes to HTML --
just an agreed upon semantic for an attribute value. The second
requires a simple change to HTML, with the benefit of providing
somewhat greater information content. In both cases, I assume
that the default behavior of an indexing tool is to index the
document content, provided the document is delivered with an
'appropriate' HTTP last-modified: response header field.
1. Indicate Blocks That Should Not Be Indexed
Blocks of the document that should not be indexed can be
simply marked using a name token value "noindex" within a
CLASS attribute. For example:
.... Some interesting document content...
<P CLASS="noindex"> This block should not be indexed, as
it is generated dynamically and is updated every hour. </P>
Implications for Browsers: -- None, as far as I can tell. I
have assumed that CLASS can take multiple name tokens (to allow
for CSS-related values in addition to this special value), and
that CSS will simply ignore tokens for which there are no style
instructions. Is this true?
Implications for Indexers: -- Robots that don't understand this
would incorrectly index content that should not be indexed.
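To make the indexer side of this concrete, here is a rough sketch in Python of how a robot might drop "noindex" blocks before indexing. The names NoIndexFilter and indexable_text are made up for illustration, and the sketch assumes that marked blocks carry explicit end tags and that CLASS holds space-separated tokens; neither is guaranteed by anything above.

```python
from html.parser import HTMLParser

# Elements that never take an end tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "base", "area"}


class NoIndexFilter(HTMLParser):
    """Collects text that lies outside any element whose CLASS includes "noindex"."""

    def __init__(self):
        super().__init__()
        self.skip_tag = None  # tag name that opened the current noindex block
        self.depth = 0        # nesting depth of that same tag inside the block
        self.chunks = []      # text fragments that may be indexed

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_tag is not None:
            # Already inside a noindex block: only track same-name nesting.
            if tag == self.skip_tag:
                self.depth += 1
            return
        # Assumption: CLASS may hold several space-separated tokens.
        classes = (dict(attrs).get("class") or "").lower().split()
        if "noindex" in classes:
            self.skip_tag = tag
            self.depth = 1

    def handle_endtag(self, tag):
        if self.skip_tag is not None and tag == self.skip_tag:
            self.depth -= 1
            if self.depth == 0:
                self.skip_tag = None

    def handle_data(self, data):
        if self.skip_tag is None:
            self.chunks.append(data)


def indexable_text(html):
    """Return the document text with noindex blocks removed."""
    parser = NoIndexFilter()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

A robot would then feed indexable_text(page) to its indexer instead of the raw page text. A block whose end tag is omitted (legal for P) would defeat this simple depth counting, so a real robot would need a proper HTML parser.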
2. Indicating Text Block Expiry Date
Unfortunately, times and dates cannot be placed in CLASS, as they
cannot be expressed as name tokens. Thus including this
information requires a new attribute. An example might be:
<DIV EXPIRES="Wed, 06 Nov 1996 22:29:28 GMT">
<P> This block should not be indexed, as it is generated
dynamically and is updated every hour.
</DIV>
Where the syntax for the time/date field must be one of those
supported by the HTTP protocol.
Implications for Browsers: -- None, in principle, provided they are
smart enough to discard attributes they do not understand. This
would mean a revision to HTML, and another general-purpose
attribute (ID, LANG, CLASS, EXPIRES). Messy.
Implications for Indexers: -- Robots that don't understand this
would incorrectly index content that should not be indexed.
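As a sketch of what an indexer might do with such an attribute, the Python below decides whether a block is still current. The function name block_is_current is hypothetical, and I assume the value uses the RFC 1123 form of the HTTP date (with a comma after the day name); the other HTTP date forms are not handled here. Treating an unparseable date as expired is my own conservative choice, not part of the proposal.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def block_is_current(expires_value, now=None):
    """Return True if the EXPIRES date of a block lies in the future."""
    if now is None:
        now = datetime.now(timezone.utc)
    try:
        expires = parsedate_to_datetime(expires_value)
    except (TypeError, ValueError):
        # Unparseable date: conservatively treat the block as expired.
        return False
    if expires.tzinfo is None:
        # No usable zone information: assume GMT, as HTTP dates require.
        expires = expires.replace(tzinfo=timezone.utc)
    return expires > now
```

An indexer could skip any block for which block_is_current returns False, or skip every block carrying EXPIRES at all if it only wants the invariant content.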
Note that, in both cases, the server has to provide appropriate
last-modified: headers for parsed documents or CGI program output.
The former is straightforward for NCSA SSI-style includes --
just use the last-modified date for the base document containing
the SSI instructions. At the same time, the document would need
to be rewritten to include the correct "noindex" or other attributes
around the included blocks. In the latter (CGI) case, this is
simply a matter of rewriting the CGI programs to return both
the correct response headers and HTML attributes.
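For the server side, a minimal sketch in Python of building the header from the modification time of the base document might look like this. The helper names http_date and last_modified_header are invented for illustration, and I assume the base document is an ordinary file that the program can stat.

```python
import os
from email.utils import formatdate


def http_date(timestamp):
    """Format a POSIX timestamp as an RFC 1123 date in GMT."""
    return formatdate(timestamp, usegmt=True)


def last_modified_header(path):
    """Build a Last-Modified header line from the file's modification time."""
    return "Last-Modified: " + http_date(os.path.getmtime(path))
```

A CGI program would print this line along with its other response headers, followed by a blank line and then the HTML carrying the "noindex" or EXPIRES markup.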
Any thoughts/suggestions/flames?
Ian
--
Ian Graham ................................ ian.graham@utoronto.ca
Information Commons Tel: 416-978-4548
University of Toronto Fax: 416-978-0440
Received on Wednesday, 6 November 1996 18:44:34 UTC