Re: Document Indexing -- How to index Dynamic Content?

Ian Graham (ianweb@smaug.java.utoronto.ca)
Thu, 7 Nov 1996 12:41:41 -0500 (EST)


From: ianweb@smaug.java.utoronto.ca (Ian Graham)
Message-Id: <199611071741.MAA26243@smaug.java.utoronto.ca>
Subject: Re: Document Indexing -- How to index Dynamic Content?
To: abigail@ny.fnx.com
Date: Thu, 7 Nov 1996 12:41:41 -0500 (EST)
Cc: www-html@w3.org
In-Reply-To: <199611070035.TAA04133@melgor.ny.fnx.com> from "Abigail" at Nov 6, 96 07:35:30 pm

> Ian Graham wrote:
> ++ 
> ++ As we all know, many HTML documents are generated dynamically. To 
> ++ indicate this fact, most HTTP servers omit the Last-modified: HTTP 
> ++ response header field when returning dynamically generated content. 
> ++ This is reasonable, but also very crude, as often the bulk of a 
> ++ 'dynamic' document does not vary, and actually has a well defined
> ++ last-modification date, with only a small portion being varied
> ++ on a regular basis. For example, my organization uses parsed 
> 
> Well, then any 'last-modified' header would have to give the current
> time anyway. Last-modified means last modified, not last large
> modification.

Actually, the HTTP spec seems very liberal in this regard -- one 
could quite legally 'cheat', and use the last large modification 
time.  In any event, it is the complete omission of last-modified 
headers that causes the most problems.

> ++ 
> ++ 1. Indicate Blocks That Should Not Be Indexed
> ++ 
> ++ 2. Indicating Text Block Expiry Date
> ++ 
> ++ Any thoughts/suggestions/flames? 
> 
> There is a third alternative. Give the robots different documents;
> documents which only contain those parts you want to see indexed.  Of
> course, this only works for robots you know (you can use the user agent
> field to find out who's a robot). To exclude other robots, use your
> /robots.txt.

This is possible, but would mean maintaining two sets of documents.
It would also fail for 'super-bookmarking' tools that index the
entire document content.  I'd rather put the information in the
data, and not have to worry about which agent was accessing the 
content.  
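For reference, the user-agent test Abigail describes might be sketched as follows. The robot names and file names here are purely illustrative assumptions, not a real list:

```python
# Hypothetical sketch of serving robots a stripped-down document,
# identified by the User-Agent request header (exposed to CGI scripts
# as the HTTP_USER_AGENT environment variable).
import os

KNOWN_ROBOTS = ("lycos", "scooter", "infoseek")  # assumed names only

def is_robot(user_agent):
    """Crude check: does the User-Agent match a known indexer?"""
    ua = user_agent.lower()
    return any(name in ua for name in KNOWN_ROBOTS)

def choose_document():
    """Pick the static-only page for robots, the full page otherwise."""
    ua = os.environ.get("HTTP_USER_AGENT", "")
    return "static-only.html" if is_robot(ua) else "full-dynamic.html"
```

As the reply above notes, this only helps with robots you can recognize, and does nothing for tools that present themselves as ordinary browsers.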
  
> The main advantage is, that it doesn't need any cooperation of
> robots (except for the robot-exclusion protocol most robots seem
> to use).

This is a good point, but as I mention above, this feature has 
utility beyond the context of robot indexers.

> A fourth method would of course be not to put static and dynamic
> content in one document.
>
> Abigail

We explicitly rejected this option, for several reasons.  First, the 
dynamic content was generally a short paragraph in the middle of a 
longer, static document -- frames simply broke the page into
three parts, and made it awkward and unwieldy to read. Second, we 
need to support non-frame-capable browsers, which means we'd also 
need a frame-free version -- and we didn't want to get into the 
complicated process of maintaining duplicate versions.  

Ian