Re: Document Indexing -- How to index Dynamic Content?
Ian Graham (ianweb@smaug.java.utoronto.ca)
Thu, 7 Nov 1996 12:41:41 -0500 (EST)
Message-Id: <199611071741.MAA26243@smaug.java.utoronto.ca>
To: abigail@ny.fnx.com
Cc: www-html@w3.org
In-Reply-To: <199611070035.TAA04133@melgor.ny.fnx.com> from "Abigail" at Nov 6, 96 07:35:30 pm
> Ian Graham wrote:
> ++
> ++ As we all know, many HTML documents are generated dynamically. To
> ++ indicate this fact, most HTTP servers omit the Last-modified: HTTP
> ++ response header field when returning dynamically generated content.
> ++ This is reasonable, but also very crude, as often the bulk of a
> ++ 'dynamic' document does not vary, and actually has a well defined
> ++ last-modification date, with only a small portion being varied
> ++ on a regular basis. For example, my organization uses parsed
>
> Well, then any 'last-modified' header would have to give the current
> time anyway. Last-modified means last modified, not last large
> modification.
Actually, the HTTP spec seems very liberal in this regard -- one
could quite legally 'cheat', and use the last large modification
time. In any event, it is the complete omission of last-modified
headers that causes the most problems.
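As a rough illustration of the 'cheat' described above, a server-side script could report the newest timestamp among the static parts of a page and ignore the dynamic fragment. This is only a sketch: the function name and the idea of passing in a list of timestamps are assumptions made for the example, not something any particular server does.

```python
from email.utils import formatdate


def last_modified_header(static_mtimes):
    """Build a Last-Modified header line from the static parts of a page.

    static_mtimes is an iterable of POSIX timestamps (seconds since the
    epoch), one per static component. Dynamic fragments are deliberately
    left out, so the header reflects the last large modification.
    (Hypothetical helper, for illustration only.)
    """
    newest = max(static_mtimes)
    # usegmt=True produces the GMT date form used in HTTP headers.
    return "Last-Modified: " + formatdate(newest, usegmt=True)
```

A script would call this with, for instance, the modification times of the template files it assembles, and emit the result along with its other response headers.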
> ++
> ++ 1. Indicate Blocks That Should Not Be Indexed
> ++
> ++ 2. Indicating Text Block Expiry Date
> ++
> ++ Any thoughts/suggestions/flames?
>
> There is a third alternative. Give the robots different documents;
> documents which only contain those parts you want to see indexed. Of
> course, this only works for robots you know (you can use the user agent
> field to find out who's a robot). To exclude other robots, use your
> /robots.txt.
This is possible, but would mean maintaining two sets of documents.
It would also fail for 'super-bookmarking' tools that index
the entire document content. I'd rather put the information in the
data, and not have to worry about which agent was accessing the
content.
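To make 'putting the information in the data' concrete, here is a minimal sketch of how any indexing tool might drop marked blocks before indexing. The comment markers `<!-- noindex -->` and `<!-- /noindex -->` are purely hypothetical, chosen for this example; no such syntax has been agreed on in this thread.

```python
import re

# Hypothetical markers, assumed for illustration only:
#   <!-- noindex --> ... <!-- /noindex -->
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def indexable_text(html):
    """Return the document with every marked block removed."""
    # The non-greedy match keeps separate marked blocks separate,
    # so static text between two dynamic blocks is preserved.
    return NOINDEX_BLOCK.sub("", html)
```

Because the markers travel with the document, the same filtering works for a robot, a bookmarking tool, or anything else that reads the content, without the server needing to know which agent is asking.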
> The main advantage is, that it doesn't need any cooperation of
> robots (except for the robot-exclusion protocol most robots seem
> to use).
This is a good point. But, as I mention above, this feature has
utility beyond the context of robot indexers.
> A fourth method would of course be not to put static and dynamic
> content in one document.
>
> Abigail
We explicitly rejected this option, for several reasons. First, the
dynamic content was generally a short paragraph in the middle of a
longer, static document -- separating it out with frames simply broke
the page into three parts, and made it awkward and unwieldy to read. Second, we
need to support users without frame-capable browsers, which means we'd
need a non-frame version as well -- and we didn't want to get into the
complicated process of maintaining duplicate versions.
Ian