Re: Date verification in HTML pages

Last-Modified is part of the HTTP 1.1 standard
(http://www.faqs.org/rfcs/rfc2616.html). As David discussed, there's
simply no way for the specification to require accuracy.

For authors who want to opt-in to standardizing the date/time format,
there are several efforts, including from the microformats community:

http://microformats.org/wiki/last-modified-examples 
http://microformats.org/wiki/datetime-design-pattern 

There's also been discussion in the WHATWG (http://www.whatwg.org/) to
add a date element to the Web Applications 1.0 specification, which
could pave the way for submission to the W3C. It would still require
opt-in from the authors, and probably a microformat for
machine-readability, and there's still the issue of trust: who's to
say the information is correct?

Vignesh, it seems like you need to convince the search engines to
provide this information to you. Since they crawl the sites, they could
certainly compare copies and decide when a site has been updated
(although that could always be gamed with nonsense changes to the
content).
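
As a rough illustration of "compare copies", here is a minimal Python
sketch. The function names and the normalisation rules (crude tag
stripping, whitespace collapsing, lowercasing) are my own assumptions,
not anything a search engine is known to do, and any real text change,
nonsense or not, still counts as an update:

```python
import hashlib
import re


def fingerprint(html):
    """Hash the visible text of a page, ignoring markup and whitespace.

    Hypothetical helper: the tag stripping is deliberately crude and
    would need a real HTML parser for anything serious.
    """
    text = re.sub(r"<[^>]*>", " ", html)   # drop anything that looks like a tag
    text = " ".join(text.split()).lower()  # collapse whitespace, fold case
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(old_copy, new_copy):
    """Return True if two crawled copies differ in their visible text."""
    return fingerprint(old_copy) != fingerprint(new_copy)
```

A crawler would store the fingerprint with each fetch and record the
fetch time whenever it changes, which gives an upper bound on the
modification date rather than the date itself.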

Ed.

>>> David Woolley <david@djwhome.demon.co.uk> 10/12/2005 2:36:11 AM >>>

> Is there a credible way of verifying this date or if not could it be
> enforced by the consortium in future HTML versions?

No and no.  

Modification date is actually part of the metadata and is obtained
from the filesystem for local HTML resources and from the HTTP
protocol (and therefore not a W3C issue) for typical internet fetches.
However, most pages fetched from commercial sites these days are
actually created on the fly and therefore don't have a modification
date, as it would be the same as the Date: header.
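
A client could test for that case along these lines. This is only a
sketch: the function name, the plain-dict representation of the
response headers and the one-second tolerance are assumptions made for
illustration.

```python
from email.utils import parsedate_to_datetime


def lacks_useful_modification_date(headers, tolerance_seconds=1):
    """Guess whether a response carries no meaningful Last-Modified.

    `headers` is assumed to be a plain dict using the canonical header
    spellings "Date" and "Last-Modified".
    """
    last_modified = headers.get("Last-Modified")
    date = headers.get("Date")
    if last_modified is None:
        return True       # nothing was sent at all
    if date is None:
        return False      # nothing to compare against
    delta = parsedate_to_datetime(date) - parsedate_to_datetime(last_modified)
    # A Last-Modified that matches the response time suggests the page
    # was generated for this request.
    return abs(delta.total_seconds()) <= tolerance_seconds
```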

The reasons for creating on the fly tend to be commercial (e.g.
defeating caches to get better access statistics (often self delusion)
or changing and customising advertising on each access) rather than
related to the information payload that the user really wants.
Another factor is that a convention has developed of not just sending
the actual resource required but also sending navigation and branding
information, rather than simply linking to it.

Many of these could be addressed by more sophisticated use of cache
control parameters and by having server-side includes and more general
CGI processing synthesize a Last-Modified date based on the real
content, but there is very little commercial incentive for webmasters
to learn how to do this.  Any attempt by standards organisations to
make this mandatory will simply be ignored.
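
To make the idea concrete, here is a minimal Python sketch of
synthesizing the header for a generated page. It assumes the script
knows a timezone-aware modification time for each piece of real
content it assembles; the function names and that input shape are
hypothetical.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def synthesize_last_modified(part_times):
    """Newest modification time among the content parts, in UTC.

    `part_times` is assumed to be a non-empty list of timezone-aware
    datetimes. HTTP dates have one-second resolution, so microseconds
    are dropped.
    """
    newest = max(part_times)
    return newest.astimezone(timezone.utc).replace(microsecond=0)


def conditional_response(part_times, if_modified_since=None):
    """Return (status, headers), honouring If-Modified-Since if given."""
    last_modified = synthesize_last_modified(part_times)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: ignore it
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # the cached copy is still good
    return 200, headers
```

Advertising and navigation are deliberately left out of `part_times`,
so rotating them does not change the date the page reports.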

For most webmasters, the prime directive is to break most of HTTP 1.1
by frustrating any attempt to cache, so they really have no incentive
to provide correct modification date metadata.

Although this is really an IETF issue, not a W3C one, one could try
to remove the tight coupling with caching by introducing a primary 
content modification date that is separate from the overall page
modification date.  However, especially as, for the supplier, the
primary content is often the advertising, this is unlikely to be
used except by people who are already providing useful modification
date information.

One could also define a metadata profile for including this
information in meta elements, but with the same social engineering
problems.

Other reasons for losing modification dates are reloading pages onto
the server when a site is rebuilt and, in at least one case which
had no reason to defeat caching, because the content provider
maintained the site offline and re-FTPed it to the server every week
to make updates.




Received on Wednesday, 12 October 2005 14:49:01 UTC