Re: Orphan files from Ian Jacobs on 2010-08-21 (site-comments@w3.org from August 2010)

From: Ian Jacobs <ij@w3.org>
Date: Fri, 20 Aug 2010 21:47:52 -0500
To: Elizabeth Lloyd <elloydmalta@live.co.uk>
Cc: <site-comments@w3.org>
Message-Id: <B61B6885-2599-499C-856C-8D807FABD95F@w3.org>
On 4 Aug 2010, at 2:53 AM, Elizabeth Lloyd wrote:

> My query specifically relates to 'orphan files' e.g
> 'Guidelines on Dissemination of Information through Government  
> Websites':
> http://www.gov.hk/en/about/accessibility/docs/disseminationguidelines.pdf
>
> "33. Any out-dated or obsolete web pages should be removed from the
> production site. If these orphan pages are still retained on-line,  
> they may be
> accessible through the search results from search engines though no  
> navigation
> path to the obsolete page is available. This may result in users  
> getting
> incorrect or outdated information from the website."
>
> Scenario : a webpage is deleted from the website and an associated  
> file e.g pdf file it links to is not removed from the web server.   
> Subsequently this file is located in a search result and is loaded  
> into the web browser as a valid URL with the path to the location of  
> the file within the web server.
>
> What web server file management processes etc should be put in place  
> to avoid this file from still being publicly accessible?
>
> Are there procedures/ processes/ protocols for ensuring that the  
> associated obsolete/orphaned files previously linked to a webpage do  
> not still remain?

Hello, Elizabeth. Apologies for the delay in replying.

>
> What good practice standards in web server file management exist,  
> specifically relating to currency of files? For example, to avoid a  
> file that is years out of date from remaining on the server, being  
> accessible by a search engine and delivering inaccurate information.

I don't have a lot of experience with this, and I am not familiar with  
useful resources on this topic. (I welcome input from others on this  
list.) Here are a few ideas:

  * Put status information in documents and act as though the first  
time you publish them it will be the last time you publish them. There  
should be enough status information so that if somebody finds the  
document five years later, they understand the context in which it was  
created, where to look for the most up-to-date information, and whom  
to contact with questions.

   If a document becomes outdated, you can update the status  
information and leave the document on the site for historical  
purposes. In this case, provide a link so that people can find more  
up-to-date information.

  * You might be able to run software over your site and look for  
documents that have not changed in a long time AND that are not in a  
list of documents known to be ok even if they haven't changed.

  * Keeping information up to date seems to me to be largely a  
social process. Documents that are published but that are not part of  
any routine review are likely to become outdated. Documents that are  
reviewed regularly have a better chance of being kept up to date or of  
being explicitly marked as outdated.

  * Lastly, HTTP offers redirect mechanisms so that at the server  
level you can automatically redirect people from URIs that are no  
longer maintained to ones that are. Of course, this relies on a social  
process of knowing which URIs you want to redirect.
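
As a rough sketch of the second idea above (scanning for documents that have not changed in a long time and that are not on a known-ok list), something like the following Python might serve as a starting point. It assumes that the file modification time on disk is a fair proxy for "has not changed", which may not hold if files are re-deployed wholesale. The function name find_stale_files, its parameters, and the known_ok set are hypothetical names chosen for illustration, not part of any existing tool.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, known_ok=frozenset(), now=None):
    """List files under `root` not modified within `max_age_days`.

    Paths are returned relative to `root`, using forward slashes, and
    sorted. Any path listed in `known_ok` is skipped even if it is old.
    Assumption: modification time stands in for "last changed".
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    root = Path(root)
    stale = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if rel in known_ok:
            # Reviewed and accepted as fine despite its age.
            continue
        if path.stat().st_mtime < cutoff:
            stale.append(rel)
    return sorted(stale)
```

Calling find_stale_files with the document root, an age threshold such as 365 days, and a set of paths already reviewed would return candidates for a human to look at. It only reports candidates; deciding whether to update, mark as outdated, redirect, or remove each one remains part of the social process described above. It also does not check whether anything still links to a file.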

Hope that helps,

  - Ian

>
> Kind regards,
>
> Elizabeth Lloyd

--
Ian Jacobs (ij@w3.org)    http://www.w3.org/People/Jacobs/
Tel:                                      +1 718 260 9447
Received on Saturday, 21 August 2010 02:47:56 UTC