search engines: right to be forgotten, sitemap.xml proposed solution

From: Rob van Eijk <rob@blaeu.com>
Date: Tue, 11 Dec 2012 15:17:30 +0100
To: <public-privacy@w3.org>
Message-ID: <c10e020b7baf5511b9f6d799812b5830@xs4all.nl>

Dear all,

I am looking for feedback on the right to be forgotten in the domain of 
search engines. The challenge for the right to be forgotten is, IMHO, to 
attach metadata to specific parts of the content on websites. XML 
metadata can accomplish that goal. I would like to draw your attention 
to the Sitemap.XML file. An entry looks like:

<url>
       <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
       <lastmod>2005-01-01</lastmod>
       <changefreq>monthly</changefreq>
       <priority>0.8</priority>
</url>

The Sitemap.XML protocol can be extended to cover data retention, for 
instance by adding an expiration header such as 
<retention>30 days</retention> or <retention>2013-12-31</retention>. 
This metadata is tied to a specific URL, in the case of the example above 
<loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>. 
The metadata is not limited to HTML pages; it can also be used for 
audio, pictures, video etc. Sitemaps are often generated dynamically by 
the content management system, and from a programmer's perspective it 
is not difficult to enhance the module that generates sitemap.xml. For 
existing sitemap.xml files maintained outside a content management 
system, adding the metadata with a scheduled script is not a difficult 
task for a programmer either.
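
As a rough illustration of the scheduled-script route, here is a minimal 
sketch in Python using only the standard library. The function name 
add_retention and the URL-to-retention mapping are hypothetical, and the 
namespace is simply the one used in the example further down in this 
message; a real sitemap may declare a different one.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace copied from the example sitemap in this message.
SITEMAP_NS = "http://www.site.com/schemas/sitemap/"


def add_retention(sitemap_xml, retention_by_url):
    """Return sitemap XML with a <retention> child set for each listed URL.

    retention_by_url maps a <loc> value to a retention value such as
    "2013-12-31" or "always". URLs not in the mapping are left untouched.
    """
    # Keep the sitemap namespace as the default namespace on output.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)
    retention_tag = "{%s}retention" % SITEMAP_NS

    for url in root.findall("{%s}url" % SITEMAP_NS):
        loc = url.find("{%s}loc" % SITEMAP_NS)
        if loc is None or loc.text is None:
            continue
        value = retention_by_url.get(loc.text.strip())
        if value is None:
            continue
        node = url.find(retention_tag)
        if node is None:
            # No retention element yet: append one to this <url> entry.
            node = ET.SubElement(url, retention_tag)
        node.text = value

    return ET.tostring(root, encoding="unicode")
```

A cron job could read sitemap.xml, call add_retention with a mapping 
maintained by the webmaster, and write the result back.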

I think it is safe to say that a technical recommendation based on 
adding an expiration header to the sitemap.xml file makes sense and is 
useful.

Proposed text:
Adding an expiration header to the Sitemap may be an elegant way to 
handle data retention policies for individual data elements on a 
website. To make retention-enhanced Sitemap.XML files effective, two 
stakeholders need to be on the same page:
•	Webmasters MAY consider the use of Sitemap.XML to add expiration 
headers to the content they are offering. These headers are an 
indication of data retention periods for specific parts of a site and 
may include deep links to HTML-pages, but can also be used for audio, 
pictures, video.
•	Search engines MUST honour expiration headers in Sitemap.XML files, 
and delete the search results accordingly. This includes the removal 
from any search cache.
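
To make the search engine side concrete, here is a minimal sketch in 
Python of how a crawler might decide which entries to drop. It is an 
illustration under assumptions, not a description of any existing 
search engine: the names is_expired and expired_urls are hypothetical, 
the namespace is the one from the example sitemap in this message, and 
only the values "always" and an ISO date (optionally followed by a 
time) are handled. Relative values such as "30 days" are left out 
because their reference point is not defined here.

```python
from datetime import date
import xml.etree.ElementTree as ET

# Assumption: namespace copied from the example sitemap in this message.
NS = {"sm": "http://www.site.com/schemas/sitemap/"}


def is_expired(retention, today):
    """Return True if the retention date lies strictly before today."""
    value = retention.strip()
    if value == "always":
        return False
    # Use the date part only, so "2013-12-31T00:00:00Z" also parses.
    # Any other value raises ValueError.
    return today > date.fromisoformat(value[:10])


def expired_urls(sitemap_xml, today):
    """List the <loc> values whose <retention> date has passed."""
    root = ET.fromstring(sitemap_xml)
    expired = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        retention = url.findtext("sm:retention", default=None, namespaces=NS)
        # Entries without a retention element are never treated as expired.
        if loc and retention is not None and is_expired(retention, today):
            expired.append(loc)
    return expired
```

The returned URLs are the ones a search engine would then remove from 
its results and its cache.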

XML schema for the enhanced Sitemap protocol (Sitemap.xsd):

<xsd:simpleType name="tRetention">
<xsd:annotation>
   <xsd:documentation>
       OPTIONAL: Indicates the data retention time of a particular URL.
       The value "always" should be used to describe content that should
       not be removed. A date or dateTime value (for example 2013-12-31)
       indicates the date after which the content can be removed from
       search results and the search cache. Please note that marking a
       URL "always" does not mean web crawlers will crawl it more often.
   </xsd:documentation>
</xsd:annotation>
<!-- Either a concrete date/dateTime or the literal "always" -->
<xsd:union memberTypes="xsd:date xsd:dateTime">
   <xsd:simpleType>
      <xsd:restriction base="xsd:string">
         <xsd:enumeration value="always"/>
      </xsd:restriction>
   </xsd:simpleType>
</xsd:union>
</xsd:simpleType>

Sitemap protocol format consisting of XML tags (Sitemap.xml):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.site.com/schemas/sitemap/">
<url>
       <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
       <lastmod>2005-01-01</lastmod>
       <changefreq>monthly</changefreq>
       <priority>0.8</priority>
       <retention>2013-12-31</retention>
</url>
</urlset>

Please let me know if this approach is of use. If so, I would like to 
learn where to address the problem in the standardization landscape: is 
there an IETF working group or a W3C working group?

Kind regards,
Rob
Received on Tuesday, 11 December 2012 14:18:06 GMT
