Re: search engines: right to be forgotten, sitemap.xml proposed solution

Rob,

I think it'd be useful to take a step back and say explicitly which particular instance of the right to be forgotten you're trying to implement here.

What are the requirements that you're trying to address?

Thanks,
-- 
Thomas Roessler, W3C <tlr@w3.org> (@roessler)



On 2012-12-11, at 15:17 +0100, Rob van Eijk <rob@blaeu.com> wrote:

> 
> Dear all,
> 
> I am looking for feedback on the right to be forgotten in the domain of search engines. The challenge for the concept of the right to be forgotten is, IMHO, to add metadata to specific parts of the content on websites. XML metadata can be used to accomplish that goal. I would like to draw your attention to the Sitemap.XML file. An entry looks like this:
> 
> <url>
>      <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
>      <lastmod>2005-01-01</lastmod>
>      <changefreq>monthly</changefreq>
>      <priority>0.8</priority>
> </url>
> 
> The Sitemap.XML protocol can be extended to cover data retention, for instance by adding an expiration header such as <retention>30 days</retention> or <retention>2013-12-31</retention>. This metadata can be tied to a specific URL, in the case of the example above <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>. The application of metadata is not limited to HTML pages; it can also be used for audio, pictures, video, etc. Often, sitemaps are dynamically generated by the content management system. From a programmer's perspective it is not difficult to enhance the module that generates the sitemap.xml. If one wishes to add metadata to existing sitemap.xml files outside of a content management system, doing so with a scheduled script is not a difficult task for a programmer either.
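
A minimal sketch of the generator-side change described above, assuming Python and its standard xml.etree.ElementTree module. The function name add_retention and the retention_by_url mapping are hypothetical, and the <retention> element is the one proposed in this thread, not part of the published Sitemap protocol.

```python
import xml.etree.ElementTree as ET


def add_retention(sitemap_xml, retention_by_url):
    """Return sitemap_xml with a <retention> child set on each listed <url>.

    retention_by_url maps a <loc> value to a retention string such as
    "2013-12-31" or "always" (hypothetical input format for this sketch).
    """
    root = ET.fromstring(sitemap_xml)

    # Reuse whatever namespace the root element declares, if any.
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    if ns:
        # Serialize that namespace as the default one (no ns0: prefixes).
        ET.register_namespace("", ns[1:-1])

    for url in root.iter(f"{ns}url"):
        loc = (url.findtext(f"{ns}loc") or "").strip()
        value = retention_by_url.get(loc)
        if value is None:
            continue  # no retention policy known for this URL
        node = url.find(f"{ns}retention")
        if node is None:
            node = ET.SubElement(url, f"{ns}retention")
        node.text = value

    return ET.tostring(root, encoding="unicode")
```

A scheduled script could read an existing sitemap.xml, pass it through a function like this, and write the result back; a content management system would more likely emit the element directly while generating the file.
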
> 
> I think it is safe to say that a technical recommendation based on adding an expiration header to the sitemap.xml file makes sense and is useful.
> 
> Proposed text:
> Adding an expiration header to the Sitemap may be an elegant way to handle data retention policies for individual data elements on a website. For Sitemap.XML files enhanced with data retention to be effective, two stakeholders need to be on the same page:
> •	Webmasters MAY consider the use of Sitemap.XML to add expiration headers to the content they are offering. These headers indicate data retention periods for specific parts of a site; they may cover deep links to HTML pages, but can also be used for audio, pictures, and video.
> •	Search engines MUST honour expiration headers in Sitemap.XML files, and delete the search results accordingly. This includes the removal from any search cache.
> 
> XML schema for the enhanced Sitemap protocol (Sitemap.xsd):
> 
> <xsd:simpleType name="tRetention">
> <xsd:annotation>
>  <xsd:documentation>
>      OPTIONAL: Indicates the data retention time of a particular URL. The value "always" should be used to describe
>      content that should not be removed. A date or dateTime value (for example 2013-12-31) should be used to indicate the date after which the content can be removed from search results
>      and search caches. Please note that web crawlers may not necessarily crawl pages marked "always" more often.
>    </xsd:documentation>
> </xsd:annotation>
> <xsd:union memberTypes="xsd:date xsd:dateTime">
>  <xsd:simpleType>
>   <xsd:restriction base="xsd:string">
>    <xsd:enumeration value="always"/>
>   </xsd:restriction>
>  </xsd:simpleType>
> </xsd:union>
> </xsd:simpleType>
> 
> Sitemap protocol format consisting of XML tags (Sitemap.xml):
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <urlset xmlns="http://www.site.com/schemas/sitemap/">
> <url>
>      <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
>      <lastmod>2005-01-01</lastmod>
>      <changefreq>monthly</changefreq>
>      <priority>0.8</priority>
>      <retention>2013-12-31</retention>
> </url>
> </urlset>
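
How a search engine might honour such an entry can be sketched as follows, again assuming Python's standard library. The function name expired_urls is hypothetical, the comparison treats the retention date as the last day the content may be kept, and relative values such as "30 days" are skipped because the proposal does not say what they are measured from.

```python
import xml.etree.ElementTree as ET
from datetime import date


def expired_urls(sitemap_xml, today):
    """Return the <loc> values whose <retention> date lies before today."""
    root = ET.fromstring(sitemap_xml)
    # Reuse whatever namespace the root element declares, if any.
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""

    expired = []
    for url in root.iter(f"{ns}url"):
        loc = (url.findtext(f"{ns}loc") or "").strip()
        retention = (url.findtext(f"{ns}retention") or "").strip()
        if not loc or not retention or retention == "always":
            continue  # nothing to expire for this entry
        try:
            # Accept a plain date or the date part of a dateTime value.
            cutoff = date.fromisoformat(retention[:10])
        except ValueError:
            continue  # unrecognised value, e.g. "30 days": left alone here
        if today > cutoff:
            expired.append(loc)
    return expired
```

The returned URLs are the candidates for removal from both the search results and any search cache, as the proposed text requires.
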
> 
> Please let me know if this approach is of use. If so, I would like to learn where to address the problem in the standardization landscape: is there an IETF working group or a W3C working group?
> 
> Kind regards,
> Rob
> 
> 
> 

Received on Tuesday, 11 December 2012 15:08:59 UTC