RE: search engines: right to be forgotten, sitemap.xml proposed solution from Shane Wiley on 2012-12-11 (public-privacy@w3.org from October to December 2012)

From: Shane Wiley <wileys@yahoo-inc.com>
Date: Tue, 11 Dec 2012 15:36:58 +0000
To: Thomas Roessler <tlr@w3.org>, "rob@blaeu.com" <rob@blaeu.com>
CC: "public-privacy@w3.org" <public-privacy@w3.org>
Message-ID: <DCCF036E573F0142BD90964789F720E307075631@GQ1-EX10-MB03.y.corp.yahoo.com>
As we have not yet established a "right to be forgotten" legally (its bounds and practical application), isn't it premature to be developing a solution before we have a clearer sense of "the problem"?  Premature solutionation aside, in the case of a search engine making it easier to find information already available on the public internet - are you starting at the right place to fix your perceived problem?  The source material would first need to be removed so that subsequent search index crawls/scans do not pick up the article that you as an individual would like to be forgotten (leaving aside the rights of others to have public information remain public).  This approach doesn't attempt to solve the root problem (on top of a lack of legal guidance to determine the bounds, if any, of what should be "forgotten").  Lastly, as search engines drive considerable site traffic, what would be the motivation for a site to voluntarily have its content age out of a search engine's index (there is no legal pressure here yet)?

- Shane

-----Original Message-----
From: Thomas Roessler [mailto:tlr@w3.org] 
Sent: Tuesday, December 11, 2012 8:09 AM
To: rob@blaeu.com
Cc: public-privacy@w3.org
Subject: Re: search engines: right to be forgotten, sitemap.xml proposed solution

Rob,

I think it'd be useful to take a step back and say explicitly which particular instance of the right to be forgotten you're trying to implement here.

What are the requirements that you're trying to address?

Thanks,
-- 
Thomas Roessler, W3C <tlr@w3.org> (@roessler)



On 2012-12-11, at 15:17 +0100, Rob van Eijk <rob@blaeu.com> wrote:

> 
> Dear all,
> 
> I am looking for feedback on the right to be forgotten in the domain of search engines. The challenge for the concept of the right to be forgotten is IMHO to add metadata to specific parts of the content on websites. Adding metadata with XML can be used to accomplish that goal. I would like to draw your attention to the Sitemap.XML file. An entry looks like:
> 
> <url>
>      <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
>      <lastmod>2005-01-01</lastmod>
>      <changefreq>monthly</changefreq>
>      <priority>0.8</priority>
> </url>
> 
> The Sitemap.XML protocol can be extended when it comes to data retention, for instance by adding an expiration header such as <Retention>30 days</Retention> or <Retention>2013-12-31</Retention>. This metadata can be tied to a specific URL, in the case of the example above <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>. The application of metadata is not limited to HTML pages, but can also be used for audio, pictures, video etc. Often, sitemaps are dynamically generated by the content management system. From a programmer's perspective it is not difficult to enhance the module that generates the sitemap.xml. If one wishes to add metadata to existing sitemap.xml files outside of a content management system, adding the metadata with a scheduled script is also not a difficult task for a programmer.
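
A minimal sketch of the scripted approach described above, using only the Python standard library. The function name add_retention and the retention_by_loc mapping are hypothetical, and the namespace is the one written in the Sitemap.xml example later in this message; a deployed sitemap would declare its own. The <retention> element is the proposed extension, not part of any published Sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace as written in the example in this thread (an assumption for this sketch).
NS = "http://www.site.com/schemas/sitemap/"


def add_retention(sitemap_xml, retention_by_loc):
    """Return sitemap XML with a <retention> child set on each listed URL.

    retention_by_loc maps a <loc> value to the retention string to record,
    for example "2013-12-31" or "always".
    """
    # Serialize the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", NS)
    root = ET.fromstring(sitemap_xml)
    for url in root.findall(f"{{{NS}}}url"):
        loc = (url.findtext(f"{{{NS}}}loc") or "").strip()
        value = retention_by_loc.get(loc)
        if value is None:
            continue  # no retention requested for this URL
        node = url.find(f"{{{NS}}}retention")
        if node is None:
            node = ET.SubElement(url, f"{{{NS}}}retention")
        node.text = value
    return ET.tostring(root, encoding="unicode")
```

Run from a scheduled job, such a function would read the existing sitemap.xml, apply the mapping, and write the result back; URLs absent from the mapping are left untouched.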
> 
> I think it is safe to say that a technical recommendation based on adding an expiration header to the sitemap.xml file makes sense and is useful.
> 
> Proposed text:
> Adding an expiration header to the Sitemap may be an elegant way to handle data retention policies for individual data elements on a website. In order to make retention-enhanced Sitemap.XML files effective, two stakeholders need to be on the same page:
> *	Webmasters MAY consider the use of Sitemap.XML to add expiration headers to the content they are offering. These headers are an indication of data retention periods for specific parts of a site and may include deep links to HTML-pages, but can also be used for audio, pictures, video.
> *	Search engines MUST honour expiration headers in Sitemap.XML files, and delete the search results accordingly. This includes the removal from any search cache.
> 
> XML schema for the enhanced Sitemap protocol (Sitemap.xsd):
> 
> <xsd:simpleType name="tRetention">
> <xsd:annotation>
>  <xsd:documentation>
>      OPTIONAL: Indicates the data retention time of a particular URL. The value "always" should be used to describe
>      content that should not be removed. A date or dateTime value should be used to indicate the maximum date after which the content can be removed from search results
>      and search cache. Please note that web crawlers may not necessarily crawl pages marked "always" more often.
>    </xsd:documentation>
> </xsd:annotation>
> <!-- Union: either an actual date/dateTime value (e.g. 2013-12-31) or the literal "always" -->
> <xsd:union memberTypes="xsd:date xsd:dateTime">
>  <xsd:simpleType>
>   <xsd:restriction base="xsd:string">
>    <xsd:enumeration value="always"/>
>   </xsd:restriction>
>  </xsd:simpleType>
> </xsd:union>
> </xsd:simpleType>
> 
> Sitemap protocol format consisting of XML tags (Sitemap.xml):
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <urlset xmlns="http://www.site.com/schemas/sitemap/">
> <url>
>      <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
>      <lastmod>2005-01-01</lastmod>
>      <changefreq>monthly</changefreq>
>      <priority>0.8</priority>
>      <retention>2013-12-31</retention>
> </url>
> </urlset>
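
On the search engine side, honouring the retention value could be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not a description of how any search engine works: the function name expired_urls is hypothetical, the namespace is the one written in the example above, only plain YYYY-MM-DD dates and the value "always" are interpreted, and an entry is treated as expired only once the stated date has passed.

```python
import datetime as dt
import xml.etree.ElementTree as ET

# Namespace as written in the example above (an assumption for this sketch).
NS = {"sm": "http://www.site.com/schemas/sitemap/"}


def expired_urls(sitemap_xml, today):
    """Return the <loc> values whose <retention> date lies before `today`."""
    expired = []
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", namespaces=NS) or "").strip()
        retention = (url.findtext("sm:retention", namespaces=NS) or "").strip()
        if not retention or retention == "always":
            continue  # no limit declared: keep the entry
        try:
            keep_until = dt.date.fromisoformat(retention)
        except ValueError:
            continue  # unrecognised value (e.g. "30 days"): not handled here
        if today > keep_until:
            expired.append(loc)
    return expired
```

A crawler that re-reads the sitemap could pass the current date as `today` and drop the returned URLs from both its index and its cache. Relative values such as "30 days" would need an agreed reference point (for instance <lastmod>), which the proposal does not yet define.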
> 
> Please let me know if this approach is of use. If so, I would like to learn where to address the problem in the standardization landscape: is there an IETF working group or a W3C working group?
> 
> Kind regards,
> Rob
> 
> 
>
Received on Tuesday, 11 December 2012 15:37:47 UTC