Re: search engines: right to be forgotten, sitemap.xml proposed solution from Rob van Eijk on 2012-12-11 (public-privacy@w3.org from October to December 2012)

From: Rob van Eijk <rob@blaeu.com>
Date: Tue, 11 Dec 2012 16:29:26 +0100
To: <public-privacy@w3.org>
Message-ID: <c0ee92d3eda8fa619c83bd6062487a8c@xs4all.nl>
Hi Thomas,

The functional requirement not to be indexed can already be 
accomplished with robots.txt. For content that has already been 
indexed, however, it is more difficult. Sitemap.xml has the 
functionality to indicate how often a robot should revisit (if I am 
correct). It would strengthen the protocol, however, if it were 
possible to indicate, on a granular level, the retention time of a 
specific content element. That is the functionality I am interested in.
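
To illustrate what honouring such a retention time could look like on the 
crawler side, here is a minimal Python sketch. It assumes the proposed 
(non-standard) <retention> element and the placeholder namespace from the 
example sitemap quoted below; the function name expired_urls and the second 
URL (about.html) are hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace copied from the example sitemap in this thread (a placeholder).
# <retention> is the proposed element, not part of any published protocol.
NS = {"sm": "http://www.site.com/schemas/sitemap/"}

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.site.com/schemas/sitemap/">
  <url>
    <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
    <lastmod>2005-01-01</lastmod>
    <retention>2013-12-31</retention>
  </url>
  <url>
    <loc>http://www.voorbeeld.nl/about.html</loc>
    <retention>always</retention>
  </url>
</urlset>
"""


def expired_urls(sitemap_xml, today):
    """Return the <loc> values whose retention date lies before `today`."""
    # Parse from bytes so the encoding declaration in the prolog is accepted.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    expired = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        retention = url.findtext("sm:retention", namespaces=NS)
        if loc is None or retention is None or retention.strip() == "always":
            continue  # no retention limit stated for this URL
        # Only plain YYYY-MM-DD dates are handled in this sketch.
        if date.fromisoformat(retention.strip()) < today:
            expired.append(loc.strip())
    return expired
```

A search engine could run a check like this on every sitemap fetch and queue 
the returned URLs for removal from both the index and the cache.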

I would like to accomplish two things: first, on the webmaster side, 
adding metadata to the content that signals the data subject’s wishes 
to the outside world (e.g. an expiration date, do-not-index, etc.), and 
second, extending the functionality of existing protocols in order to 
implement more standardized data access rules for external parties 
(search engines first and foremost).
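
For the webmaster side, a scheduled script along the following lines might be 
enough. This is only a sketch in Python: the function name add_retention and 
the mapping argument are hypothetical, and the namespace is the placeholder 
from the example sitemap quoted below.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace taken from the example sitemap in this thread.
SITEMAP_NS = "http://www.site.com/schemas/sitemap/"


def add_retention(sitemap_xml, retention_by_loc):
    """Return sitemap XML with a <retention> child set on each listed URL.

    `retention_by_loc` maps a <loc> value to a retention value such as
    "2013-12-31" or "always" (the proposed, non-standard element).
    """
    # Serialize with a default xmlns instead of an ns0: prefix.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    tag = f"{{{SITEMAP_NS}}}retention"
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        if loc not in retention_by_loc:
            continue  # leave URLs without a stated retention untouched
        node = url.find(tag)
        if node is None:
            node = ET.SubElement(url, tag)  # add the element if missing
        node.text = retention_by_loc[loc]   # otherwise overwrite the value
    return ET.tostring(root, encoding="unicode")
```

Running it twice on the same file does not duplicate the element; the 
existing value is simply updated.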


Rob

Thomas Roessler wrote on 2012-12-11 16:08:
> Rob,
> 
> I think it'd be useful to take a step back and say explicitly which
> particular instance of the right to be forgotten you're trying to
> implement here.
> 
> What are the requirements that you're trying to address?
> 
> Thanks,
> --
> Thomas Roessler, W3C <tlr@w3.org> (@roessler)
> 
> 
> 
> On 2012-12-11, at 15:17 +0100, Rob van Eijk <rob@blaeu.com> wrote:
> 
>> 
>> Dear all,
>> 
>> I am looking for feedback on the right to be forgotten in the domain 
>> of search engines. The challenge for the concept of the right to be 
>> forgotten is, IMHO, to add metadata to specific parts of the content 
>> on websites. Adding metadata with XML can accomplish that goal. I 
>> would like to draw your attention to the Sitemap.XML file. An entry 
>> looks like:
>> 
>> <url>
>>      <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
>>      <lastmod>2005-01-01</lastmod>
>>      <changefreq>monthly</changefreq>
>>      <priority>0.8</priority>
>> </url>
>> 
>> The Sitemap.XML protocol can be extended to cover data retention, 
>> for instance by adding an expiration header such as 
>> <retention>30 days</retention> or <retention>2013-12-31</retention>. 
>> This metadata can be tied to a specific URL, in the case of the 
>> example above 
>> <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>. 
>> The application of metadata is not limited to HTML pages; it can also 
>> be used for audio, pictures, video, etc. Often, sitemaps are 
>> dynamically generated by the content management system. From a 
>> programmer's perspective it is not difficult to enhance the module 
>> that generates the sitemap.xml. If one wishes to add metadata to 
>> existing sitemap.xml files outside of a content management system, 
>> doing so with a scheduled script is not a difficult task for a 
>> programmer either.
>> 
>> I think it is safe to say that a technical recommendation based on 
>> adding an expiration header to the sitemap.xml file makes sense and is 
>> useful.
>> 
>> Proposed text:
>> Adding an expiration header to the Sitemap may be an elegant way to 
>> handle data retention policies for individual data elements on a 
>> website. For retention-enhanced Sitemap.XML files to be effective, 
>> two stakeholders need to be on the same page:
>> • Webmasters MAY consider the use of Sitemap.XML to add expiration 
>> headers to the content they are offering. These headers are an 
>> indication of data retention periods for specific parts of a site and 
>> may include deep links to HTML-pages, but can also be used for audio, 
>> pictures, video.
>> • Search engines MUST honour expiration headers in Sitemap.XML files, 
>> and delete the search results accordingly. This includes the removal 
>> from any search cache.
>> 
>> XML schema for the enhanced Sitemap protocol (Sitemap.xsd):
>> 
>> <xsd:simpleType name="tRetention">
>>   <xsd:annotation>
>>     <xsd:documentation>
>>       OPTIONAL: Indicates the data retention time of a particular URL.
>>       The value "always" should be used to describe content that should
>>       not be removed. A date or dateTime value should be used to indicate
>>       the date after which the content can be removed from search
>>       results and the search cache. Please note that web crawlers may
>>       not necessarily crawl pages marked "always" more often.
>>     </xsd:documentation>
>>   </xsd:annotation>
>>   <!-- Either a date/dateTime value or the literal string "always". -->
>>   <xsd:union memberTypes="xsd:date xsd:dateTime">
>>     <xsd:simpleType>
>>       <xsd:restriction base="xsd:string">
>>         <xsd:enumeration value="always"/>
>>       </xsd:restriction>
>>     </xsd:simpleType>
>>   </xsd:union>
>> </xsd:simpleType>
>> 
>> Sitemap protocol format consisting of XML tags (Sitemap.xml):
>> 
>> <?xml version="1.0" encoding="UTF-8"?>
>> <urlset xmlns="http://www.site.com/schemas/sitemap/">
>> <url>
>>      <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
>>      <lastmod>2005-01-01</lastmod>
>>      <changefreq>monthly</changefreq>
>>      <priority>0.8</priority>
>>      <retention>2013-12-31</retention>
>> </url>
>> </urlset>
>> 
>> Please let me know if this approach is of use. If so, I would like to 
>> learn where to address the problem in the standardization landscape: 
>> is there an IETF working group or a W3C working group?
>> 
>> Kind regards,
>> Rob
>> 
>> 
>>
Received on Tuesday, 11 December 2012 15:30:00 UTC