RE: search engines: right to be forgotten, sitemap.xml proposed solution from Rob van Eijk on 2012-12-11 (public-privacy@w3.org from October to December 2012)

From: Rob van Eijk <rob@blaeu.com>
Date: Tue, 11 Dec 2012 17:13:28 +0100
To: <public-privacy@w3.org>
Message-ID: <8a02ccb0bfd8373139658c736ce64c8e@xs4all.nl>
Hi Shane,

> Lastly, as search
> engines drive considerable site traffic, what would be the motivation
> for a site to voluntarily have its content age out from a search
> engine's index (no legal pressure here yet)?

I agree with Karl: there is a need for a *dedicated* protocol that 
would not freak out devops and sysadmins, that would be secure, and 
that would let any user, depending on their server, communicate 
what they want to hide from which clients.

To add to that, maybe a better way to phrase the topic at hand would 
be, 'the need for a dedicated protocol to not be found' instead of 'the 
right to be forgotten'.

Rob


Shane Wiley schreef op 2012-12-11 16:36:
> As we've not established a "right to be forgotten" at this time
> legally (the bounds and practical application) isn't it premature to
> be developing a solution before we have a clearer sense of "the
> problem"?  Premature solutionation aside, in the case of a Search
> Engine making it easier to find information already available on the
> public internet - are you starting at the right place to fix your
> perceived problem?  The source material would need to first be removed
> so subsequent search index crawls/scans do not pick up the article
> that you as an individual would like to be forgotten (leave aside the
> rights of others to have public information remain public).  This
> approach doesn't attempt to solve for the root problem (on top of a
> lack of legal guidance to determine what should and should not be the
> bounds, if any, of what should be "forgotten").  Lastly, as search
> engines drive considerable site traffic, what would be the motivation
> for a site to voluntarily have its content age out from a search
> engine's index (no legal pressure here yet)?
> 
> - Shane
> 
> -----Original Message-----
> From: Thomas Roessler [mailto:tlr@w3.org]
> Sent: Tuesday, December 11, 2012 8:09 AM
> To: rob@blaeu.com
> Cc: public-privacy@w3.org
> Subject: Re: search engines: right to be forgotten, sitemap.xml
> proposed solution
> 
> Rob,
> 
> I think it'd be useful to take a step back and say explicitly which
> particular instance of the right to be forgotten you're trying to
> implement here.
> 
> What are the requirements that you're trying to address?
> 
> Thanks,
> --
> Thomas Roessler, W3C <tlr@w3.org> (@roessler)
> 
> 
> 
> On 2012-12-11, at 15:17 +0100, Rob van Eijk <rob@blaeu.com> wrote:
> 
>> 
>> Dear all,
>> 
>> I am looking for feedback when it comes to the right to be forgotten 
>> in the domain of search engines. The challenge for the concept of the 
>> right to be forgotten is IMHO to add metadata to specific parts of the 
>> content on websites. Adding metadata with XML can be used to 
>> accomplish that goal. I would like to draw your attention to the 
>> Sitemap.XML file. It looks like:
>> 
>> <url>
>>      <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
>>      <lastmod>2005-01-01</lastmod>
>>      <changefreq>monthly</changefreq>
>>      <priority>0.8</priority>
>> </url>
>> 
>> The Sitemap.XML protocol can be extended when it comes to data 
>> retention, for instance by adding an expiration header <retention>30 
>> days</retention> or <retention>2013-12-31</retention>. This metadata 
>> can be tied to a specific URL, in the case of the example above 
>> <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>. 
>> The application of metadata is not limited to HTML pages, but can also 
>> be used for audio, pictures, video etc. Often, sitemaps are 
>> dynamically generated by the content management system. From a 
>> programmer's perspective it is not difficult to enhance the module that 
>> generates the sitemap.xml. If one wishes to add metadata to 
>> existing sitemap.xml files outside of a content management system, 
>> adding the metadata with a scheduled script is also not a difficult 
>> task for a programmer.
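
The scheduled-script idea above could look roughly like the following. This is only a minimal sketch under stated assumptions, not part of the proposal: it uses Python's standard xml.etree.ElementTree module, the function name add_retention and the URL-to-date mapping are hypothetical, and the namespace is simply copied from the example sitemap in this thread.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace taken from the example sitemap in this thread.
NS = "http://www.site.com/schemas/sitemap/"

# Sample input modelled on the sitemap example in this thread.
SAMPLE = f"""<urlset xmlns="{NS}">
<url>
     <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
     <lastmod>2005-01-01</lastmod>
     <changefreq>monthly</changefreq>
     <priority>0.8</priority>
</url>
</urlset>"""


def add_retention(sitemap_xml: str, retention_by_loc: dict[str, str]) -> str:
    """Return the sitemap with a <retention> child set on each listed URL."""
    # Keep the default namespace unprefixed when serializing.
    ET.register_namespace("", NS)
    root = ET.fromstring(sitemap_xml)
    tag = f"{{{NS}}}retention"
    for url in root.findall(f"{{{NS}}}url"):
        loc = url.findtext(f"{{{NS}}}loc", default="").strip()
        if loc not in retention_by_loc:
            continue  # no retention policy known for this URL
        node = url.find(tag)
        if node is None:
            node = ET.SubElement(url, tag)  # add the element if missing
        node.text = retention_by_loc[loc]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    policy = {
        "http://www.voorbeeld.nl/papers/right-to-be-forgotten.html": "2013-12-31",
    }
    print(add_retention(SAMPLE, policy))
```

Run from cron or a similar scheduler, such a script would rewrite an existing sitemap.xml in place; URLs without an entry in the mapping are left untouched.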
>> 
>> I think it is safe to say that a technical recommendation based on 
>> adding an expiration header to the sitemap.xml file makes sense and is 
>> useful.
>> 
>> Proposed text:
>> Adding an expiration header to the Sitemap may be an elegant way to 
>> handle data retention policies for individual data elements on a 
>> website. In order to make data retention enhanced Sitemap.XML files 
>> efficient, two stakeholders need to be on the same page:
>> * Webmasters MAY consider the use of Sitemap.XML to add expiration 
>> headers to the content they are offering. These headers are an 
>> indication of data retention periods for specific parts of a site and 
>> may include deep links to HTML-pages, but can also be used for audio, 
>> pictures, video.
>> * Search engines MUST honour expiration headers in Sitemap.XML files, 
>> and delete the search results accordingly. This includes the removal 
>> from any search cache.
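
On the search-engine side, honouring the expiration header could be sketched as below. Again this is a hedged illustration rather than a definitive implementation: the function name expired_urls is hypothetical, a missing <retention> element is assumed to mean "always", and the retention date is assumed to be the last day on which the URL may still be listed.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Assumption: namespace taken from the example sitemap in this thread.
NS = "http://www.site.com/schemas/sitemap/"

# Sample input modelled on the enhanced sitemap example in this thread.
SAMPLE = f"""<urlset xmlns="{NS}">
<url>
     <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
     <lastmod>2005-01-01</lastmod>
     <retention>2013-12-31</retention>
</url>
</urlset>"""


def expired_urls(sitemap_xml: str, today: date) -> list[str]:
    """List URLs whose retention date lies strictly before `today`."""
    root = ET.fromstring(sitemap_xml)
    expired = []
    for url in root.findall(f"{{{NS}}}url"):
        loc = url.findtext(f"{{{NS}}}loc", default="").strip()
        # Assumption: no <retention> element is treated as "always".
        retention = url.findtext(f"{{{NS}}}retention", default="always").strip()
        if retention == "always":
            continue
        # Plain ISO dates (YYYY-MM-DD) only; other formats raise ValueError.
        if date.fromisoformat(retention) < today:
            expired.append(loc)
    return expired


if __name__ == "__main__":
    for loc in expired_urls(SAMPLE, date.today()):
        print("remove from index and cache:", loc)
```

A crawler would then drop the returned URLs from both its index and its cache; how quickly that has to happen after the date passes is a policy question the sketch does not address.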
>> 
>> XML schema for the enhanced Sitemap protocol (Sitemap.xsd):
>> 
>> <xsd:simpleType name="tRetention">
>>  <xsd:annotation>
>>   <xsd:documentation>
>>    OPTIONAL: Indicates the data retention time of a particular URL.
>>    The value "always" should be used to describe content that should
>>    not be removed. A date or dateTime value should be used to indicate
>>    the date after which the content can be removed from search results
>>    and search cache. Please note that web crawlers may not necessarily
>>    crawl pages marked "always" more often.
>>   </xsd:documentation>
>>  </xsd:annotation>
>>  <xsd:union memberTypes="xsd:date xsd:dateTime">
>>   <xsd:simpleType>
>>    <xsd:restriction base="xsd:string">
>>     <xsd:enumeration value="always"/>
>>    </xsd:restriction>
>>   </xsd:simpleType>
>>  </xsd:union>
>> </xsd:simpleType>
>> 
>> Sitemap protocol format consisting of XML tags (Sitemap.xml):
>> 
>> <?xml version="1.0" encoding="UTF-8"?>
>> <urlset xmlns="http://www.site.com/schemas/sitemap/">
>> <url>
>>      <loc>http://www.voorbeeld.nl/papers/right-to-be-forgotten.html</loc>
>>      <lastmod>2005-01-01</lastmod>
>>      <changefreq>monthly</changefreq>
>>      <priority>0.8</priority>
>>      <retention>2013-12-31</retention>
>> </url>
>> </urlset>
>> 
>> Please let me know if this approach is of use. If so, I would like to 
>> learn where to address the problem in the standardization landscape: 
>> is there an IETF working group or a W3C working group?
>> 
>> Kind regards,
>> Rob
>> 
>> 
>>
Received on Tuesday, 11 December 2012 16:14:10 UTC