
Re: search engines: right to be forgotten, sitemap.xml proposed solution

From: Karl Dubost <karld@opera.com>
Date: Wed, 12 Dec 2012 00:59:59 +0900
Message-Id: <46FE9F14-4EE5-433F-8E63-243F9BF5DE74@opera.com>
Cc: <public-privacy@w3.org>
To: <rob@blaeu.com>

On 12 Dec 2012, at 00:29, Rob van Eijk wrote:
> The functional requirement to not be indexed can already be accomplished with a robots.txt. However, for content indexed it is more difficult.

Yes and no; I have tested this directly.
There are a few things wrong with both robots.txt and sitemap.xml [1].

Background: I had very good search engine karma, and I decided to get out of search engines at a time when everyone else was trying to get more exposure. (I can explain why if anyone is interested.)

So, as a first step, I created a robots.txt containing:

User-agent: *
Disallow: /

I later modified it to allow a specific URI space, so that people could still find the things I had done for the public at large.

User-agent: *
Disallow: /
Allow: /w3c

This protects quite well against search engines: I basically disappeared from REGULAR search engine results.

A minor issue was solved with the help of people at Google: users who had my feed in Google Reader were feeding it into the search engine's database without robots.txt being checked. That is solved now.

But robots.txt has 4 major issues:

* Granularity does not scale for large sites (it is a single text file to parse).
* Control sits at the root, so it doesn't work for multi-user web sites.
* It **advertises** what it wants to hide.
* It works only with well-behaved bots.
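The third point is easy to demonstrate: anyone can fetch /robots.txt and read exactly which paths the site owner would rather hide. A toy sketch (the robots.txt content and the extra path are hypothetical):

```python
import re

# A hypothetical robots.txt, as served publicly at /robots.txt
ROBOTS = """\
User-agent: *
Disallow: /
Allow: /w3c
Disallow: /drafts
"""

# Every "hidden" path is readable by anyone who fetches the file:
hidden = re.findall(r"(?im)^Disallow:\s*(\S+)", ROBOTS)
print(hidden)  # ['/', '/drafts']
```

So robots.txt is a politeness convention, not an access control mechanism.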

Then you want to block bad bots, rogue bots, etc. You can, for example, add something like this to .htaccess (for Apache httpd):

# Bots
SetEnvIfNoCase User-Agent ".*Technorati.*" bad_bot
SetEnvIfNoCase User-Agent ".*whoisi.*" bad_bot
# added for abusive downloads
SetEnvIfNoCase User-Agent ".*DataAccess/1\.0.*" bad_bot
Order Allow,Deny
Deny from env=bad_bot
Allow from all

The benefit is that the user agent doesn't know why it is blocked; it is just blocked, and you do not have to reveal what it should not access. The big downsides: it is completely unfriendly for users, and it requires access privileges to the server-side configuration.
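The same technique can be sketched at the application level; for instance, a hypothetical filter doing the case-insensitive matching that SetEnvIfNoCase does (the pattern list mirrors the .htaccess example above):

```python
import re

# Hypothetical deny list, mirroring the .htaccess rules
BAD_BOT_PATTERNS = [r"Technorati", r"whoisi", r"DataAccess/1\.0"]

def is_bad_bot(user_agent: str) -> bool:
    """Case-insensitive substring match, like SetEnvIfNoCase."""
    return any(re.search(p, user_agent, re.IGNORECASE)
               for p in BAD_BOT_PATTERNS)

# A server would then answer 403 with no explanation at all:
status = 403 if is_bad_bot("Mozilla/5.0 (compatible; Technorati)") else 200
print(status)  # 403
```

As with the Apache version, the blocked client learns nothing about which paths exist or why it was refused.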

There is a need for a *dedicated* protocol which would not freak out devops and sysadmins, which would be secure, and which would make it possible for any user, whatever their server, to communicate what they want to hide from which clients.

[1]: http://www.w3.org/2008/09/msnws/papers/olivier-karl

Karl Dubost - http://dev.opera.com/
Developer Relations, Opera Software
Received on Tuesday, 11 December 2012 16:00:39 UTC
