
Social network and opacity Re: Robots.txt and Information aggregation

From: Karl Dubost <karl@w3.org>
Date: Fri, 25 Jul 2008 05:44:54 +0900
Cc: www-archive <www-archive@w3.org>
Message-Id: <13EF23CC-D2EC-4440-9878-065D1C9E281D@w3.org>
To: Christopher Blizzard <blizzard@whoisi.com>
Hi Chris,

Thanks for answering, and many thanks for allowing a public debate
(copied to www-archive@w3.org with Chris's authorization).

The consequences go beyond whoisi alone, which is why I wanted this to
be public.

On 24 Jul 2008, at 00:44, Christopher Blizzard wrote:

> Hi, Karl!  I would love some of your thoughts on some of the things  
> that I mention below.
>
> On Jul 22, 2008, at 9:35 PM, Karl Dubost wrote:
>
>> Hi Christophe,
>>
>> I have noticed this morning that you were aggregating pieces of my  
>> personae under the name Karl Dubost conflating different  
>> personalities: Professional and Personal
>> 	http://whoisi.com/p/3168


OK. Note that people make the connections because the site makes it
possible to connect them. The devil is indeed out of the box. And many
people will do it simply because they can, or because they don't
realize what they are doing. On the other hand, whoever sets up a
system that enables this carries more responsibility.

> Yeah, I've seen some people who have problems with that.  I'm not  
> entirely sure what to do about that given that it's all user-driven  
> data, it's not aggregated by robots or programs.  Part of the thing  
> about whoisi is it makes those disparate connections possible.

Opacity is the property of a medium that limits how much light gets
through (more precisely, it is inversely related to the mean distance
that a photon travels between two interactions with the medium).

Opacity on the network is greatly reduced because time and space have
been so compressed. This has benefits, and big drawbacks for people's
privacy. When I'm walking in a city, I'm in a public space. The local
people who see me, and sometimes recognize me, might indeed propagate
information about me. But in doing so they will give a partial
rendering, they will forget after a few days, and it will take time
for the information to travel between individuals. Opacity maintains
the social glue.

A system where everything you say or express is automatically
reproduced identically (copied to different places), kept (search
engines), and transmitted quickly (Internet) has strong consequences
for individuals, and not all of them are good.

When I speak in a cafe with a friend, someone might overhear me, but I
don't have to protect myself. On the network, these days, I have to be
careful and pay close attention to the level of access I give to my
information. It deeply changes the way I have to deal with my casual
information.


>> I'm really careful about this and I want opt-in systems not opt-out.
>> I have removed them for now. I'm pretty sure someone will add them  
>> again. I wish not, but we will see.
>>
>> But there is one thing which seems to be really bogus in your  
>> system. One of the feeds you were aggregating is
>> 	http://www.la-grange.net/feed.rdf
>>
>> I encourage you to see
>> 	http://www.la-grange.net/robots.txt
>>
>> It is explicit
>> 	User-agent: *
>> 	Disallow: /
>>
>> Please fix your RSS reader so that it respects the robots protocol.
>> 	http://www.robotstxt.org/
>
> This is an honest question: What is your expectation about how RSS
> readers deal with the robots.txt file?

Here I make a distinction between a human and a Web site, and it makes
a big difference. A person who reads my Web site through an RSS reader
has made a decision to do so. My content being aggregated by an engine
which is not under the direct control of an individual is a no-no: the
engine becomes a bot. In exactly the same way, I distinguish between a
browser (individual control and choice) and a search engine bot
(anonymous and collective).

My expectation is that it depends on the type of reader, how it
redistributes the content, to whom, etc. For now, you can't tell user
agents on the network how the content you have created should be
reproduced.
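
To make that expectation concrete for a shared, unattended aggregator,
here is a minimal Python sketch of checking a site's robots.txt before
fetching its feed. It uses the standard library's urllib.robotparser.
The user-agent name "example-aggregator" and the helper may_fetch are
made up for illustration, and the rules are passed in as text so that
the sketch runs without network access. It is one possible approach,
not a description of how whoisi or any particular reader works.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted earlier in this thread for www.la-grange.net.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    feed = "http://www.la-grange.net/feed.rdf"
    # "example-aggregator" is a hypothetical user-agent name.
    if may_fetch(ROBOTS_TXT, "example-aggregator", feed):
        print("robots.txt allows fetching", feed)
    else:
        print("robots.txt disallows fetching", feed)
```

A real aggregator would download /robots.txt from the feed's host
first and cache the result. The point is only that the check happens
before any automatic fetch.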

>  For example, google reader happily adds your site as an rss feed  
> and clearly has been aggregating data on it for a while. (It has  
> history much longer than what's in your RSS feed, for example.)

Hmm, I'll have to check, because I thought I had blocked them. The
reason was not that I had anything against Google's RSS reader itself,
but that, despite my robots.txt, the reader was feeding the Google
Search database with the titles and links, bypassing the robots.txt.

> whoisi is essentially a big shared rss reader.  Do you think that  
> the rules for whoisi should be different than for something like  
> google reader?

Human versus machine. Yes.

> The opening page for robotstxt.org contains this phrase:
>
> "Web Robots (also known as Web Wanderers, Crawlers, or Spiders), are  
> programs that traverse the Web automatically. Search engines such as  
> Google use them to index the web content, spammers use them to scan  
> for email addresses, and they have many other uses."

*automatically* and not the individual choice of someone.

> whoisi does no wandering, has no crawlers or spiders.  Everything  
> that is done is driven by user interaction.  It's driven by humans,  
> not robots. :)

Yes. Basically you are demonstrating the effect of mobs. A flash mob
can be used for fun, for the benefit of a "good" project, or for
tracking people down with nasty effects, with each individual thinking
that they are doing no harm.

> I'd love to have a way to mark things as "don't aggregate this RSS  
> with other entries" but robots.txt doesn't seem like quite the right  
> tool.  It's very brute force and given the robots.txt that are on  
> sites like twitter.com, where I do pull a lot of data, it would keep  
> whoisi from pulling information from them.  It doesn't seem like the  
> right tool for that kind of job.  It's aimed at spiders, not rss  
> readers.

Maybe it could be something in the feed itself: an element in Atom,
RSS 2.0 and RSS 1.0 which states that automatic aggregation is not
accepted. That's an interesting topic. I guess I will discuss it with
participants at the iCommons Summit in Hokkaido next week.

Though I have issues with attaching a specific statement to your
content, be it an RSS feed, HTML, etc.
By making a statement against some type of aggregation (asking for
more opacity), you make yourself more visible (recently, the Boring
couple and their house in Google Maps Street View). You could say, for
example, do not aggregate my content based on geographical filtering.
Or do not aggregate it if you intend to make commercial use of it.
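
A feed-level marker of the kind suggested above might look like the
following sketch. Nothing here is standardized: the namespace
http://example.org/ns/aggregation#, the policy element and its
automatic attribute are all hypothetical, invented only to show how an
aggregator could read such a statement from an Atom feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and element; no such vocabulary is standardized.
POLICY_NS = "http://example.org/ns/aggregation#"

SAMPLE_FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:agg="http://example.org/ns/aggregation#">
  <title>Example feed</title>
  <agg:policy automatic="deny"/>
</feed>
"""


def automatic_aggregation_allowed(feed_xml: str,
                                  default_allowed: bool = False) -> bool:
    """Read the hypothetical feed-level policy element.

    Returns True only when the element says automatic="allow".
    When the element is absent, returns default_allowed; the default
    of False models an opt-in system.
    """
    root = ET.fromstring(feed_xml)
    # Look for the marker among the direct children of the feed root.
    policy = root.find(f"{{{POLICY_NS}}}policy")
    if policy is None:
        return default_allowed
    return policy.get("automatic") == "allow"


if __name__ == "__main__":
    print(automatic_aggregation_allowed(SAMPLE_FEED))
```

With default_allowed left at False, a feed that says nothing is not
aggregated, which matches an opt-in model. Passing True instead gives
today's opt-out behavior.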

> I'm asking people for suggestions on what might work in terms of how  
> to avoid aggregating those kinds of things, to try and protect and  
> enhance privacy where I can, but the tools aren't quite there.  What  
> do you suggest?


My personal opinion on aggregation, indexing, etc. is to give the
power back to people. Every aggregation should be opt-in, not opt-out.
Opt-out systems are far too complex for most people.


Not many people can add this kind of information to a .htaccess file
on their Web site:

# Flag unwanted aggregators and bots by their User-Agent header.
# SetEnvIfNoCase matches an unanchored, case-insensitive regex, so a
# plain substring is enough; literal dots are escaped.
SetEnvIfNoCase User-Agent "Technorati" bad_bot
SetEnvIfNoCase User-Agent "Microsoft Office" bad_bot
SetEnvIfNoCase User-Agent "QihooBot" bad_bot
SetEnvIfNoCase User-Agent "CazoodleBot" bad_bot
SetEnvIfNoCase User-Agent "Acoon-Robot" bad_bot
SetEnvIfNoCase User-Agent "Gigamega" bad_bot
SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
SetEnvIfNoCase User-Agent "yacybot" bad_bot
SetEnvIfNoCase User-Agent "Moreoverbot" bad_bot
SetEnvIfNoCase User-Agent "Tailrank" bad_bot
SetEnvIfNoCase User-Agent "WikioFeedBot" bad_bot
SetEnvIfNoCase User-Agent "NIF/1\.1" bad_bot
SetEnvIfNoCase User-Agent "SnapPreviewBot" bad_bot
SetEnvIfNoCase User-Agent "Feedfetcher-Google" bad_bot
SetEnvIfNoCase User-Agent "SPIP-1\.8\.2" bad_bot
SetEnvIfNoCase User-Agent "whoisi" bad_bot
# Allow everyone except the requests flagged above.
Order Allow,Deny
Allow from all
Deny from env=bad_bot


Thanks for starting the discussion.


Other references for this discussion

Mitchell Baker has recently published "Why focus on data?"
http://blog.lizardwrangler.com/2008/07/22/why-focus-on-data/

There is also the text from Daniel Weitzner "Reciprocal Privacy (ReP)  
for the Social Web"
http://dig.csail.mit.edu/2007/12/rep.html


-- 
Karl Dubost - W3C
http://www.w3.org/QA/
Be Strict To Be Cool
Received on Thursday, 24 July 2008 20:45:57 GMT
