RE: Suggestion: 'rel="unrelated"' from Dan Brickley on 2005-01-22 (www-html@w3.org from January 2005)

From: Dan Brickley <danbri@w3.org>
Date: Sat, 22 Jan 2005 21:52:55 +0000
To: www-html@w3.org
Cc: dean@w3.org, shellen@google.com, kmarks@technorati.com, tantek@technorati.com
Message-ID: <20050122215254.GH9098@homer.w3.org>
I've been following the threads in 
http://lists.w3.org/Archives/Public/www-html/2005Jan/
about the 
http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
proposal, and the draft definition at 
http://developers.technorati.com/wiki/RelNoFollow

I share the various www-html misgivings about the name, but 
am otherwise quite optimistic. Mark's review comments 
suggest this is just for blog content, but I think 
that's not quite it. There are a number of scenarios (eg.
the automatic hyperlinking of URLs in emails archived
by W3C at lists.w3.org, or in Wikis) where untrusted or
semi-trusted hypertext content is published on the Web.
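
As an illustration of the mail-archive scenario, here is a minimal sketch 
of an autolinker that marks every generated link with rel="nofollow". 
The function name, the URL pattern and the trailing-punctuation rule are 
assumptions made for this example; this is not how lists.w3.org is 
actually implemented.

```python
import re
from html import escape

# Deliberately simple pattern (an assumption for this sketch): an http(s)
# scheme followed by anything up to whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")


def autolink(text):
    """Escape plain text as HTML and wrap each URL in a nofollow link."""
    parts = []
    last = 0
    for match in URL_PATTERN.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?)")
        end = match.start() + len(url)
        parts.append(escape(text[last:match.start()]))
        href = escape(url, quote=True)
        parts.append(f'<a href="{href}" rel="nofollow">{href}</a>')
        last = end
    parts.append(escape(text[last:]))
    return "".join(parts)
```

Because the archive software, not the message author, creates these 
links, marking all of them is the conservative choice in this sketch.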

While a normal hyperlink doesn't 'formally' assert that the 
linking document (or its author or hosting site) endorses
the referenced content, it is nevertheless true that 
in the normal course of civilised Web behaviour there 
are statistical trends that can be very usefully 
exploited. Most Web content doesn't contain random 
links to unrelated or horribly spammy sites. Rather, 
it tends to link to something that the document 
author(s) have considered pertinent to their own content.

I welcome the "nofollow" effort because it shows there 
is willingness from a broad group to work towards 
making the intent behind hyperlinks a little more 
evident to computers. The name, I think, does mislead.
The name is in the tradition of the robots.txt and 
<meta name="robots" content="noindex,nofollow">
construct (talking of which, is [1] the best 
existing documentation for deployed practice?).
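
For comparison, a robot might read the page-level construct roughly as 
follows. This is a sketch only: the class and function names are 
hypothetical, and it handles just the comma-separated content value 
shown above, not every variation found in deployed practice.

```python
from html.parser import HTMLParser


class RobotsMetaReader(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (bare attributes); treat as empty.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html_text):
    """Return the set of robots directives declared in the document."""
    reader = RobotsMetaReader()
    reader.feed(html_text)
    reader.close()
    return reader.directives
```

Note that these directives apply to the whole page, whereas a rel value 
applies to a single link.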

The "don't follow this link" sense seems overly 
imperative. I wonder if the name could be retained, 
but given a more passive, declarative reading. I don't 
believe we should get into the business of taxonomising
user-agent types (classic Web browsers, blog readers,
voice browsers, offline readers, search engines, directories,
feedster/technorati/etc...). If the intent were 
really that the link not be followed by certain kinds of 
agent, the spec ought really to say which kind. That 
seems a fruitless path, since it gets stuck in implementation
details instead of focussing on the core business: providing
all such agents with evidence that assists evaluation of 
the significance of the hyperlink and the chances that 
it references something dodgy. IMHO "nofollow" is a small 
part of a bigger story, but one that's worth getting 
spec'd up properly (eg. so something like it could be 
used in SVG, etc).

http://developers.technorati.com/wiki/RelNoFollow
has (currently; it's a wiki) the following text:

> nofollow
> Indicates that the referred resource was not necessarily linked to 
> by the author of the page, and thus said reference should not
> afford the referred resource any additional weight or ranking by user
> agents.

The name and definition seem disconnected somewhat, and the 
references to 'weight' and 'ranking' allude to popular knowledge 
of techniques such as those used by Google, without really defining
them. It also appeals to a notion of "author of page" that 
may well not be universally applicable (eg. who is "the author" of 
the page whose URI is [2]? Björn? webmaster@w3.org? nobody?).

I would like to see "nofollow" (or a renaming) couched purely 
in terms of a relationship between two documents, minimising
references to other entities such as author(s), 
publisher(s), search engines, user agents etc. 

Here's an experiment to that end, following something of 
the style used in http://www.w3.org/TR/html401/types.html#type-links

 nofollow: "Refers to a document whose contents do not 
 necessarily follow in any way from the topic or themes 
 of the current document. This type of relationship 
 is typically expressed within a document that includes 
 hypertext content whose origin is unknown or untrusted."

 (elaborating)
 Examples include user-supplied comments, feedback forms,
 aggregated content, weblog trackback excerpts, Wiki systems,
 HTML views of mail/news content, discussion boards,
 and Web-based email clients.

 The "nofollow" relationship is designed to provide a simple 
 construct that can be used when (re)publishing pieces of 
 hypertext by adding "nofollow" to a rel attribute. The 
 absence of such an attribute in no way implies an endorsement 
 of the linked document; "nofollow" simply provides one 
 very basic mechanism for representing skepticism about 
 referenced content. Richer metadata (RDF/XML, PICS etc)
 can be used in applications that need more than the basic 
 information provided by "nofollow".
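
To make the "(re)publishing" step concrete, here is a minimal sketch of a 
filter that adds "nofollow" to the rel attribute of every link in a piece 
of untrusted hypertext. It uses Python's standard html.parser module; the 
names NofollowRewriter and add_nofollow are hypothetical, and the sketch 
is not a sanitiser (it does nothing about scripts or other hostile 
markup).

```python
from html import escape
from html.parser import HTMLParser


class NofollowRewriter(HTMLParser):
    """Re-emit HTML, adding "nofollow" to rel on every <a href=...>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing=">"):
        attrs = list(attrs)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            # rel is a space-separated list, so keep any existing values.
            rel_tokens = []
            for name, value in attrs:
                if name == "rel" and value:
                    rel_tokens.extend(value.split())
            if "nofollow" not in (token.lower() for token in rel_tokens):
                rel_tokens.append("nofollow")
            attrs = [(n, v) for n, v in attrs if n != "rel"]
            attrs.append(("rel", " ".join(rel_tokens)))
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)
            else:
                # The parser decoded the value, so escape it again.
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def add_nofollow(html_fragment):
    """Return html_fragment with rel="nofollow" present on all links."""
    rewriter = NofollowRewriter()
    rewriter.feed(html_fragment)
    rewriter.close()
    return "".join(rewriter.out)
```

For example, add_nofollow('<a rel="external" href="/x">y</a>') yields 
'<a href="/x" rel="external nofollow">y</a>'. Appending to the rel list, 
rather than overwriting it, keeps any relationship the original markup 
already declared.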

Having attempted this, I'm not 100% convinced myself yet. I can see 
"nofollow" fever taking off in a way that could obscure the 
original motivating use cases. Eg. political blogs using it to 
cancel out link-karma to sites they critique; newspaper sites 
using it to avoid boosting or appearing to endorse blog 
articles, etc. I also like Mark Birbeck's suggestion that 
entire DIV'd or class'd sections of a page might be marked 
in this way, rather than focussing solely on hyperlinks.

In passing, http://developers.technorati.com/wiki/VoteLinks
is pretty interesting, although a much bigger undertaking 
than "nofollow"...

cheers,

Dan

[1] http://www.robotstxt.org/wc/meta-user.html
	The Robots META tag is a simple mechanism to 
	indicate to visiting Web Robots if a page 
	should be indexed, or links on the page should be
	followed.
[2] http://lists.w3.org/Archives/Public/www-html/2005Jan/0089.html
Received on Saturday, 22 January 2005 22:04:40 UTC