Re: Linked Data and Semantic Web CoolURIs, 303 redirects and Page Rank. from Mark Fallu on 2014-07-18 (public-lod@w3.org from July 2014)

From: Mark Fallu <m.fallu@griffith.edu.au>
Date: Sat, 19 Jul 2014 07:00:21 +1000
To: Paul Houle <ontology2@gmail.com>
Cc: Linked Data community <public-lod@w3.org>
Message-Id: <E3DE5224-036F-4DE0-A830-F06C4A6A0F7F@griffith.edu.au>
That is a fair point - but I would still suggest that it is important for search engines to be able to meaningfully interpret:
- internal links
- RDFa representations that span multiple pages.

Cheers,

Mark


> On 19 Jul 2014, at 3:02 am, Paul Houle <ontology2@gmail.com> wrote:
> 
> Frankly I don't care about PageRank,  and these days I don't know if
> Google does.  These days Google gets direct sampling of user behavior
> through Chrome and Google Analytics,  and this sort of data is
> probably much more valuable than the link graph since they know about
> things like time-on-page,  query chains,  and things like that.
> 
> If anything,  PageRank,  or what people imagine about PageRank has
> been harmful to the web because it's created a situation where people
> just don't make links to other web sites anymore.  It started with
> high profile sites (ex. engadget) that just wanted to be greedy and
> not give any PageRank to their competition.  Then you saw people using
> the NOFOLLOW attribute because they thought that this too was a way to
> be greedy.
> 
> Ten years ago I got a lot of emails from people that amounted to "I
> will pay you $X if you make a link on page Y to page Z with anchor
> text T".  You'd also find SEO firms that would ask for $X a month to
> generate Y links to your site.
> 
> Recently Google made some changes and they seem to be punishing people
> who have inappropriate links so now people get emails like "Would you
> please remove the link from page X to page Y" and the new thing is
> that SEO firms now want you to pay them $X to remove Y links to your
> site.
> 
> I think it is all a lot of bull and I make whatever links I like and
> figure that Google is going to do whatever it is they are going to do.
> 
>> On Fri, Jul 18, 2014 at 8:05 AM, Mark Fallu <m.fallu@griffith.edu.au> wrote:
>> I am attempting to understand how the CoolURI 303 redirect pattern for
>> the semantic web (http://www.w3.org/TR/cooluris/) can be implemented without
>> negative impact on search engines.
>> 
>> This pattern appears to allow site content to be indexed, but prevents page
>> rank from flowing through internal links due to the use of a 303 redirect.
>> 
>> For example in Griffith's Research-Hub: http://research-hub.griffith.edu.au
>> 
>> A get request to the URI of Howard Wiseman:
>> http://research-hub.griffith.edu.au/individual/n33a4e2d3057476efaff5ce1884564a8f
>> 
>> Will resolve to different urls based on content negotiation.
>> 
>> For RDF:
>> wget --header "Accept: application/rdf+xml"
>> http://research-hub.griffith.edu.au/individual/n33a4e2d3057476efaff5ce1884564a8f
>> 
>> results in a "303 see other" redirect to the RDF version of the entity:
>> http://research-hub.griffith.edu.au/rdf/n33a4e2d3057476efaff5ce1884564a8f/n33a4e2d3057476efaff5ce1884564a8f.rdf
>> 
>> For HTML:
>> wget --header "Accept: text/html"
>> http://research-hub.griffith.edu.au/individual/n33a4e2d3057476efaff5ce1884564a8f
>> results in a "303 see other" redirect to the HTML version of the entity (our
>> old friend the "display" version):
>> http://research-hub.griffith.edu.au/display/n33a4e2d3057476efaff5ce1884564a8f
>> 
>> Note: There will never be an HTML page at
>> http://research-hub.griffith.edu.au/individual/n33a4e2d3057476efaff5ce1884564a8f
>> just an HTTP response.
>> 
>> Links will be presented as the "individual" uri and then redirect to the
>> "display" url.
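The negotiation described above can be sketched in a few lines of Python. This is an illustration only, not the Research-Hub implementation: the two path templates are copied from the redirects quoted in the message, while the function names, the handling of q-values and wildcards, and the 406 fallback are assumptions made for the example.

```python
BASE = "http://research-hub.griffith.edu.au"

# Media type -> path template, following the two redirects quoted in the message.
TARGETS = {
    "application/rdf+xml": "/rdf/{id}/{id}.rdf",
    "text/html": "/display/{id}",
}


def parse_accept(header):
    """Return (media_type, q) pairs, highest q first; ties keep header order."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their original order.
    return sorted(prefs, key=lambda pref: -pref[1])


def redirect_for(local_id, accept="text/html"):
    """Return (status, location) for a GET on /individual/<local_id>."""
    for media_type, q in parse_accept(accept):
        if q <= 0:
            continue
        if media_type in ("*/*", "text/*"):
            media_type = "text/html"  # assumed default for wildcard requests
        template = TARGETS.get(media_type)
        if template:
            return 303, BASE + template.format(id=local_id)
    return 406, None  # assumed fallback when nothing acceptable is on offer


if __name__ == "__main__":
    wiseman = "n33a4e2d3057476efaff5ce1884564a8f"
    print(redirect_for(wiseman, "application/rdf+xml"))
    print(redirect_for(wiseman, "text/html"))
```

In both cases the "individual" URI itself never returns a document; it only answers with a 303 and a Location header, which is the behaviour the rest of the message is concerned with.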
>> 
>> All good so far - this is a perfectly functional example of the Cool URI
>> specification at work.  Unfortunately it results in a few issues in
>> practice.
>> 
>> If the links we present to the outside world for harvesting (e.g. via a SPARQL
>> endpoint, OAI-PMH or an OpenSocial widget) are the canonical "individual"
>> URIs, clients will be able to get to the "display" URL, but the Google page
>> rank that would normally flow from these external links will not.
>> 
>> The specification of a 303 redirect describes it as:
>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
>>> 
>>> "The response to the request can be found under a different URI and SHOULD
>>> be retrieved using a GET method on that resource. This method exists
>>> primarily to allow the output of a POST-activated script to redirect the
>>> user agent to a selected resource. The new URI is not a substitute reference
>>> for the originally requested resource. The 303 response MUST NOT be cached,
>>> but the response to the second (redirected) request might be cacheable.
>>> 
>>> 
>>> 
>>> The different URI SHOULD be given by the Location field in the response.
>>> Unless the request method was HEAD, the entity of the response SHOULD
>>> contain a short hypertext note with a hyperlink to the new URI(s)."
>> 
>> 
>> Google correctly implements the specification and does not assign the page
>> rank of the "individual" URI to the "display" URL as it is "not a substitute
>> reference for the originally requested resource".
>> 
>> The same is true of internal links: a home page with high page rank will not
>> pass page rank on to "display" URLs if the pathway to those URLs is via
>> "individual" URI links.
>> 
>> I am not sure what the solution is here as it seems the realms of SEO and
>> the conventions of the web they are built on are not a good fit for semantic
>> web best practice.
>> 
>> The most minimal compromise I can think of is to move away from the use of a
>> 303 redirect to a redirect that conserves the flow of google page rank.
>> 
>> - A "302 Found" redirect is the recommended replacement for 303 for clients
>>   that do not support HTTP 1.1, and it does allow a certain amount of Google
>>   page rank to flow.
>> - "301 Moved Permanently" is a poor fit for the Cool URI pattern, but passes
>>   on the full page rank of the links.
>> - Rewriting all URIs to the URL would also work, but would break the Cool URI
>>   pattern.
>> 
>> The pragmatist in me feels that if we are going to make a change for the
>> purposes of SEO, it might as well be the one with best return, i.e. 301
>> redirect.
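When comparing these options it can help to record the status code of every hop rather than only the final page. The following is a rough sketch using only the Python standard library. The names `fetch_once` and `trace_redirects`, the hop limit and the timeout are assumptions made for illustration; the sketch shows what the server sends and says nothing about how a search engine weighs it.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_once(url, accept):
    """Issue one GET without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"Accept": accept})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, accept="text/html", fetch=fetch_once, max_hops=5):
    """Return a list of (status, url) pairs, one per hop in the chain."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url, accept)
        hops.append((status, url))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops
```

Calling `trace_redirects` on an "individual" URI, once with `accept="text/html"` and once with `accept="application/rdf+xml"`, would list the redirect status and the URL it leads to for each representation. The `fetch` parameter is there so the function can be exercised against a stub without touching the network.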
>> 
>> Note: Indexing is not the problem here, content is indexed.  The issue
>> relates to page rank not flowing through a 303 redirect.
>> 
>> I have tested and can confirm that 303 redirects are an issue for a number
>> of reasons:
>> 
>> - Page rank does not flow through a 303 redirect.
>> - Page rank cannot be assigned from a URL to a URI with a rel=canonical tag
>>   if the URI does a 303 redirect (preventing aggregation of page rank from
>>   external links to the URL).
>> - The URI and URL are indexed separately.
>> - RDFa schema.org representations of URIs do not translate to URLs (i.e. a
>>   representation described at URL A, talking about URI B, does not get
>>   connected to the representation described at URL B).
>> - URL parameters are not passed by a 303 redirect.
>> - Google Analytics tracking is affected, e.g. traversing the site is seen as
>>   a series of direct page visits.
>> 
>> Essentially - as far as search engines are concerned - every URL and URI is
>> an island, with no connections between them.  At best a URL can express a
>> rel=canonical back to its corresponding URI, but no page rank will flow
>> through links.
>> 
>> 
>> Any guidance you can provide would be appreciated.
>> 
>> --
>> 
>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>> | Mark Fallu
>> | Manager, Research Data (Acting)
>> | Office for Research
>> | Bray Centre (N54) 0.10E
>> | Griffith University, Nathan Campus
>> | Queensland 4111 AUSTRALIA
>> |
>> | E-mail: m.fallu@griffith.edu.au
>> | Mobile:  04177 69778
>> | Phone:  +61 (07) 373 52069
>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> 
> 
> 
> -- 
> Paul Houle
> Expert on Freebase, DBpedia, Hadoop and RDF
> (607) 539 6254    paul.houle on Skype   ontology2@gmail.com
Received on Friday, 18 July 2014 21:00:50 UTC