Re: Linked Data and Semantic Web CoolURIs, 303 redirects and Page Rank. from Michael Smethurst on 2014-07-23 (public-lod@w3.org from July 2014)

From: Michael Smethurst <michael.smethurst@bbc.co.uk>
Date: Wed, 23 Jul 2014 12:52:24 +0000
To: Michael Brunnbauer <brunni@netestate.de>, Mark Fallu <m.fallu@griffith.edu.au>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <CFF56852.5DCC8%michael.smethurst@bbc.co.uk>
Hello

(Pretty sure I've made this comment before so please forgive any signs of
premature senility)

I think this may be an unfortunate side effect of the conflation of the
303 ("I can't send that") pattern with the content negotiation ("what
flavour would you like") pattern

Lots of linked data applications (like dbpedia) seem to couple the two
things together. So you have an "individual" uri which, when you attempt to
dereference it, does a 303 *and* conneg in one step to the "display" uri:
/resource > 303+conneg > /data
or
/resource > 303+conneg > /page
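
As a rough illustration of that coupled pattern, here is a minimal Python
sketch. The /resource, /data and /page prefixes follow the paths above; the
function name, the naive substring check on the Accept header and the choice
of RDF media types are assumptions for illustration only, not a description
of how dbpedia is actually implemented.

```python
def coupled_redirect(path, accept):
    """Answer a request for an "individual" uri with a single 303 whose
    target depends on the Accept header (303 and conneg in one step)."""
    prefix = "/resource/"
    if not path.startswith(prefix):
        return 404, {}
    name = path[len(prefix):]
    # Naive negotiation (assumption): any RDF media type means "data",
    # everything else gets the HTML "page".
    wants_rdf = "application/rdf+xml" in accept or "text/turtle" in accept
    target = ("/data/" if wants_rdf else "/page/") + name
    # One response says both "I can't send that" and "here is your flavour".
    return 303, {"Location": target, "Vary": "Accept"}
```

Every dereference of the "individual" uri pays for a redirect, and the
redirect target itself varies by Accept header.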


Many other linked data sites seem to have followed this pattern but it
does seem, to my eyes, broken

At the BBC we have 3 flavours of uri. I'm not sure if these are the
appropriate / best labels but:
- the non-information resource uri. The uri that refers to the real world
physical / metaphysical thing
- the generic information resource uri that identifies the document but
not any specific representation of the document
- the representation uri (the html or json or rdf-xml etc)

We tend to use hashes rather than slashes like
http://www.bbc.co.uk/programmes/b006mw1h#programme


But pretending we use slashes for a minute...

If you requested:
http://www.bbc.co.uk/programmes/b006mw1h/thing


You'd get a 303 redirect to the generic document / information resource
uri:
http://www.bbc.co.uk/programmes/b006mw1h


Which would then conneg to the appropriate representation which would
still be served from:
http://www.bbc.co.uk/programmes/b006mw1h

With a content location header of
http://www.bbc.co.uk/programmes/b006mw1h.rdf

For example
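
A minimal Python sketch of the two separate steps described above, under
stated assumptions: the /thing suffix is the pretend slash uri from this
mail, and the .rdf suffix is the one shown above. The .html and .json
suffixes, the function names and the naive substring matching on the Accept
header are hypothetical, and this is not the actual BBC implementation.

```python
# Media types mapped to representation suffixes. Only ".rdf" appears in the
# text above; ".html" and ".json" are assumed for illustration.
REPRESENTATIONS = {
    "text/html": ".html",
    "application/json": ".json",
    "application/rdf+xml": ".rdf",
}


def negotiate(accept):
    """Pick a media type with a naive substring match (no q-values)."""
    for media_type in REPRESENTATIONS:
        if media_type in accept:
            return media_type
    if not accept or "*/*" in accept:
        return "text/html"
    return None


def respond(path, accept="text/html"):
    # Step 1: the non-information resource uri only ever 303s to the
    # generic document uri. No content negotiation happens here.
    if path.endswith("/thing"):
        return 303, {"Location": path[: -len("/thing")]}
    # Step 2: the generic document uri negotiates and serves the
    # representation directly, naming it in Content-Location.
    media_type = negotiate(accept)
    if media_type is None:
        return 406, {}
    return 200, {
        "Content-Type": media_type,
        "Content-Location": path + REPRESENTATIONS[media_type],
        "Vary": "Accept",
    }
```

With this split, respond("/programmes/b006mw1h", "application/rdf+xml")
answers 200 straight from the generic uri, and only a request for the
/thing uri costs a redirect.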

Whilst the rdf refers to the non-information resource uri when making
assertions about the "thing", this uri is not used elsewhere. All links in
the html point to the generic document uri, not to the non-information
resource uri

So crawlers like google just follow links from information resource to
information resource and never have to encounter 303s
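
To make the crawler point concrete, here is a toy Python sketch that counts
how many 303 responses a link-following crawler would meet. The link graphs,
the /thing suffix convention and the function name are all hypothetical; a
real crawler behaves in far more complicated ways.

```python
def count_303s(start, links, thing_suffix="/thing"):
    """Crawl from `start` and return how many 303 responses are hit.

    `links` maps a document uri to the uris its html links to. Any uri
    ending in `thing_suffix` is treated as a non-information resource
    that 303s to the uri without the suffix.
    """
    seen, queue, redirects = set(), [start], 0
    while queue:
        uri = queue.pop()
        if uri.endswith(thing_suffix):
            redirects += 1  # the crawler pays for one 303 here
            uri = uri[: -len(thing_suffix)]
        if uri in seen:
            continue
        seen.add(uri)
        queue.extend(links.get(uri, []))
    return redirects


# html that links document to document: no 303s at all
doc_links = {"/a": ["/b", "/c"], "/b": ["/a"], "/c": []}
# html that links to the "thing" uris: one 303 per link followed
thing_links = {"/a": ["/b/thing", "/c/thing"], "/b": ["/a/thing"], "/c": []}
```

In this toy setup count_303s("/a", doc_links) is 0, while the same site
linked through "thing" uris costs one redirect for every link followed.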

Picking up a conneg penalty for every request isn't without problems
(particularly given CDN serving) but picking up a 303 penalty for every
request would be madness and not something we'd ever have been able to
implement

I do think the dbpedia conflation of 303 with conneg is an unhelpful
anti-pattern that people shouldn't be encouraged to follow. The conneg
part is just REST; "semantics" add the 303 onto that but they're not doing
the same thing

Separating 303 from conneg still gives you "thing" vs document separation,
still maintains cool uris and doesn't kill your servers

And we've never had a problem with seo

Hth
michael




On 18/07/2014 16:52, "Michael Brunnbauer" <brunni@netestate.de> wrote:

>
>Hello Mark,
>
>I cannot remember this important topic coming up earlier - which is a bit
>disturbing.
>
>The problem would be mitigated by people using the URI they see for
>linking.
>
>Why not use the HTML URLs in the HTML pages for internal page rank flow?
>
>How can URIs from sparql endpoints or OAI-PMH contribute to page rank?
>
>A real problem would be RDFa where href also sets the object of a triple.
>
>Regards,
>
>Michael Brunnbauer
>
>On Fri, Jul 18, 2014 at 10:05:17PM +1000, Mark Fallu wrote:
>> If the links we present to the outside world for harvesting eg. via sparql
>> endpoint, OAI-PMH or open social widget etc are the canonical "individual"
>> URI, clients will be able to get to the "display" url, but the google page
>> rank that would normally flow from these external links will not.
>
>
>
>> 
>> The specification of a 303 redirect describes it as:
>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
>> 
>> > "The response to the request can be found under a different URI and SHOULD
>> > be retrieved using a GET method on that resource. This method exists
>> > primarily to allow the output of a POST-activated script to redirect the
>> > user agent to a selected resource. *The new URI is not a substitute
>> > reference for the originally requested resource*. The 303 response MUST
>> > NOT be cached, but the response to the second (redirected) request might be
>> > cacheable.
>> >
>> > The different URI SHOULD be given by the Location field in the response.
>> > Unless the request method was HEAD, the entity of the response SHOULD
>> > contain a short hypertext note with a hyperlink to the new URI(s)."
>> 
>> 
>> 
>> Google correctly implements the specification and does not assign the page
>> rank of the "individual" URI to the "display" URL as it is "*not a
>> substitute reference for the originally requested resource*".
>> 
>> The same is true of internal links: a high page rank home page will not
>> pass page rank on to "display" urls if the pathway to those urls is via
>> "individual" uri links.
>> 
>> I am not sure what the solution is here as it seems the realms of SEO and
>> the conventions of the web they are built on are not a good fit for
>> semantic web best practice.
>> 
>> The most minimal compromise I can think of is to move away from the use of
>> a 303 redirect to a redirect that conserves the flow of google page rank.
>> 
>>    - "302 Found" redirect is the recommended replacement for 303 for
>>    clients that do not support HTTP 1.1 and it does allow a certain amount of
>>    google page rank to flow.
>>    - "301 Moved Permanently" is a poor fit for the Cool URI pattern, but
>>    passes on the full page rank of the links.
>>    - rewriting all URIs to the URL would also work, but would break the
>>    coolURI pattern.
>> 
>> The pragmatist in me feels that if we are going to make a change for the
>> purposes of SEO, it might as well be the one with best return, i.e. 301
>> redirect.
>> 
>> Note: Indexing is not the problem here, content is indexed.  The issue
>> relates to page rank not flowing through a 303 redirect.
>> 
>> I have tested and can confirm that 303 redirects are an issue for a number
>> of reasons:
>> 
>>    - page rank does not flow through a 303 redirect
>>    - page rank can not be assigned from a url to a uri with a rel=canonical
>>    tag if URI does a 303 redirect (preventing aggregation of pagerank from
>>    external links to URL)
>>    - URI and URL are indexed separately
>>    - rdfa schema.org representations of URIs do not translate to URL (ie.
>>    representation described at URL A, talking about URI B, does not get
>>    connected to representation described at URL B)
>>    - url parameters are not passed by a 303 redirect.
>>    - impact on functionality of google analytics tracking, eg. traversing the
>>    site is seen as a series of direct page visits.
>> 
>> Essentially - as far as search engines are concerned - every URL and URI is
>> an island, with no connections between them.  At best a URL can express a
>> rel=canonical back to its corresponding URI; no pagerank will flow through
>> links.
>> 
>> Any guidance you can provide would be appreciated.
>> 
>> -- 
>> 
>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>> | Mark Fallu
>> | Manager, Research Data (Acting)
>> | Office for Research
>> | Bray Centre (N54) 0.10E
>> | Griffith University, Nathan Campus
>> | Queensland 4111 AUSTRALIA
>> |
>> | E-mail: m.fallu@griffith.edu.au
>> | Mobile:  04177 69778
>> | Phone:  +61 (07) 373 52069
>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
>-- 
>++  Michael Brunnbauer
>++  netEstate GmbH
>++  Geisenhausener Straße 11a
>++  81379 München
>++  Tel +49 89 32 19 77 80
>++  Fax +49 89 32 19 77 89
>++  E-Mail brunni@netestate.de
>++  http://www.netestate.de/
>++
>++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
>++  USt-IdNr. DE221033342
>++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
>++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
Received on Wednesday, 23 July 2014 12:53:06 UTC