Re: Linked Data and Semantic Web CoolURIs, 303 redirects and Page Rank. from Michael Smethurst on 2014-07-23 (public-lod@w3.org from July 2014)

From: Michael Smethurst <michael.smethurst@bbc.co.uk>
Date: Wed, 23 Jul 2014 13:50:58 +0000
To: Bill Roberts <bill@swirrl.com>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <CFF57A1D.5DD07%michael.smethurst@bbc.co.uk>
Hi Bill

Bit of a difficult question to answer because the reality is probably
still quite disjointed. Various parts of bbc.co.uk:
- serve linked data
- store data as rdf (in a triple store)
- consume (to some extent) linked data

But nowhere are all those things true in one place. So /programmes
publishes linked data but the backend is a relational database, whereas
things like sport / olympics are stored as linked data but don't publish it

So the 2 parts aren't really coupled

I do half remember lots of conversations about hashes v slashes for
/programmes and /music but the sites are designed to be quite granular
(one thing per uri; one uri per thing) so we weren't really dealing with
lots of things in a document

The linked data platform (our triple store) does use # uris like:

On 23/07/2014 14:19, "Bill Roberts" <bill@swirrl.com> wrote:

>Hi Michael
>
>We've tended to use slash URIs where possible, because we have found it more
>convenient when doing URI dereferencing from a triple-store backed site -
>in which case we essentially do a DESCRIBE on the relevant URI.
>(So we do 303ing for non-information resources, though in practice in a
>lot of our applications, the great majority of content is statistical
>data, which we treat as information resources and respond with 200).
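
A rough sketch of the dereferencing approach described above: 200 plus a
DESCRIBE for information resources, 303 for everything else. The example.org
URIs and the /id/ to /doc/ mapping are assumptions made for illustration only;
the message does not say how the document URI is derived.

```python
# Characters that may not appear inside <...> in a SPARQL IRI reference.
FORBIDDEN_IRI_CHARS = '<>" {}|\\^`'


def describe_query(uri):
    """Build a SPARQL DESCRIBE query for a single URI."""
    if any(c in uri for c in FORBIDDEN_IRI_CHARS):
        raise ValueError("URI contains characters not allowed in a SPARQL IRI")
    return f"DESCRIBE <{uri}>"


def dereference(uri, is_information_resource):
    """Return (status, location, sparql) for a GET on `uri`.

    Information resources are answered directly (200) with the result of a
    DESCRIBE; non-information resources get a 303 to a document URI.
    """
    if is_information_resource:
        return 200, None, describe_query(uri)
    # Hypothetical mapping from "thing" URI to document URI.
    doc_uri = uri.replace("/id/", "/doc/", 1)
    return 303, doc_uri, None


if __name__ == "__main__":
    print(dereference("http://example.org/data/population", True))
    print(dereference("http://example.org/id/city/leeds", False))
```
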
>
>How do you organise your data and generation of URI dereferencing
>responses with hash based URIs?  I can see a variety of ways to do it,
>but I'd be interested to know what you have found most
>efficient/convenient at the BBC - essentially dealing with the fact that
>the server doesn't know about what comes after the #
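
To illustrate the point about the server not seeing what follows the #: an
HTTP client strips the fragment before sending the request, so only the
document URI reaches the server. A minimal sketch using the standard library
(the helper name is made up for this example):

```python
from urllib.parse import urldefrag


def split_hash_uri(uri):
    """Split a hash URI into the part a server sees and the fragment.

    Only the first element is ever sent over HTTP; the fragment is
    resolved by the client against the document that comes back.
    """
    document_uri, fragment = urldefrag(uri)
    return document_uri, fragment


if __name__ == "__main__":
    print(split_hash_uri("http://www.bbc.co.uk/programmes/b006mw1h#programme"))
```

So a hash-based site only ever has to answer requests for the document URI,
and the returned data has to describe every #thing defined in that document.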
>
>
>Thanks
>
>Bill
>
>On 23 Jul 2014, at 13:52, Michael Smethurst <michael.smethurst@bbc.co.uk>
>wrote:
>
>> Hello
>> 
>> (Pretty sure I've made this comment before so please forgive any signs
>> of premature senility)
>> 
>> I think this may be an unfortunate side effect of the conflation of the
>> 303 ("I can't send that") pattern with the content negotiation ("what
>> flavour would you like") pattern
>> 
>> Lots of linked data applications (like dbpedia) seem to couple the two
>> things together. So you have an "individual" uri which, when you attempt
>> to dereference it, does a 303 *and* conneg in one step to the "display" uri:
>> /resource > 303+conneg > /data
>> or
>> /resource > 303+conneg > /page
>> 
>> 
>> Many other linked data sites seem to have followed this pattern but it
>> does seem, to my eyes, broken
>> 
>> At the BBC we have 3 flavours of uri. I'm not sure if these are the
>> appropriate / best labels but:
>> - the non-information resource uri. The uri that refers to the real
>> world physical / metaphysical thing
>> - the generic information resource uri that identifies the document but
>> not any specific representation of the document
>> - the representation uri (the html or json or rdf-xml etc)
>> 
>> We tend to use hashes rather than slashes like
>> http://www.bbc.co.uk/programmes/b006mw1h#programme
>> 
>> 
>> But pretending we use slashes for a minute...
>> 
>> If you requested:
>> http://www.bbc.co.uk/programmes/b006mw1h/thing
>> 
>> 
>> You'd get a 303 redirect to the generic document / information resource
>> uri:
>> http://www.bbc.co.uk/programmes/b006mw1h
>> 
>> 
>> Which would then conneg to the appropriate representation which would
>> still be served from:
>> http://www.bbc.co.uk/programmes/b006mw1h
>> 
>> With a content location header of
>> http://www.bbc.co.uk/programmes/b006mw1h.rdf
>> 
>> For example
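
The two-step flow described above (303 from the "thing" uri first, then conneg
on the generic document uri) could be sketched roughly as below. This is not
BBC code: the /thing suffix comes from the "pretending we use slashes" example,
while the media type table, the minimal Accept parsing and the function names
are assumptions for illustration.

```python
# Suffix of the "thing" uri in the slash-based thought experiment.
THING_SUFFIX = "/thing"

# Media types mapped to representation suffixes (illustrative list only).
REPRESENTATIONS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "application/json": ".json",
}


def pick_media_type(accept):
    """Tiny Accept parser: highest q wins, first listed breaks ties.

    Wildcards and unknown types are ignored; falls back to text/html.
    """
    best, best_q = None, 0.0
    for item in accept.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type, q = parts[0], 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if media_type in REPRESENTATIONS and q > best_q:
            best, best_q = media_type, q
    return best or "text/html"


def respond(path, accept="text/html"):
    """Return (status, headers) for a GET on `path`."""
    if path.endswith(THING_SUFFIX):
        # Step 1: the thing itself can't be sent, so 303 to the generic
        # document uri. No content negotiation happens at this step.
        return 303, {"Location": path[: -len(THING_SUFFIX)]}
    # Step 2: conneg on the generic document uri. The response is served
    # from the same uri; Content-Location names the chosen representation.
    media_type = pick_media_type(accept)
    return 200, {
        "Content-Type": media_type,
        "Content-Location": path + REPRESENTATIONS[media_type],
        "Vary": "Accept",
    }


if __name__ == "__main__":
    print(respond("/programmes/b006mw1h/thing", "application/rdf+xml"))
    print(respond("/programmes/b006mw1h", "application/rdf+xml"))
```

Because html links point at the generic document uri, a crawler only ever
exercises the second branch and never sees the 303.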
>> 
>> Whilst the rdf refers to the non-information resource uri when making
>> assertions about the "thing" this uri is not used elsewhere. All links in
>> the html point to the generic document uri not to the non-information
>> resource uri
>> 
>> So crawlers like google just follow links from information resource to
>> information resource and never have to encounter 303s
>> 
>> Picking up a conneg penalty for every request isn't without problems
>> (particularly given CDN serving) but picking up a 303 penalty for every
>> request would be madness and not something we'd ever have been able to
>> implement
>> 
>> I do think the dbpedia conflation of 303 with conneg is an unhelpful
>> anti-pattern that people shouldn't be encouraged to follow. The conneg
>> part is just REST; "semantics" add the 303 onto that but they're not
>> doing the same thing
>> 
>> Separating 303 from conneg still gives you "thing" vs document
>> separation, still maintains cool uris and doesn't kill your servers
>> 
>> And we've never had a problem with seo
>> 
>> Hth
>> michael
>> 
>> 
>> 
>> 
>> On 18/07/2014 16:52, "Michael Brunnbauer" <brunni@netestate.de> wrote:
>> 
>>> 
>>> Hello Mark,
>>> 
>>> I cannot remember this important topic coming up earlier - which is a
>>> bit disturbing.
>>> 
>>> The problem would be mitigated by people using the URI they see for
>>> linking.
>>> 
>>> Why not use the HTML URLs in the HTML pages for internal page rank flow?
>>> 
>>> How can URIs from sparql endpoints or OAI-PMH contribute to page rank?
>>> 
>>> A real problem would be RDFa where href also sets the object of a triple.
>>> 
>>> Regards,
>>> 
>>> Michael Brunnbauer
>>> 
>>> On Fri, Jul 18, 2014 at 10:05:17PM +1000, Mark Fallu wrote:
>>>> If the links we present to the outside world for harvesting (e.g. via
>>>> sparql endpoint, OAI-PMH or open social widget etc.) are the canonical
>>>> "individual" URI, clients will be able to get to the "display" url, but
>>>> the google page rank that would normally flow from these external links
>>>> will not.
>>> 
>>> 
>>> 
>>>> 
>>>> The specification of a 303 redirect describes it as:
>>>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
>>>> 
>>>>> "The response to the request can be found under a different URI and
>>>>> SHOULD be retrieved using a GET method on that resource. This method
>>>>> exists primarily to allow the output of a POST-activated script to
>>>>> redirect the user agent to a selected resource. *The new URI is not a
>>>>> substitute reference for the originally requested resource*. The 303
>>>>> response MUST NOT be cached, but the response to the second
>>>>> (redirected) request might be cacheable.
>>>>> 
>>>>> The different URI SHOULD be given by the Location field in the
>>>>> response. Unless the request method was HEAD, the entity of the
>>>>> response SHOULD contain a short hypertext note with a hyperlink to the
>>>>> new URI(s)."
>>>> 
>>>> 
>>>> Google correctly implements the specification and does not assign the
>>>> page rank of the "individual" URI to the "display" URL as it is "*not a
>>>> substitute reference for the originally requested resource*".
>>>> 
>>>> The same is true of internal links, a high page rank home page will not
>>>> pass page rank on to "display" urls if the pathway to those urls is via
>>>> "individual" uri links.
>>>> 
>>>> I am not sure what the solution is here as it seems the realms of SEO
>>>> and the conventions of the web they are built on are not a good fit for
>>>> semantic web best practice.
>>>> 
>>>> The most minimal compromise I can think of is to move away from the use
>>>> of a 303 redirect to a redirect that conserves the flow of google page
>>>> rank.
>>>> 
>>>>   - "302 Found" redirect is the recommended replacement for 303 for
>>>>   clients that do not support HTTP 1.1 and it does allow a certain
>>>>   amount of google page rank to flow.
>>>>   - "301 Moved Permanently" is a poor fit for the Cool URI pattern, but
>>>>   passes on the full page rank of the links.
>>>>   - rewriting all URIs to the URL would also work, but would break the
>>>>   coolURI pattern.
>>>> 
>>>> The pragmatist in me feels that if we are going to make a change for
>>>> the purposes of SEO, it might as well be the one with best return, i.e.
>>>> 301 redirect.
>>>> 
>>>> Note: Indexing is not the problem here, content is indexed.  The issue
>>>> relates to page rank not flowing through a 303 redirect.
>>>> 
>>>> I have tested and can confirm that 303 redirects are an issue for a
>>>> number of reasons:
>>>> 
>>>>   - page rank does not flow through a 303 redirect
>>>>   - page rank cannot be assigned from a url to a uri with a
>>>>   rel=canonical tag if the URI does a 303 redirect (preventing
>>>>   aggregation of pagerank from external links to the URL)
>>>>   - URI and URL are indexed separately
>>>>   - rdfa schema.org representations of URIs do not translate to the URL
>>>>   (i.e. a representation described at URL A, talking about URI B, does
>>>>   not get connected to the representation described at URL B)
>>>>   - url parameters are not passed by a 303 redirect
>>>>   - impact on functionality of google analytics tracking, e.g.
>>>>   traversing the site is seen as a series of direct page visits
>>>> 
>>>> Essentially - as far as search engines are concerned - every URL and
>>>> URI is an island, with no connections between them.  At best a URL can
>>>> express a rel=canonical back to its corresponding URI, but no pagerank
>>>> will flow through links.
>>>> 
>>>> Any guidance you can provide would be appreciated.
>>>> 
>>>> -- 
>>>> 
>>>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>>> | Mark Fallu
>>>> | Manager, Research Data (Acting)
>>>> | Office for Research
>>>> | Bray Centre (N54) 0.10E
>>>> | Griffith University, Nathan Campus
>>>> | Queensland 4111 AUSTRALIA
>>>> |
>>>> | E-mail: m.fallu@griffith.edu.au
>>>> | Mobile:  04177 69778
>>>> | Phone:  +61 (07) 373 52069
>>>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>> 
>>> -- 
>>> ++  Michael Brunnbauer
>>> ++  netEstate GmbH
>>> ++  Geisenhausener Straße 11a
>>> ++  81379 München
>>> ++  Tel +49 89 32 19 77 80
>>> ++  Fax +49 89 32 19 77 89
>>> ++  E-Mail brunni@netestate.de
>>> ++  http://www.netestate.de/
>>> ++
>>> ++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
>>> ++  USt-IdNr. DE221033342
>>> ++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
>>> ++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
>> 
>> 
>
Received on Wednesday, 23 July 2014 13:51:31 UTC