Re: Linked Data and Semantic Web CoolURIs, 303 redirects and Page Rank. from Michael Smethurst on 2014-07-23 (public-lod@w3.org from July 2014)

From: Michael Smethurst <michael.smethurst@bbc.co.uk>
Date: Wed, 23 Jul 2014 18:05:08 +0000
To: "john.walker" <john.walker@semaku.com>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <CFF58FEB.5DD34%michael.smethurst@bbc.co.uk>
On 23/07/2014 15:50, "john.walker" <john.walker@semaku.com> wrote:

>Hi Michael,

Hiya

> 
>  
>Hope the laptop is ok :)

Survived another drop

> 
>  
>So I can think of your 'slash' NIR URI as something similar to a URN:
>http://www.bbc.co.uk/programmes/b006mw1h/thing
>  
>It doesn't do much on its own and *just* acts as an identifier.


(Ignoring the fact we actually use hashes, if we did use slashes, then)
yes. It's just an identifier for a "real world thing". The rdf and rdfa
use it to make assertions:

<a typeof="po:Brand" about="/programmes/b006mw1h#programme"
href="/programmes/b006mw1h" title="Gardeners' World">

but @href links don't travel through them


> 
>Using HTTP it can be resolved to a URL via the 303, kind of similar to a
>URN resolver.

Guess so yes, a urn that doesn't need a urn resolver cos it's an http
uri....
>
>  
>Could you explain what you mean by "conneg penalty"?

Every time a "normal user" clicks a link on the bits of bbc.co.uk that
support linked data, they click to the "generic document resource uri"
which then does the conneg bit to serve an appropriate representation. So
it's extra work at the server end but mostly cacheable, except it gets a
bit tricky with CDNs
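
A minimal sketch of the pattern described in this thread, pretending (as
the earlier message does) that slash uris are in use: the "thing" uri only
does a 303, and the generic document uri only does conneg. The function
names, the media type table, the ".html" and ".json" suffixes and the
"Vary: Accept" header are assumptions for illustration, not the BBC's
actual implementation. The Accept parsing is deliberately naive (no
q-values, no wildcards).

```python
from typing import Dict, Tuple

# Hypothetical table of supported media types and representation suffixes.
REPRESENTATIONS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "application/json": ".json",
}
DEFAULT_TYPE = "text/html"


def pick_media_type(accept: str) -> str:
    """Return the first supported media type named in the Accept header."""
    for part in accept.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in REPRESENTATIONS:
            return media_type
    return DEFAULT_TYPE


def respond(path: str, accept: str = "") -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a request path."""
    if path.endswith("/thing"):
        # "Thing" uri: 303 to the generic document uri, no conneg here.
        return 303, {"Location": path[: -len("/thing")]}
    # Generic document uri: conneg in place, served from the same uri.
    media_type = pick_media_type(accept)
    return 200, {
        "Content-Type": media_type,
        "Content-Location": path + REPRESENTATIONS[media_type],
        # Assumption: caches are told the response varies by Accept.
        "Vary": "Accept",
    }
```

Because links in the html only ever point at the generic document uri,
the 303 branch is only hit by clients that start from the identifier.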

> 
>  
>I've set up an application working with 303s and, although I don't
>consider myself mad, it does add an extra request to every click the user
>does.

Guess the madness quotient would depend on how much traffic you have to
cope with. For the BBC to add an additional request for every request for
a Doctor Who page would have been madness

>
>Getting the 303 response takes 20 - 25 ms on average, so it's not a big
>issue in this case (internal company usage).

For internal usage it's all probably fine. But I still think it's a
pattern that shouldn't be generally encouraged. On a high traffic website
it's just more requests that aren't really adding anything. I think if
we'd suggested the dbpedia style pattern at the BBC we'd never have gotten
permission to serve linked data
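
As a rough sketch of the "more requests" point: if every html link points
at the "thing" uri, each page view costs a 303 plus the final 200, whereas
linking straight to the document uri costs one request. The function names
are hypothetical, and the model assumes a single redirect hop and no
caching anywhere.

```python
def requests_per_click(links_point_at_thing_uri: bool) -> int:
    """HTTP requests the origin handles for one page view."""
    redirect_hops = 1 if links_point_at_thing_uri else 0
    return redirect_hops + 1  # plus the final 200 response


def total_requests(page_views: int, links_point_at_thing_uri: bool) -> int:
    """Total requests for a given number of page views."""
    return page_views * requests_per_click(links_point_at_thing_uri)
```

Under those assumptions the 303-on-every-link pattern doubles the request
count for the same number of page views.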
>
>  
>Interestingly enough I just checked a random shortened link off Twitter
>and it went through no less than 5 HTTP 301/302 redirects (500 ms in
>total) before getting the HTML.

Yeah, it's a shambles innit :-/
>
>Taking that into consideration a single 303 is not too bad!

In comparison to link shortener madness it's not that mad. But it's a
redirect your servers have to handle and link shorteners are someone
else's problem. Kinda

michael
> 
>  
>Regards, 
>
>John Walker 
>
>
>
>> On July 23, 2014 at 3:55 PM Michael Smethurst
>><michael.smethurst@bbc.co.uk> wrote:
>
>> 
>> 
>> Oops, dropped laptop :-/
>> 
>> Continues.... 
>> 
>> On 23/07/2014 14:50, "Michael Smethurst" <michael.smethurst@bbc.co.uk>
>> wrote: 
>> 
>> >Hi Bill 
>> > 
>> >Bit of a difficult question to answer because the reality is probably
>> >still quite disjointed. Various parts of bbc.co.uk:
>> >- serve linked data
>> >- store data as rdf (in a triple store)
>> >- consume (to some extent) linked data
>> > 
>> >But nowhere are all those things true in one place. So /programmes
>> >publishes linked data but the backend is a relational database, whereas
>> >things like sport / olympics are stored as linked data but don't publish
>> > 
>> >So the 2 parts aren't really coupled
>> > 
>> >I do half remember lots of conversations about hashes v slashes for
>> >/programmes and /music but the sites are designed to be quite granular
>> >(one thing per uri; one uri per thing) so we weren't really dealing with
>> >lots of things in a document
>> > 
>> >The linked data platform (our triple store) does use # uris like:
>> http://www.bbc.co.uk/things/794274f1-d7ea-4ad2-9b36-c46ed55da9bd#id
>> 
>> 
>> But I'm not best placed to know about the interfaces and queries onto
>> this and why they chose hashes and not slashes. I'll ask around unless
>> those people are already on this list...
>> 
>> Not much help 
>> Sorry 
>> michael 
>> > 
>> >On 23/07/2014 14:19, "Bill Roberts" <bill@swirrl.com> wrote:
>> > 
>> >>Hi Michael 
>> >> 
>> >>We've tended to use slash URIs where possible, because we have found it
>> >>more convenient when doing URI dereferencing from a triple-store backed
>> >>site - in which case we essentially do a DESCRIBE on the relevant URI.
>> >>(So we do 303ing for non-information resources, though in practice in a
>> >>lot of our applications, the great majority of content is statistical
>> >>data, which we treat as information resources and respond with 200).
>> >> 
>> >>How do you organise your data and generation of URI dereferencing
>> >>responses with hash based URIs? I can see a variety of ways to do it,
>> >>but I'd be interested to know what you have found most
>> >>efficient/convenient at the BBC - essentially dealing with the fact that
>> >>the server doesn't know about what comes after the #
>> >> 
>> >> 
>> >>Thanks 
>> >> 
>> >>Bill 
>> >> 
>> >>On 23 Jul 2014, at 13:52, Michael Smethurst <michael.smethurst@bbc.co.uk>
>> >>wrote: 
>> >> 
>> >>> Hello 
>> >>> 
>> >>> (Pretty sure I've made this comment before so please forgive any signs
>> >>> of premature senility)
>> >>> 
>> >>> I think this may be an unfortunate side effect of the conflation of the
>> >>> 303 ("I can't send that") pattern with the content negotiation ("what
>> >>> flavour would you like") pattern
>> >>> 
>> >>> Lots of linked data applications (like dbpedia) seem to couple the two
>> >>> things together. So you have an "individual" uri which, when you attempt
>> >>> to dereference it, does a 303 *and* conneg in one step to the "display"
>> >>> uri: 
>> >>> /resource > 303+conneg > /data
>> >>> or 
>> >>> /resource > 303+conneg > /page
>> >>> 
>> >>> 
>> >>> Many other linked data sites seem to have followed this pattern but it
>> >>> does seem, to my eyes, broken
>> >>> 
>> >>> At the BBC we have 3 flavours of uri. I'm not sure if these are the
>> >>> appropriate / best labels but:
>> >>> - the non-information resource uri. The uri that refers to the real
>> >>> world physical / metaphysical thing
>> >>> - the generic information resource uri that identifies the document but
>> >>> not any specific representation of the document
>> >>> - the representation uri (the html or json or rdf-xml etc)
>> >>> 
>> >>> We tend to use hashes rather than slashes like
>> >>> http://www.bbc.co.uk/programmes/b006mw1h#programme
>> >>> 
>> >>> 
>> >>> But pretending we use slashes for a minute...
>> >>> 
>> >>> If you requested:
>> >>> http://www.bbc.co.uk/programmes/b006mw1h/thing
>> >>> 
>> >>> 
>> >>> You'd get a 303 redirect to the generic document / information resource
>> >>> uri: 
>> >>> http://www.bbc.co.uk/programmes/b006mw1h
>> >>> 
>> >>> 
>> >>> Which would then conneg to the appropriate representation which would
>> >>> still be served from:
>> >>> http://www.bbc.co.uk/programmes/b006mw1h
>> >>> 
>> >>> With a content location header of
>> >>> http://www.bbc.co.uk/programmes/b006mw1h.rdf
>> >>> 
>> >>> For example 
>> >>> 
>> >>> Whilst the rdf refers to the non-information resource uri when making
>> >>> assertions about the "thing" this uri is not used elsewhere. All links
>> >>> in the html point to the generic document uri not to the non-information
>> >>> resource uri
>> >>> 
>> >>> So crawlers like google just follow links from information resource to
>> >>> information resource and never have to encounter 303s
>> >>> 
>> >>> Picking up a conneg penalty for every request isn't without problems
>> >>> (particularly given CDN serving) but picking up a 303 penalty for every
>> >>> request would be madness and not something we'd ever have been able to
>> >>> implement 
>> >>> 
>> >>> I do think the dbpedia conflation of 303 with conneg is an unhelpful
>> >>> anti-pattern that people shouldn't be encouraged to follow. The conneg
>> >>> part is just REST; "semantics" add the 303 onto that but they're not
>> >>> doing the same thing
>> >>> 
>> >>> Separating 303 from conneg still gives you "thing" vs document
>> >>> separation, still maintains cool uris and doesn't kill your servers
>> >>> 
>> >>> And we've never had a problem with seo
>> >>> 
>> >>> Hth 
>> >>> michael 
>> >>> 
>> >>> 
>> >>> 
>> >>> 
>> >>> On 18/07/2014 16:52, "Michael Brunnbauer" <brunni@netestate.de> wrote:
>> >>> 
>> >>>> 
>> >>>> Hello Mark,
>> >>>> 
>> >>>> I cannot remember this important topic coming up earlier - which is a
>> >>>> bit disturbing.
>> >>>> 
>> >>>> The problem would be mitigated by people using the URI they see for
>> >>>> linking. 
>> >>>> 
>> >>>> Why not use the HTML URLs in the HTML pages for internal page rank flow?
>> >>>> 
>> >>>> How can URIs from sparql endpoints or OAI-PMH contribute to page rank?
>> >>>> 
>> >>>> A real problem would be RDFa where href also sets the object of a
>> >>>> triple. 
>> >>>> 
>> >>>> Regards, 
>> >>>> 
>> >>>> Michael Brunnbauer
>> >>>> 
>> >>>> On Fri, Jul 18, 2014 at 10:05:17PM +1000, Mark Fallu wrote:
>> >>>>> If the links we present to the outside world for harvesting, e.g. via
>> >>>>> sparql endpoint, OAI-PMH or open social widget etc, are the canonical
>> >>>>> "individual" URIs, clients will be able to get to the "display" url,
>> >>>>> but the google page rank that would normally flow from these external
>> >>>>> links will not.
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>>> 
>> >>>>> The specification of a 303 redirect describes it as:
>> >>>>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
>> >>>>> 
>> >>>>>> "The response to the request can be found under a different URI and
>> >>>>>> SHOULD be retrieved using a GET method on that resource. This method
>> >>>>>> exists primarily to allow the output of a POST-activated script to
>> >>>>>> redirect the user agent to a selected resource. *The new URI is not a
>> >>>>>> substitute reference for the originally requested resource*. The 303
>> >>>>>> response MUST NOT be cached, but the response to the second
>> >>>>>> (redirected) request might be cacheable.
>> >>>>>> 
>> >>>>>> The different URI SHOULD be given by the Location field in the
>> >>>>>> response. Unless the request method was HEAD, the entity of the
>> >>>>>> response SHOULD contain a short hypertext note with a hyperlink to
>> >>>>>> the new URI(s)."
>> >>>>> 
>> >>>>> 
>> >>>>> Google correctly implements the specification and does not assign the
>> >>>>> page rank of the "individual" URI to the "display" URL as it is "*not a
>> >>>>> substitute reference for the originally requested resource*".
>> >>>>> 
>> >>>>> The same is true of internal links: a high page rank home page will not
>> >>>>> pass page rank on to "display" urls if the pathway to those urls is via
>> >>>>> "individual" uri links.
>> >>>>> 
>> >>>>> I am not sure what the solution is here as it seems the realms of SEO
>> >>>>> and the conventions of the web they are built on are not a good fit for
>> >>>>> semantic web best practice.
>> >>>>> 
>> >>>>> The most minimal compromise I can think of is to move away from the use
>> >>>>> of a 303 redirect to a redirect that conserves the flow of google page
>> >>>>> rank. 
>> >>>>> 
>> >>>>> - "302 Found" redirect is the recommended replacement for 303 for
>> >>>>> clients that do not support HTTP 1.1 and it does allow a certain
>> >>>>> amount of google page rank to flow.
>> >>>>> - "301 Moved Permanently" is a poor fit for the Cool URI pattern, but
>> >>>>> passes on the full page rank of the links.
>> >>>>> - rewriting all URIs to the URL would also work, but would break the
>> >>>>> coolURI pattern.
>> >>>>> 
>> >>>>> The pragmatist in me feels that if we are going to make a change for the
>> >>>>> purposes of SEO, it might as well be the one with best return, i.e. 301
>> >>>>> redirect. 
>> >>>>> 
>> >>>>> Note: Indexing is not the problem here, content is indexed. The issue
>> >>>>> relates to page rank not flowing through a 303 redirect.
>> >>>>> 
>> >>>>> I have tested and can confirm that 303 redirects are an issue for a
>> >>>>> number of reasons:
>> >>>>> 
>> >>>>> - page rank does not flow through a 303 redirect
>> >>>>> - page rank can not be assigned from a url to a uri with a
>> >>>>> rel=canonical tag if URI does a 303 redirect (preventing aggregation
>> >>>>> of pagerank from external links to URL)
>> >>>>> - URI and URL are indexed separately
>> >>>>> - rdfa schema.org representations of URIs do not translate to URL
>> >>>>> (ie. representation described at URL A, talking about URI B, does not
>> >>>>> get connected to representation described at URL B)
>> >>>>> - url parameters are not passed by a 303 redirect.
>> >>>>> - impact on functionality of google analytics tracking eg. traversing
>> >>>>> the site is seen as a series of direct page visits.
>> >>>>> 
>> >>>>> Essentially - as far as search engines are concerned - every URL and
>> >>>>> URI is an island, with no connections between them. At best a URL can
>> >>>>> express a rel=canonical back to its corresponding URI, no pagerank
>> >>>>> will flow through links.
>> >>>>> 
>> >>>>> Any guidance you can provide would be appreciated.
>> >>>>> 
>> >>>>> -- 
>> >>>>> 
>> >>>>> 
>> 
>>>>>>>o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>>>>>>=- 
>> >>>>> | Mark Fallu
>> >>>>> | Manager, Research Data (Acting)
>> >>>>> | Office for Research
>> >>>>> | Bray Centre (N54) 0.10E
>> >>>>> | Griffith University, Nathan Campus
>> >>>>> | Queensland 4111 AUSTRALIA
>> >>>>> | 
>> >>>>> | E-mail: m.fallu@griffith.edu.au
>> >>>>> | Mobile: 04177 69778
>> >>>>> | Phone: +61 (07) 373 52069
>> >>>>> 
>> 
>>>>>>>o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>>>>>>=- 
>> >>>> 
>> >>>> -- 
>> >>>> ++ Michael Brunnbauer
>> >>>> ++ netEstate GmbH
>> >>>> ++ Geisenhausener Straße 11a
>> >>>> ++ 81379 München
>> >>>> ++ Tel +49 89 32 19 77 80
>> >>>> ++ Fax +49 89 32 19 77 89
>> >>>> ++ E-Mail brunni@netestate.de
>> >>>> ++ http://www.netestate.de/
>> >>>> ++ 
>> >>>> ++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
>> >>>> ++ USt-IdNr. DE221033342
>> >>>> ++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
>> >>>> ++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
>> >>> 
>> >>> 
>> >> 
>> > 
>> 
>>
Received on Wednesday, 23 July 2014 18:05:41 UTC