Re: Linked Data and Semantic Web CoolURIs, 303 redirects and Page Rank. from john.walker on 2014-07-23 (public-lod@w3.org from July 2014)

From: john.walker <john.walker@semaku.com>
Date: Wed, 23 Jul 2014 16:50:15 +0200 (CEST)
To: Michael Smethurst <michael.smethurst@bbc.co.uk>
Cc: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <1243004525.980714.1406127015892.open-xchange@oxweb03.eigbox.net>
Hi Michael,

Hope the laptop is ok :)

So I can think of your 'slash' NIR URI as something similar to a URN:
http://www.bbc.co.uk/programmes/b006mw1h/thing

It doesn't do much on its own and *just* acts as an identifier.
Using HTTP it can be resolved to a URL via the 303, kind of similar to a URN
resolver.
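
As a rough illustration of that resolver analogy, here is a minimal Python
sketch of the two-step flow from the "pretending we use slashes" example
quoted further down: the /thing URI answers with a 303 to the generic
document, and the document URI then negotiates in place and reports the
chosen representation via Content-Location. The respond() function, the
REPRESENTATIONS table and the .html/.json suffixes are assumptions made for
illustration (only the .rdf suffix appears in the thread), and the Accept
handling is reduced to an exact match on a single media type.

```python
# Hypothetical sketch, not the BBC's implementation.
BASE = "http://www.bbc.co.uk"

# Assumed mapping from media type to representation suffix.
REPRESENTATIONS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "application/json": ".json",
}


def respond(path, accept="text/html"):
    """Return (status, headers) for a GET on the given path."""
    if path.endswith("/thing"):
        # Non-information resource: nothing to send, so 303 to the document.
        document_path = path[: -len("/thing")]
        return 303, {"Location": BASE + document_path}

    # Generic document: pick a representation without a second redirect.
    media_type = accept if accept in REPRESENTATIONS else "text/html"
    return 200, {
        "Content-Type": media_type,
        "Content-Location": BASE + path + REPRESENTATIONS[media_type],
        "Vary": "Accept",
    }


first = respond("/programmes/b006mw1h/thing")
second = respond("/programmes/b006mw1h", "application/rdf+xml")
```

Only the first request pays for a redirect; any link that already points at
the document URI goes straight to the negotiated 200 response.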

Could you explain what you mean by "conneg penalty"?

I've set up an application working with 303s and, although I don't consider
myself mad, it does add an extra request to every click the user makes.
Getting the 303 response takes 20 - 25 ms on average, so it's not a big issue in
this case (internal company usage).

Interestingly enough I just checked a random shortened link off Twitter and it
went through no fewer than 5 HTTP 301/302 redirects (500 ms in total) before
getting the HTML.
Taking that into consideration a single 303 is not too bad!
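
For anyone wanting to make the same comparison, the bookkeeping can be
sketched as below. This is only a sketch under stated assumptions: follow()
and its fetch callback are hypothetical names, fetch is expected to return
(status, location, elapsed_ms) for one request, and the FAKE table holds
made-up example.org values for illustration, not measurements.

```python
REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow(url, fetch, max_hops=10):
    """Follow redirects and return (final_url, hops, total_ms)."""
    hops = 0
    total_ms = 0.0
    while hops <= max_hops:
        status, location, elapsed_ms = fetch(url)
        total_ms += elapsed_ms
        if status not in REDIRECT_CODES or location is None:
            return url, hops, total_ms
        url = location
        hops += 1
    raise RuntimeError("too many redirects")


# Made-up chain for illustration only: one 303 hop, then the document.
FAKE = {
    "http://example.org/id/1": (303, "http://example.org/doc/1", 22.0),
    "http://example.org/doc/1": (200, None, 80.0),
}

result = follow("http://example.org/id/1", FAKE.__getitem__)
```

Swapping the dictionary lookup for a real HTTP client that does not follow
redirects automatically would give per-hop timings for a live URL.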

Regards,

John Walker


> On July 23, 2014 at 3:55 PM Michael Smethurst <michael.smethurst@bbc.co.uk>
> wrote:
>
>
> Oops, dropped laptop :-/
>
> Continues....
>
> On 23/07/2014 14:50, "Michael Smethurst" <michael.smethurst@bbc.co.uk>
> wrote:
>
> >Hi Bill
> >
> >Bit of a difficult question to answer because the reality is probably
> >still quite disjointed. Various parts of bbc.co.uk:
> >- serve linked data
> >- store data as rdf (in a triple store)
> >- consume (to some extent) linked data
> >
> >But nowhere are all those things true in one place. So /programmes
> >publishes linked data but the backend is a relational database, whereas
> >things like sport / olympics are stored as linked data but don't publish
> >
> >So the 2 parts aren't really coupled
> >
> >I do half remember lots of conversations about hashes v slashes for
> >/programmes and /music but the sites are designed to be quite granular
> >(one thing per uri; one uri per thing) so we weren't really dealing with
> >lots of things in a document
> >
> >The linked data platform (our triple store) does use # uris like:
> http://www.bbc.co.uk/things/794274f1-d7ea-4ad2-9b36-c46ed55da9bd#id
>
>
> But I'm not best placed to know about the interfaces and queries onto this
> and why they chose hashes and not slashes. I'll ask around unless those
> people are already on this list...
>
> Not much help
> Sorry
> michael
> >
> >On 23/07/2014 14:19, "Bill Roberts" <bill@swirrl.com> wrote:
> >
> >>Hi Michael
> >>
> >>We've tended to use slash URIs where possible, because we have found it more
> >>convenient when doing URI dereferencing from a triple-store backed site -
> >>in which case we essentially do a DESCRIBE on the relevant URI.
> >>(So we do 303ing for non-information resources, though in practice in a
> >>lot of our applications, the great majority of content is statistical
> >>data, which we treat as information resources and respond with 200).
> >>
> >>How do you organise your data and generation of URI dereferencing
> >>responses with hash based URIs? I can see a variety of ways to do it,
> >>but I'd be interested to know what you have found most
> >>efficient/convenient at the BBC - essentially dealing with the fact that
> >>the server doesn't know about what comes after the #
> >>
> >>
> >>Thanks
> >>
> >>Bill
> >>
> >>On 23 Jul 2014, at 13:52, Michael Smethurst <michael.smethurst@bbc.co.uk>
> >>wrote:
> >>
> >>> Hello
> >>>
> >>> (Pretty sure I've made this comment before so please forgive any signs
> >>> of premature senility)
> >>>
> >>> I think this may be an unfortunate side effect of the conflation of the
> >>> 303 ("I can't send that") pattern with the content negotiation ("what
> >>> flavour would you like") pattern
> >>>
> >>> Lots of linked data applications (like dbpedia) seem to couple the two
> >>> things together. So you have an "individual" uri which, when you attempt
> >>> to dereference it, does a 303 *and* conneg in one step to the "display"
> >>> uri:
> >>> /resource > 303+conneg > /data
> >>> or
> >>> /resource > 303+conneg > /page
> >>>
> >>>
> >>> Many other linked data sites seem to have followed this pattern but it
> >>> does seem, to my eyes, broken
> >>>
> >>> At the BBC we have 3 flavours of uri. I'm not sure if these are the
> >>> appropriate / best labels but:
> >>> - the non-information resource uri. The uri that refers to the real
> >>> world physical / metaphysical thing
> >>> - the generic information resource uri that identifies the document but
> >>> not any specific representation of the document
> >>> - the representation uri (the html or json or rdf-xml etc)
> >>>
> >>> We tend to use hashes rather than slashes like
> >>> http://www.bbc.co.uk/programmes/b006mw1h#programme
> >>>
> >>>
> >>> But pretending we use slashes for a minute...
> >>>
> >>> If you requested:
> >>> http://www.bbc.co.uk/programmes/b006mw1h/thing
> >>>
> >>>
> >>> You'd get a 303 redirect to the generic document / information resource
> >>> uri:
> >>> http://www.bbc.co.uk/programmes/b006mw1h
> >>>
> >>>
> >>> Which would then conneg to the appropriate representation which would
> >>> still be served from:
> >>> http://www.bbc.co.uk/programmes/b006mw1h
> >>>
> >>> With a content location header of
> >>> http://www.bbc.co.uk/programmes/b006mw1h.rdf
> >>>
> >>> For example
> >>>
> >>> Whilst the rdf refers to the non-information resource uri when making
> >>> assertions about the "thing" this uri is not used elsewhere. All links
> >>> in the html point to the generic document uri not to the non-information
> >>> resource uri
> >>>
> >>> So crawlers like google just follow links from information resource to
> >>> information resource and never have to encounter 303s
> >>>
> >>> Picking up a conneg penalty for every request isn't without problems
> >>> (particularly given CDN serving) but picking up a 303 penalty for every
> >>> request would be madness and not something we'd ever have been able to
> >>> implement
> >>>
> >>> I do think the dbpedia conflation of 303 with conneg is an unhelpful
> >>> anti-pattern that people shouldn't be encouraged to follow. The conneg
> >>> part is just REST; "semantics" add the 303 onto that but they're not
> >>> doing the same thing
> >>>
> >>> Separating 303 from conneg still gives you "thing" vs document
> >>> separation, still maintains cool uris and doesn't kill your servers
> >>>
> >>> And we've never had a problem with seo
> >>>
> >>> Hth
> >>> michael
> >>>
> >>>
> >>>
> >>>
> >>> On 18/07/2014 16:52, "Michael Brunnbauer" <brunni@netestate.de> wrote:
> >>>
> >>>>
> >>>> Hello Mark,
> >>>>
> >>>> I cannot remember this important topic coming up earlier - which is a
> >>>> bit disturbing.
> >>>>
> >>>> The problem would be mitigated by people using the URI they see for
> >>>> linking.
> >>>>
> >>>> Why not use the HTML URLs in the HTML pages for internal page rank
> >>>> flow?
> >>>>
> >>>> How can URIs from sparql endpoints or OAI-PMH contribute to page rank?
> >>>>
> >>>> A real problem would be RDFa where href also sets the object of a
> >>>> triple.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Michael Brunnbauer
> >>>>
> >>>> On Fri, Jul 18, 2014 at 10:05:17PM +1000, Mark Fallu wrote:
> >>>>> If the links we present to the outside world for harvesting, e.g. via
> >>>>> sparql endpoint, OAI-PMH or open social widget etc., are the canonical
> >>>>> "individual" URI, clients will be able to get to the "display" url, but
> >>>>> the google page rank that would normally flow from these external links
> >>>>> will not.
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>> The specification of a 303 redirect describes it as:
> >>>>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
> >>>>>
> >>>>>> "The response to the request can be found under a different URI and
> >>>>>> SHOULD be retrieved using a GET method on that resource. This method
> >>>>>> exists primarily to allow the output of a POST-activated script to
> >>>>>> redirect the user agent to a selected resource. *The new URI is not a
> >>>>>> substitute reference for the originally requested resource*. The 303
> >>>>>> response MUST NOT be cached, but the response to the second
> >>>>>> (redirected) request might be cacheable.
> >>>>>>
> >>>>>> The different URI SHOULD be given by the Location field in the
> >>>>>> response. Unless the request method was HEAD, the entity of the
> >>>>>> response SHOULD contain a short hypertext note with a hyperlink to the
> >>>>>> new URI(s)."
> >>>>>
> >>>>>
> >>>>> Google correctly implements the specification and does not assign the
> >>>>> page rank of the "individual" URI to the "display" URL as it is "*not a
> >>>>> substitute reference for the originally requested resource*".
> >>>>>
> >>>>> The same is true of internal links: a high page rank home page will not
> >>>>> pass page rank on to "display" urls if the pathway to those urls is via
> >>>>> "individual" uri links.
> >>>>>
> >>>>> I am not sure what the solution is here as it seems the realms of SEO
> >>>>> and the conventions of the web they are built on are not a good fit for
> >>>>> semantic web best practice.
> >>>>>
> >>>>> The most minimal compromise I can think of is to move away from the use
> >>>>> of a 303 redirect to a redirect that conserves the flow of google page
> >>>>> rank.
> >>>>>
> >>>>> - "302 Found" redirect is the recommended replacement for 303 for
> >>>>>   clients that do not support HTTP 1.1 and it does allow a certain
> >>>>>   amount of google page rank to flow.
> >>>>> - "301 Moved Permanently" is a poor fit for the Cool URI pattern, but
> >>>>>   passes on the full page rank of the links.
> >>>>> - rewriting all URIs to the URL would also work, but would break the
> >>>>>   coolURI pattern.
> >>>>>
> >>>>> The pragmatist in me feels that if we are going to make a change for
> >>>>> the purposes of SEO, it might as well be the one with best return, i.e.
> >>>>> 301 redirect.
> >>>>>
> >>>>> Note: Indexing is not the problem here, content is indexed. The issue
> >>>>> relates to page rank not flowing through a 303 redirect.
> >>>>>
> >>>>> I have tested and can confirm that 303 redirects are an issue for a
> >>>>> number of reasons:
> >>>>>
> >>>>> - page rank does not flow through a 303 redirect
> >>>>> - page rank can not be assigned from a url to a uri with a
> >>>>>   rel=canonical tag if the URI does a 303 redirect (preventing
> >>>>>   aggregation of pagerank from external links to URL)
> >>>>> - URI and URL are indexed separately
> >>>>> - rdfa schema.org representations of URIs do not translate to URL
> >>>>>   (i.e. representation described at URL A, talking about URI B, does
> >>>>>   not get connected to representation described at URL B)
> >>>>> - url parameters are not passed by a 303 redirect.
> >>>>> - impact on functionality of google analytics tracking, e.g.
> >>>>>   traversing the site is seen as a series of direct page visits.
> >>>>>
> >>>>> Essentially - as far as search engines are concerned - every URL and
> >>>>> URI is an island, with no connections between them. At best a URL can
> >>>>> express a rel=canonical back to its corresponding URI; no pagerank will
> >>>>> flow through links.
> >>>>>
> >>>>> Any guidance you can provide would be appreciated.
> >>>>>
> >>>>> --
> >>>>>
> >>>>>
> >>>>>o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> >>>>> | Mark Fallu
> >>>>> | Manager, Research Data (Acting)
> >>>>> | Office for Research
> >>>>> | Bray Centre (N54) 0.10E
> >>>>> | Griffith University, Nathan Campus
> >>>>> | Queensland 4111 AUSTRALIA
> >>>>> |
> >>>>> | E-mail: m.fallu@griffith.edu.au
> >>>>> | Mobile: 04177 69778
> >>>>> | Phone: +61 (07) 373 52069
> >>>>>
> >>>>>o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> >>>>
> >>>> --
> >>>> ++ Michael Brunnbauer
> >>>> ++ netEstate GmbH
> >>>> ++ Geisenhausener Straße 11a
> >>>> ++ 81379 München
> >>>> ++ Tel +49 89 32 19 77 80
> >>>> ++ Fax +49 89 32 19 77 89
> >>>> ++ E-Mail brunni@netestate.de
> >>>> ++ http://www.netestate.de/
> >>>> ++
> >>>> ++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
> >>>> ++ USt-IdNr. DE221033342
> >>>> ++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
> >>>> ++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
> >>>
> >>>
> >>
> >
>
>
Received on Wednesday, 23 July 2014 14:50:38 UTC