- From: Michael Smethurst <michael.smethurst@bbc.co.uk>
- Date: Wed, 23 Jul 2014 13:50:58 +0000
- To: Bill Roberts <bill@swirrl.com>
- CC: "public-lod@w3.org" <public-lod@w3.org>
Hi Bill

Bit of a difficult question to answer because the reality is probably
still quite disjointed. Various parts of bbc.co.uk:
- serve linked data
- store data as rdf (in a triple store)
- consume (to some extent) linked data

But nowhere are all those things true in one place. So /programmes
publishes linked data but the backend is a relational database, whereas
things like sport / olympics are stored as linked data but don't publish it

So the 2 parts aren't really coupled

I do half remember lots of conversations about hashes v slashes for
/programmes and /music but the sites are designed to be quite granular
(one thing per uri; one uri per thing) so we weren't really dealing with
lots of things in a document

The linked data platform (our triple store) does use # uris like:

On 23/07/2014 14:19, "Bill Roberts" <bill@swirrl.com> wrote:

>Hi Michael
>
>We've tended to use slash URIs where possible, because we have found it more
>convenient when doing URI dereferencing from a triple-store backed site -
>in which case we essentially do a DESCRIBE on the relevant URI.
>(So we do 303ing for non-information resources, though in practice in a
>lot of our applications, the great majority of content is statistical
>data, which we treat as information resources and respond with 200).
>
>How do you organise your data and generation of URI dereferencing
>responses with hash based URIs?
>I can see a variety of ways to do it,
>but I'd be interested to know what you have found most
>efficient/convenient at the BBC - essentially dealing with the fact that
>the server doesn't know about what comes after the #
>
>
>Thanks
>
>Bill
>
>On 23 Jul 2014, at 13:52, Michael Smethurst <michael.smethurst@bbc.co.uk>
>wrote:
>
>> Hello
>>
>> (Pretty sure I've made this comment before so please forgive any signs of
>> premature senility)
>>
>> I think this may be an unfortunate side effect of the conflation of the
>> 303 ("I can't send that") pattern with the content negotiation ("what
>> flavour would you like") pattern
>>
>> Lots of linked data applications (like dbpedia) seem to couple the two
>> things together. So you have an "individual" uri which, when you attempt to
>> dereference does a 303 *and* conneg in one step to the "display" uri:
>> /resource > 303+conneg > /data
>> or
>> /resource > 303+conneg > /page
>>
>>
>> Many other linked data sites seem to have followed this pattern but it
>> does seem, to my eyes, broken
>>
>> At the BBC we have 3 flavours of uri. I'm not sure if these are the
>> appropriate / best labels but:
>> - the non-information resource uri. The uri that refers to the real world
>> physical / metaphysical thing
>> - the generic information resource uri that identifies the document but
>> not any specific representation of the document
>> - the representation uri (the html or json or rdf-xml etc)
>>
>> We tend to use hashes rather than slashes like
>> http://www.bbc.co.uk/programmes/b006mw1h#programme
>>
>>
>> But pretending we use slashes for a minute...
>>
>> If you requested:
>> http://www.bbc.co.uk/programmes/b006mw1h/thing
>>
>>
>> You'd get a 303 redirect to the generic document / information resource
>> uri:
>> http://www.bbc.co.uk/programmes/b006mw1h
>>
>>
>> Which would then conneg to the appropriate representation which would
>> still be served from:
>> http://www.bbc.co.uk/programmes/b006mw1h
>>
>> With a content location header of
>> http://www.bbc.co.uk/programmes/b006mw1h.rdf
>>
>> For example
>>
>> Whilst the rdf refers to the non-information resource uri when making
>> assertions about the "thing" this uri is not used elsewhere. All links in
>> the html point to the generic document uri not to the non-information
>> resource uri
>>
>> So crawlers like google just follow links from information resource to
>> information resource and never have to encounter 303s
>>
>> Picking up a conneg penalty for every request isn't without problems
>> (particularly given CDN serving) but picking up a 303 penalty for every
>> request would be madness and not something we'd ever have been able to
>> implement
>>
>> I do think the dbpedia conflation of 303 with conneg is an unhelpful
>> anti-pattern that people shouldn't be encouraged to follow. The conneg
>> part is just REST; "semantics" add the 303 onto that but they're not doing
>> the same thing
>>
>> Separating 303 from conneg still gives you "thing" vs document separation,
>> still maintains cool uris and doesn't kill your servers
>>
>> And we've never had a problem with seo
>>
>> Hth
>> michael
>>
>>
>>
>>
>> On 18/07/2014 16:52, "Michael Brunnbauer" <brunni@netestate.de> wrote:
>>
>>>
>>> Hello Mark,
>>>
>>> I cannot remember this important topic coming up earlier - which is a bit
>>> disturbing.
>>>
>>> The problem would be mitigated by people using the URI they see for
>>> linking.
>>>
>>> Why not use the HTML URLs in the HTML pages for internal page rank flow?
>>>
>>> How can URIs from sparql endpoints or OAI-PMH contribute to page rank?
>>>
>>> A real problem would be RDFa where href also sets the object of a triple.
>>>
>>> Regards,
>>>
>>> Michael Brunnbauer
>>>
>>> On Fri, Jul 18, 2014 at 10:05:17PM +1000, Mark Fallu wrote:
>>>> If the links we present to the outside world for harvesting eg. via sparql
>>>> endpoint, OAI-PMH or open social widget etc is the canonical "individual"
>>>> URI, clients will be able to get to the "display" url, but the google page
>>>> rank that would normally flow from these external links will not.
>>>
>>>
>>>
>>>>
>>>> The specification of a 303 redirect describes it as:
>>>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
>>>>
>>>>> "The response to the request can be found under a different URI and SHOULD
>>>>> be retrieved using a GET method on that resource. This method exists
>>>>> primarily to allow the output of a POST-activated script to redirect the
>>>>> user agent to a selected resource. *The new URI is not a substitute
>>>>> reference for the originally requested resource*. The 303 response MUST
>>>>> NOT be cached, but the response to the second (redirected) request might be
>>>>> cacheable.
>>>>>
>>>>> The different URI SHOULD be given by the Location field in the response.
>>>>> Unless the request method was HEAD, the entity of the response SHOULD
>>>>> contain a short hypertext note with a hyperlink to the new URI(s)."
>>>>
>>>>
>>>> Google correctly implements the specification and does not assign the page
>>>> rank of the "individual" URI to the "display" URL as it is "*not a
>>>> substitute reference for the originally requested resource".*
>>>>
>>>> The same is true of internal links, a high page rank home page will not
>>>> pass page rank on to "display" urls if the pathway to those urls is via
>>>> "individual" uri links.
>>>>
>>>> I am not sure what the solution is here as it seems the realms of SEO and
>>>> the conventions of the web they are built on are not a good fit for
>>>> semantic web best practice.
>>>>
>>>> The most minimal compromise I can think of is to move away from the use of
>>>> a 303 redirect to a redirect that conserves the flow of google page rank.
>>>>
>>>> - "302 Found" redirect is the recommended replacement for 303 for
>>>> clients that do not support HTTP 1.1 and it does allow a certain amount of
>>>> google page rank to flow.
>>>> - "301 Moved Permanently" is a poor fit for the Cool URI pattern, but
>>>> passes on the full page rank of the links.
>>>> - rewriting all URIs to the URL would also work, but would break the
>>>> coolURI pattern.
>>>>
>>>> The pragmatist in me feels that if we are going to make a change for the
>>>> purposes of SEO, it might as well be the one with best return, i.e. 301
>>>> redirect.
>>>>
>>>> Note: Indexing is not the problem here, content is indexed. The issue
>>>> relates to page rank not flowing through a 303 redirect.
>>>>
>>>> I have tested and can confirm that 303 redirects are an issue for a number
>>>> of reasons:
>>>>
>>>> - page rank does not flow through a 303 redirect
>>>> - page rank can not be assigned from a url to a uri with a rel=canonical
>>>> tag if URI does a 303 redirect (preventing aggregation of pagerank from
>>>> external links to URL)
>>>> - URI and URL are indexed separately
>>>> - rdfa schema.org representations of URIs do not translate to URL (ie.
>>>> representation described at URL A, talking about URI B, does not get
>>>> connected to representation described at URL B)
>>>> - url parameters are not passed by a 303 redirect.
>>>> - impact on functionality of google analytics tracking eg. traversing the
>>>> site is seen as a series of direct page visits.
>>>>
>>>> Essentially - as far as search engines are concerned - every URL and URI is
>>>> an island, with no connections between them. At best a URL can express a
>>>> rel=canonical back to its corresponding URI; no pagerank will flow through
>>>> links.
>>>>
>>>> Any guidance you can provide would be appreciated.
>>>>
>>>> --
>>>>
>>>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>>> | Mark Fallu
>>>> | Manager, Research Data (Acting)
>>>> | Office for Research
>>>> | Bray Centre (N54) 0.10E
>>>> | Griffith University, Nathan Campus
>>>> | Queensland 4111 AUSTRALIA
>>>> |
>>>> | E-mail: m.fallu@griffith.edu.au
>>>> | Mobile: 04177 69778
>>>> | Phone: +61 (07) 373 52069
>>>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>>
>>> --
>>> ++ Michael Brunnbauer
>>> ++ netEstate GmbH
>>> ++ Geisenhausener Straße 11a
>>> ++ 81379 München
>>> ++ Tel +49 89 32 19 77 80
>>> ++ Fax +49 89 32 19 77 89
>>> ++ E-Mail brunni@netestate.de
>>> ++ http://www.netestate.de/
>>> ++
>>> ++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
>>> ++ USt-IdNr. DE221033342
>>> ++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
>>> ++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
>>
>>
>
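
For illustration only, the "303 first, conneg second" flow described in the thread can be sketched in Python. This is a minimal sketch under stated assumptions, not the BBC's implementation: the /thing suffix follows the pretend slash example in the quoted message, while the media-type table, the text/html default, the Vary: Accept header and the function names respond, pick_media_type and request_path are all hypothetical. The last function also shows why a hash URI never needs a 303: the client drops the fragment before it sends the request, so the server only ever sees the document URI.

```python
from urllib.parse import urldefrag

# Hypothetical table of media types and the representation suffix for each.
REPRESENTATIONS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "application/json": ".json",
}
THING_SUFFIX = "/thing"  # illustrative slash-style "thing" URI from the thread


def pick_media_type(accept):
    """Tiny Accept parser: highest q-value wins, ties go to the first listed.

    Wildcards such as */* are not handled; anything unknown falls back to
    text/html, which is an assumed default.
    """
    best, best_q = None, 0.0
    for part in accept.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if media_type in REPRESENTATIONS and q > best_q:
            best, best_q = media_type, q
    return best or "text/html"


def respond(path, accept="text/html"):
    """Return (status, headers) for a GET request on the given path."""
    if path.endswith(THING_SUFFIX):
        # Step 1: the "thing" URI only redirects; no negotiation happens here.
        return 303, {"Location": path[: -len(THING_SUFFIX)]}
    # Step 2: the generic document URI negotiates and answers 200 directly,
    # naming the chosen representation in Content-Location.
    media_type = pick_media_type(accept)
    return 200, {
        "Content-Type": media_type,
        "Content-Location": path + REPRESENTATIONS[media_type],
        "Vary": "Accept",  # assumption: tells caches the response depends on Accept
    }


def request_path(uri):
    """Drop the fragment, as an HTTP client does before sending the request."""
    document_uri, _fragment = urldefrag(uri)
    return document_uri


if __name__ == "__main__":
    print(respond("/programmes/b006mw1h/thing"))
    print(respond("/programmes/b006mw1h", "application/rdf+xml"))
    print(request_path("http://www.bbc.co.uk/programmes/b006mw1h#programme"))
```

In this sketch only a request for the "thing" URI costs an extra round trip; links between documents go straight to the generic document URI and are answered with a 200.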
Received on Wednesday, 23 July 2014 13:51:31 UTC