- From: Michael Smethurst <michael.smethurst@bbc.co.uk>
- Date: Wed, 23 Jul 2014 13:50:58 +0000
- To: Bill Roberts <bill@swirrl.com>
- CC: "public-lod@w3.org" <public-lod@w3.org>
Hi Bill
Bit of a difficult question to answer because the reality is probably
still quite disjointed. Various parts of bbc.co.uk:
- serve linked data
- store data as rdf (in a triple store)
- consume (to some extent) linked data
But nowhere are all those things true in one place. So /programmes
publishes linked data but the backend is a relational database, whereas
things like sport / olympics are stored as linked data but don't publish
it. So the 2 parts aren't really coupled
I do half remember lots of conversations about hashes v slashes for
/programmes and /music but the sites are designed to be quite granular
(one thing per uri; one uri per thing) so we weren't really dealing with
lots of things in a document
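
As an aside on the hash case: a client drops everything after the # before it makes the request, so the server only ever sees the document uri. A minimal Python sketch of that split, using the /programmes uri quoted further down the thread (the function name document_uri is made up for illustration):

```python
from urllib.parse import urldefrag


def document_uri(thing_uri):
    """Split a hash uri into the part sent to the server and the fragment.

    The fragment never leaves the client, so the server can only answer
    for the document uri.
    """
    url, fragment = urldefrag(thing_uri)
    return url, fragment


if __name__ == "__main__":
    print(document_uri("http://www.bbc.co.uk/programmes/b006mw1h#programme"))
```

So with hash uris the "thing" vs document split comes for free on the client side, with no redirect involved.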
The linked data platform (our triple store) does use # uris like:
On 23/07/2014 14:19, "Bill Roberts" <bill@swirrl.com> wrote:
>Hi Michael
>
>We've tended to use slash URIs where possible, because we have found it more
>convenient when doing URI dereferencing from a triple-store backed site -
>in which case we essentially do a DESCRIBE on the relevant URI.
>(So we do 303ing for non-information resources, though in practice in a
>lot of our applications, the great majority of content is statistical
>data, which we treat as information resources and respond with 200).
>
>How do you organise your data and generation of URI dereferencing
>responses with hash based URIs? I can see a variety of ways to do it,
>but I'd be interested to know what you have found most
>efficient/convenient at the BBC - essentially dealing with the fact that
>the server doesn't know about what comes after the #
>
>
>Thanks
>
>Bill
>
>On 23 Jul 2014, at 13:52, Michael Smethurst <michael.smethurst@bbc.co.uk>
>wrote:
>
>> Hello
>>
>> (Pretty sure I've made this comment before so please forgive any signs of
>> premature senility)
>>
>> I think this may be an unfortunate side effect of the conflation of the
>> 303 ("I can't send that") pattern with the content negotiation ("what
>> flavour would you like") pattern
>>
>> Lots of linked data applications (like dbpedia) seem to couple the two
>> things together. So you have an "individual" uri which, when you attempt
>> to dereference it, does a 303 *and* conneg in one step to the "display" uri:
>> /resource > 303+conneg > /data
>> or
>> /resource > 303+conneg > /page
>>
>>
>> Many other linked data sites seem to have followed this pattern but it
>> does seem, to my eyes, broken
>>
>> At the BBC we have 3 flavours of uri. I'm not sure if these are the
>> appropriate / best labels but:
>> - the non-information resource uri. The uri that refers to the real world
>> physical / metaphysical thing
>> - the generic information resource uri that identifies the document but
>> not any specific representation of the document
>> - the representation uri (the html or json or rdf-xml etc)
>>
>> We tend to use hashes rather than slashes like
>> http://www.bbc.co.uk/programmes/b006mw1h#programme
>>
>>
>> But pretending we use slashes for a minute...
>>
>> If you requested:
>> http://www.bbc.co.uk/programmes/b006mw1h/thing
>>
>>
>> You'd get a 303 redirect to the generic document / information resource
>> uri:
>> http://www.bbc.co.uk/programmes/b006mw1h
>>
>>
>> Which would then conneg to the appropriate representation which would
>> still be served from:
>> http://www.bbc.co.uk/programmes/b006mw1h
>>
>> With a content location header of
>> http://www.bbc.co.uk/programmes/b006mw1h.rdf
>>
>> For example
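
The two-step flow described above (303 from the "thing" uri, then conneg on the generic document uri) can be sketched roughly as follows. This is an illustration only, not the BBC implementation: the respond function, the MEDIA_TYPES table and the /thing suffix check are assumptions, and q-values in the Accept header are ignored for brevity.

```python
# Hypothetical mapping from media type to representation suffix.
MEDIA_TYPES = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "application/json": ".json",
}


def respond(path, accept="text/html"):
    """Return (status, headers), keeping the 303 and the conneg separate."""
    # Step 1: the "thing" uri only ever redirects to the generic document uri.
    # No content negotiation happens at this point.
    if path.endswith("/thing"):
        return 303, {"Location": path[: -len("/thing")]}

    # Step 2: the generic document uri negotiates a representation and
    # answers 200 itself, naming the chosen representation in Content-Location.
    for part in accept.split(","):
        media_type = part.split(";")[0].strip()
        if media_type == "*/*":
            media_type = "text/html"  # assumed default representation
        if media_type in MEDIA_TYPES:
            return 200, {
                "Content-Type": media_type,
                "Content-Location": path + MEDIA_TYPES[media_type],
                "Vary": "Accept",
            }
    return 406, {}


if __name__ == "__main__":
    print(respond("/programmes/b006mw1h/thing"))
    print(respond("/programmes/b006mw1h", "application/rdf+xml"))
```

Only requests for the "thing" uri pay the redirect; anything that links to the document uri gets a plain 200.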
>>
>> Whilst the rdf refers to the non-information resource uri when making
>> assertions about the "thing" this uri is not used elsewhere. All links in
>> the html point to the generic document uri not to the non-information
>> resource uri
>>
>> So crawlers like google just follow links from information resource to
>> information resource and never have to encounter 303s
>>
>> Picking up a conneg penalty for every request isn't without problems
>> (particularly given CDN serving) but picking up a 303 penalty for every
>> request would be madness and not something we'd ever have been able to
>> implement
>>
>> I do think the dbpedia conflation of 303 with conneg is an unhelpful
>> anti-pattern that people shouldn't be encouraged to follow. The conneg
>> part is just REST; "semantics" add the 303 onto that but they're not doing
>> the same thing
>>
>> Separating 303 from conneg still gives you "thing" vs document separation,
>> still maintains cool uris and doesn't kill your servers
>>
>> And we've never had a problem with seo
>>
>> Hth
>> michael
>>
>>
>>
>>
>> On 18/07/2014 16:52, "Michael Brunnbauer" <brunni@netestate.de> wrote:
>>
>>>
>>> Hello Mark,
>>>
>>> I cannot remember this important topic coming up earlier - which is a bit
>>> disturbing.
>>>
>>> The problem would be mitigated by people using the URI they see for
>>> linking.
>>>
>>> Why not use the HTML URLs in the HTML pages for internal page rank flow?
>>>
>>> How can URIs from sparql endpoints or OAI-PMH contribute to page rank?
>>>
>>> A real problem would be RDFa where href also sets the object of a triple.
>>>
>>> Regards,
>>>
>>> Michael Brunnbauer
>>>
>>> On Fri, Jul 18, 2014 at 10:05:17PM +1000, Mark Fallu wrote:
>>>> If the links we present to the outside world for harvesting (e.g. via a
>>>> sparql endpoint, OAI-PMH or open social widget) are the canonical
>>>> "individual" URIs, clients will be able to get to the "display" url, but
>>>> the google page rank that would normally flow from these external links
>>>> will not.
>>>
>>>
>>>
>>>>
>>>> The specification of a 303 redirect describes it as:
>>>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
>>>>
>>>>> "The response to the request can be found under a different URI and
>>>>> SHOULD be retrieved using a GET method on that resource. This method
>>>>> exists primarily to allow the output of a POST-activated script to
>>>>> redirect the user agent to a selected resource. *The new URI is not a
>>>>> substitute reference for the originally requested resource*. The 303
>>>>> response MUST NOT be cached, but the response to the second
>>>>> (redirected) request might be cacheable.
>>>>>
>>>>> The different URI SHOULD be given by the Location field in the
>>>>> response. Unless the request method was HEAD, the entity of the response
>>>>> SHOULD contain a short hypertext note with a hyperlink to the new URI(s)."
>>>>
>>>>
>>>> Google correctly implements the specification and does not assign the
>>>> page rank of the "individual" URI to the "display" URL as it is "*not a
>>>> substitute reference for the originally requested resource*".
>>>>
>>>> The same is true of internal links: a high page rank home page will not
>>>> pass page rank on to "display" urls if the pathway to those urls is via
>>>> "individual" uri links.
>>>>
>>>> I am not sure what the solution is here as it seems the realms of SEO and
>>>> the conventions of the web they are built on are not a good fit for
>>>> semantic web best practice.
>>>>
>>>> The most minimal compromise I can think of is to move away from the use
>>>> of a 303 redirect to a redirect that conserves the flow of google page
>>>> rank.
>>>>
>>>> - "302 Found" redirect is the recommended replacement for 303 for
>>>> clients that do not support HTTP 1.1 and it does allow a certain
>>>> amount of google page rank to flow.
>>>> - "301 Moved Permanently" is a poor fit for the Cool URI pattern, but
>>>> passes on the full page rank of the links.
>>>> - rewriting all URIs to the URL would also work, but would break the
>>>> Cool URI pattern.
>>>>
>>>> The pragmatist in me feels that if we are going to make a change for the
>>>> purposes of SEO, it might as well be the one with best return, i.e. a 301
>>>> redirect.
>>>>
>>>> Note: Indexing is not the problem here, content is indexed. The issue
>>>> relates to page rank not flowing through a 303 redirect.
>>>>
>>>> I have tested and can confirm that 303 redirects are an issue for a
>>>> number of reasons:
>>>>
>>>> - page rank does not flow through a 303 redirect
>>>> - page rank can not be assigned from a url to a uri with a rel=canonical
>>>> tag if the URI does a 303 redirect (preventing aggregation of pagerank
>>>> from external links to the URL)
>>>> - URI and URL are indexed separately
>>>> - rdfa schema.org representations of URIs do not translate to the URL
>>>> (i.e. a representation described at URL A, talking about URI B, does
>>>> not get connected to the representation described at URL B)
>>>> - url parameters are not passed by a 303 redirect.
>>>> - impact on functionality of google analytics tracking, e.g. traversing
>>>> the site is seen as a series of direct page visits.
>>>>
>>>> Essentially - as far as search engines are concerned - every URL and URI
>>>> is an island, with no connections between them. At best a URL can express
>>>> a rel=canonical back to its corresponding URI, but no pagerank will flow
>>>> through links.
>>>>
>>>> Any guidance you can provide would be appreciated.
>>>>
>>>> --
>>>>
>>>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>>> | Mark Fallu
>>>> | Manager, Research Data (Acting)
>>>> | Office for Research
>>>> | Bray Centre (N54) 0.10E
>>>> | Griffith University, Nathan Campus
>>>> | Queensland 4111 AUSTRALIA
>>>> |
>>>> | E-mail: m.fallu@griffith.edu.au
>>>> | Mobile: 04177 69778
>>>> | Phone: +61 (07) 373 52069
>>>> o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>>
>>> --
>>> ++ Michael Brunnbauer
>>> ++ netEstate GmbH
>>> ++ Geisenhausener Straße 11a
>>> ++ 81379 München
>>> ++ Tel +49 89 32 19 77 80
>>> ++ Fax +49 89 32 19 77 89
>>> ++ E-Mail brunni@netestate.de
>>> ++ http://www.netestate.de/
>>> ++
>>> ++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
>>> ++ USt-IdNr. DE221033342
>>> ++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
>>> ++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
>>
>>
>
Received on Wednesday, 23 July 2014 13:51:31 UTC