Re: Data Driven Discussions about httpRange-14, etc (was: Re: Change Proposal for HttpRange-14)

I think this may sound like a stuck record, but here goes…
It would be nice, Tom, but…
Yet again, the discussion around this issue is entirely focussed on
a) aspects of logic, philosophy and the like;
b) aspects of the problems of publishing;
c) network issues.

So where is the consumption aspect?
The measure by which we decide if all this engineering is "fit for purpose".
Design all the protocols you want, but if you are not examining the right thing, it is not very helpful (to put it mildly).
David Booth (sorry David!) said we need to deal with the engineering before addressing how we can educate people to understand it.
And therefore, I would say, before addressing whether that is even possible.
This is not a recipe for building stuff that people can use.
In fact it is not engineering at all.

What is the definition of "fit for purpose" that you propose to use to define your protocols?
My definition requires that the result is suitable input for building real applications that ordinary people can use, informed by multiple and even unbounded sites from the Web of Data. Clearly, at a decent scale as well.
This, I think, is the vision of Linked Data for many people.
(There is also an Agent point of view, but I think we are miles from that at the moment.)
If an argument cannot be made to support a point of view that has this as the end game, then I lose interest in the argument.

But let us say you have a definition of "fit for purpose", and have designed your protocol to assess these questions against it.

Tom, you said to Michael:
> From: Tom Heath <tom.heath@talis.com>
> Subject: Re: Change Proposal for HttpRange-14
> 
> Of all people you guys at the BBC have great anecdotes, and clearly


I found this really sad.
To my knowledge, Michael has never consumed much in the way of other people's Linked Data.
He has a fantastic wealth of knowledge about using Linked Data technologies to do Integration, which is a huge market for us.
Various people keep asking for examples of Linked Data-consuming end-user applications that might serve as interesting data points to inform the discussion.
But I have yet to see a satisfactory response.
And the sad truth is that I am beginning to think that, after all these years, I (RKBExplorer.com) may still be the only one who has actually built anything that consumes data from across the Linked Data Cloud, uses it to enhance the knowledge, and then delivers it to ordinary people (OK, we might fail, but we try).

This means that you, or others, just don't have enough data points to gather evidence for the assessment you want to do.

But please do try - I would love to see detailed analysis of fit for purpose - and I know how much time and effort that takes!
And yes, I am happy to provide you with any data I can.

Best
Hugh

On 30 Mar 2012, at 17:22, Tom Heath wrote:

> Hi Jeni,
> 
> On 27 March 2012 18:54, Jeni Tennison <jeni@jenitennison.com> wrote:
>> Hi Tom,
>> 
>> On 26 Mar 2012, at 17:13, Tom Heath wrote:
>>> On 26 March 2012 16:47, Jeni Tennison <jeni@jenitennison.com> wrote:
>>>> Tom,
>>>> 
>>>> On 26 Mar 2012, at 16:05, Tom Heath wrote:
>>>>> On 23 March 2012 15:35, Steve Harris <steve.harris@garlik.com> wrote:
>>>>>> I'm sure many people are just deeply bored of this discussion.
>>>>> 
>>>>> No offense intended to Jeni and others who are working hard on this,
>>>>> but *amen*, with bells on!
>>>>> 
>>>>> One of the things that bothers me most about the many years worth of
>>>>> httpRange-14 discussions (and the implications that HR14 is
>>>>> partly/heavily/solely to blame for slowing adoption of Linked Data) is
>>>>> the almost complete lack of hard data being used to inform the
>>>>> discussions. For a community populated heavily with scientists I find
>>>>> that pretty tragic.
>>>> 
>>>> 
>>>> What hard data do you think would resolve (or if not resolve, at least move forward) the argument? Some people are contributing their own experience from building systems, but perhaps that's too anecdotal? Would a structured survey be helpful? Or do you think we might be able to pick up trends from the webdatacommons.org (or similar) data?
>>> 
>>> A few things come to mind:
>>> 
>>> 1) a rigorous assessment of how difficult people *really* find it to
>>> understand distinctions such as "things vs documents about things".
>>> I've heard many people claim that they've failed to explain this (or
>>> similar) successfully to developers/adopters; my personal experience
>>> is that everyone gets it, it's no big deal (and IRs/NIRs would
>>> probably never enter into the discussion).
> 
>> How would we assess that though?
> 
> Give me some free time and enough motivation and I'd design an
> experimental protocol to unpick this issue ;)
> 
>> My experience is in some way similar -- it's easy enough to explain that you can't get a Road or a Person when you ask for them on the web -- but when you move on to then explaining how that means you need two URIs for most of the things that you really want to talk about, and exactly how you have to support those URIs, it starts getting much harder.
> 
> My original question was only about the distinction, but yes, some of
> the details do get tricky, but when was it ever otherwise with
> technology?
> 
>> The biggest indication to me that explaining the distinction is a problem is that neither OGP nor schema.org even attempts to go near it when explaining to people how to add semantic information to their web pages. The URIs that you use in the 'url' properties of those vocabularies are explained in terms of 'canonical URLs' for the thing that is being talked about. These are the kinds of graphs that millions of developers are building on, and those developers do not consider themselves linked data adopters and will not be going to linked data experts for training.
> 
> Yeah, this is a shame (the OGP/schema.org bit, and the fact they won't
> be asking for LD training ;). IIRC Ian Davis proposed a schema-level
> workaround for this around the time OGP was released. He had a good
> case that it was a non-problem technically, but no, that doesn't
> explain why the distinction is not baked into the data model; same
> with microformats.
> 
>>> 2) hard data about the 303 redirect penalty, from a consumer and
>>> publisher side. Lots of claims get made about this but I've never seen
>>> hard evidence of the cost of this; it may be trivial, we don't know in
>>> any reliable way. I've been considering writing a paper on this for
>>> the ISWC2012 Experiments and Evaluation track, but am short on spare
>>> time. If anyone wants to join me please shout.
>> 
>> I could offer you a data point from legislation.gov.uk if you like.
> 
> Woohoo! You've made my decade :D
> 
>> When someone requests the ToC for an item of
>> legislation, they will usually hit our CDN and the result will come back extremely quickly. I just tried:
>> 
>> curl --trace-time -v http://www.legislation.gov.uk/ukpga/1985/67/contents
>> 
>> and it showed the result coming back in 59ms.
>> 
>> When someone uses the identifier URI for the abstract concept of an item of legislation, there's no caching so the request goes right back to the server. I just tried:
>> 
>> curl --trace-time -v http://www.legislation.gov.uk/id/ukpga/1985/67
>> 
>> and it showed the result coming back in 838ms; of course, the redirection goes to the ToC above, so in total it takes around 900ms to get back the data.
> 
> Brilliant. This is just the kind of analysis I'm talking about. Now we
> need to do similar across a bunch of services, connection speeds,
> locations, etc., and then compare it to typical response times across
> a representative sample of web sites. We use New Relic for this kind
> of thing, and the results are rather illuminating. <1ms response times
> make you rather special IIRC. That's not to excuse sluggish sites,
> but just to put this in context.
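
[Hugh: for anyone who wants to repeat Jeni's measurement across a bunch of services, something as crude as the shell's own timer over curl will do as a starting point. A rough sketch, reusing Jeni's example URIs - clearly not a proper methodology, just a way to start collecting numbers:

  # whole 303 round trip: the 303 response plus the ToC it redirects to
  time curl -s -o /dev/null -L http://www.legislation.gov.uk/id/ukpga/1985/67

  # direct request to the ToC, for comparison
  time curl -s -o /dev/null http://www.legislation.gov.uk/ukpga/1985/67/contents

curl's -w option (with counters such as %{time_redirect} and %{time_total}) gives finer-grained numbers if needed. Run over a list of URIs, from a few network locations and at a few times of day, that would give exactly the comparative data Tom is asking for.]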
> 
>> So every time that we refer to an item of legislation through its generic identifier rather than a direct link to its ToC we are making the site seem about 15 times slower.
> 
> So now we're getting down to the crux of the question: does this
> outcome really matter?! 15x almost nothing is still almost nothing!
> 15x slower may offend our geek sensibilities, but probably doesn't
> matter in practice when the absolute numbers are so small.
> 
> To give another example, I just did some very ad-hoc tests on some
> URIs at a department of a well-known UK university, and the results
> were rather revealing! The total response time (get URI of NIR,
> receive 303 response, get URI of IR, receive 200 OK and resource
> representation back) took ~10s, of which ***over 90%*** was taken up
> by waiting for the page/IR about that NIR to be generated! (and that's
> with curl, not a browser, which may then pull in a bunch of external
> dependencies). In this kind of situation I think there are other,
> bigger issues to worry about than the <1s taken for a 303-based
> roundtrip!!
> 
>> What's more, it puts load on our servers which doesn't happen when the data is cached; the more load, the slower the responses to other important things that are hard to cache, such as free-text searching.
> 
> Again, we need data here that takes into account the big picture. What
> is the real impact of handling 303s compared to, say, hitting a
> database and processing the results on the server side? Given the
> complexity of web apps vs the complexity of server-side code to do
> 303s I wonder if this server load argument is a red herring. Only real
> data will tell us.
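
[Hugh: one crude way to get that data would be to compare the raw throughput of the 303 endpoint with that of the document it points to, e.g. with ApacheBench (ab does not follow redirects, so hitting the /id/ URI measures just the cost of producing the 303). The URIs below are only placeholders for the pattern - this sort of thing should be run against your own test instance, not somebody else's live site:

  # cost of serving the 303 itself
  ab -n 500 -c 10 http://www.legislation.gov.uk/id/ukpga/1985/67

  # cost of serving the document it redirects to
  ab -n 500 -c 10 http://www.legislation.gov.uk/ukpga/1985/67/contents

If the first endpoint manages far more requests per second than the second, then the server-side cost of the 303 really is marginal compared with generating the page.]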
> 
>> The consequence of course is that for practical reasons we design the site not to use generic identifiers for items of legislation unless we really can't avoid it and add redirections where we should technically be using 404s. The impracticality of 303s has meant that we've had to compromise in other areas of the structure of the site.
>> 
>> This is just one data point of course, and it's possible that if we'd fudged the handling of the generic identifiers
>> (eg by not worrying about when they should return 404s or 300s and just always doing a regex mapping to a
>> guess of an equivalent document URI) we would have better performance from them, but that would also have
>> been a design compromise forced on us because of the impracticality of 303s. (In fact we made this precise
>> design compromise for the data.gov.uk linked data.)
>> 
>>> 3) hard data about occurrences of different patterns/anti-patterns; we
>>> need something more concrete/comprehensive than the list in the change
>>> proposal document.
>> 
>> Yes, it would be good for someone to spend a long time on the entire webdatacommons.org corpus in a rigorous way, rather than a couple of hours in an evening testing URIs based on sifting through a couple of the files by eye.
>> 
>>> 4) examples of cases where the use of anti-patterns has actually
>>> caused real problems for people, and I don't mean problems in
>>> principle; have planes fallen out of the sky, has anyone died? Does it
>>> really matter from a consumption perspective? The answer to this is
>>> probably not, which may indicate a larger problem of non-adoption.
>> 
>> I don't personally have any examples of this; that doesn't mean they don't exist.
>> 
>> 
>> Anyway, back to process. The TAG could try to pull the larger community into contributing evidence around these issues. There's already the AWWWSW Wiki at
>> 
>>  http://www.w3.org/wiki/AwwswHome
>> 
>> which gathers together lots and lots and lots of writing on the topic, a portion of which is based on experience and studies of existing sites, which we could structure around the questions above and add to.
>> 
>> Having thought about it though, I'm not sure whether any of these would actually succeed in furthering the argument, particularly as much of the data is likely to be equivocal and not lead to a "given the evidence, we should do X" realisation.
>> 
>> Taking the third one above for example, studying webdatacommons.org, I think people who back the status quo will point at the large amount of data that is being produced through blogging frameworks and the like and claim that this shows that publishers are generally getting it right. People who want to see a change will argue that the data that's there could have been published much more easily, and wonder how much more there would be, if httpRange-14 weren't so hard to comprehend.
> 
> You're probably right. Guess it's down to the TAG to 'adjudicate'... :|
> 
>> Basically, I fear that we're just likely to end up arguing over the interpretation of any evidence we collect, which would leave us no further on. I don't mean to nix the idea, I'm just a little pessimistic.
> 
> I think we've already managed to bring a bit more clarity* to the
> discussion - thank you :)
> 
> Tom.
> 
> -- 
> Dr. Tom Heath
> Senior Research Scientist
> Talis Education Ltd.
> W: http://www.talisaspire.com/
> W: http://tomheath.com/
> 

-- 
Hugh Glaser,  
             Web and Internet Science
             Electronics and Computer Science,
             University of Southampton,
             Southampton SO17 1BJ
Work: +44 23 8059 3670, Fax: +44 23 8059 3045
Mobile: +44 75 9533 4155 , Home: +44 23 8061 5652
http://www.ecs.soton.ac.uk/~hg/
