- From: Tom Heath <tom.heath@talis.com>
- Date: Fri, 30 Mar 2012 17:22:01 +0100
- To: Jeni Tennison <jeni@jenitennison.com>
- Cc: public-lod community <public-lod@w3.org>
Hi Jeni,

On 27 March 2012 18:54, Jeni Tennison <jeni@jenitennison.com> wrote:
> Hi Tom,
>
> On 26 Mar 2012, at 17:13, Tom Heath wrote:
>> On 26 March 2012 16:47, Jeni Tennison <jeni@jenitennison.com> wrote:
>>> Tom,
>>>
>>> On 26 Mar 2012, at 16:05, Tom Heath wrote:
>>>> On 23 March 2012 15:35, Steve Harris <steve.harris@garlik.com> wrote:
>>>>> I'm sure many people are just deeply bored of this discussion.
>>>>
>>>> No offense intended to Jeni and others who are working hard on this,
>>>> but *amen*, with bells on!
>>>>
>>>> One of the things that bothers me most about the many years' worth of
>>>> httpRange-14 discussions (and the implication that HR14 is
>>>> partly/heavily/solely to blame for slowing adoption of Linked Data) is
>>>> the almost complete lack of hard data being used to inform the
>>>> discussions. For a community populated heavily with scientists I find
>>>> that pretty tragic.
>>>
>>> What hard data do you think would resolve (or if not resolve, at least
>>> move forward) the argument? Some people are contributing their own
>>> experience from building systems, but perhaps that's too anecdotal?
>>> Would a structured survey be helpful? Or do you think we might be able
>>> to pick up trends from the webdatacommons.org (or similar) data?
>>
>> A few things come to mind:
>>
>> 1) a rigorous assessment of how difficult people *really* find it to
>> understand distinctions such as "things vs documents about things".
>> I've heard many people claim that they've failed to explain this (or
>> similar) successfully to developers/adopters; my personal experience
>> is that everyone gets it, it's no big deal (and IRs/NIRs would
>> probably never enter into the discussion).
>
> How would we assess that though?

Give me some free time and enough motivation and I'd design an
experimental protocol to unpick this issue ;)

> My experience is in some way similar -- it's easy enough to explain
> that you can't get a Road or a Person when you ask for them on the web
> -- but when you move on to explaining how that means you need two URIs
> for most of the things that you really want to talk about, and exactly
> how you have to support those URIs, it starts getting much harder.

My original question was only about the distinction, but yes, some of
the details do get tricky. Then again, when was it ever otherwise with
technology?
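
(As an aside, for anyone dipping into this thread cold: the 'two URIs'
pattern being argued over looks roughly like this on the wire. The URIs
below are invented purely for illustration:

  # URI for the thing itself (the NIR). Answering 200 here would imply
  # the road is an information resource, so the usual deployment sends a
  # 303 to a document about it instead:
  curl -I http://example.org/id/road/A303
  #   HTTP/1.1 303 See Other
  #   Location: http://example.org/doc/road/A303

  # URI for the document about the thing (the IR), an ordinary 200:
  curl -I http://example.org/doc/road/A303
  #   HTTP/1.1 200 OK

Minting and correctly serving that first URI, for every thing you
describe, is the overhead the rest of this mail is about.)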

> The biggest indication to me that explaining the distinction is a
> problem is that neither OGP nor schema.org even attempts to go near it
> when explaining to people how to add semantic information to their web
> pages. The URIs that you use in the 'url' properties of those
> vocabularies are explained in terms of 'canonical URLs' for the thing
> that is being talked about. These are the kinds of graphs that millions
> of developers are building on, and those developers do not consider
> themselves linked data adopters and will not be going to linked data
> experts for training.

Yeah, this is a shame (the OGP/schema.org bit, and the fact they won't
be asking for LD training ;). IIRC Ian Davis proposed a schema-level
workaround for this around the time OGP was released. He made a good
case that it was technically a non-problem, though that doesn't explain
why the distinction isn't baked into the data model; same with
microformats.

>> 2) hard data about the 303 redirect penalty, from a consumer and
>> publisher side. Lots of claims get made about this but I've never seen
>> hard evidence of the cost of this; it may be trivial, we don't know in
>> any reliable way. I've been considering writing a paper on this for
>> the ISWC2012 Experiments and Evaluation track, but am short on spare
>> time. If anyone wants to join me please shout.
>
> I could offer you a data point from legislation.gov.uk if you like.

Woohoo! You've made my decade :D

> When someone requests the ToC for an item of legislation, they will
> usually hit our CDN and the result will come back extremely quickly. I
> just tried:
>
> curl --trace-time -v http://www.legislation.gov.uk/ukpga/1985/67/contents
>
> and it showed the result coming back in 59ms.
>
> When someone uses the identifier URI for the abstract concept of an
> item of legislation, there's no caching, so the request goes right back
> to the server. I just tried:
>
> curl --trace-time -v http://www.legislation.gov.uk/id/ukpga/1985/67
>
> and it showed the result coming back in 838ms. Of course the
> redirection goes to the ToC above, so in total it takes around 900ms to
> get back the data.

Brilliant. This is just the kind of analysis I'm talking about. Now we
need to do the same across a bunch of services, connection speeds,
locations, etc., and then compare the results with typical response
times across a representative sample of web sites. We use New Relic for
this kind of thing, and the results are rather illuminating: IIRC,
sub-second response times make you rather special. That's not to excuse
sluggish sites, just to put the numbers in context.

> So every time that we refer to an item of legislation through its
> generic identifier rather than a direct link to its ToC we are making
> the site seem about 15 times slower.

So now we're getting down to the crux of the question: does this outcome
really matter?! 15x almost nothing is still almost nothing. 15x slower
may offend our geek sensibilities, but it probably doesn't matter in
practice when the absolute numbers are so small. To give another
example, I just did some very ad-hoc tests on some URIs at a department
of a well-known UK university, and the results were rather revealing.
The total response time (get URI of NIR, receive 303 response, get URI
of IR, receive 200 OK and a representation back) was ~10s, of which
***over 90%*** was taken up by waiting for the page/IR about that NIR to
be generated (and that's with curl, not a browser, which may then pull
in a bunch of external dependencies). In this kind of situation there
are other, bigger issues to worry about than the <1s taken for a
303-based round trip!
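
If anyone wants to repeat that kind of test, or run it at scale, curl's
-w timing variables do most of the legwork. A crude sketch, reusing
Jeni's URIs (the numbers will obviously vary with location, time of day
and cache state):

  # Document URI, no redirect involved:
  for i in $(seq 1 10); do
    curl -s -o /dev/null -w '%{time_total}\n' \
      http://www.legislation.gov.uk/ukpga/1985/67/contents
  done

  # Identifier URI, following the 303 (-L), with the redirect leg broken out:
  for i in $(seq 1 10); do
    curl -s -o /dev/null -L \
      -w 'redirect: %{time_redirect}  total: %{time_total}\n' \
      http://www.legislation.gov.uk/id/ukpga/1985/67
  done

Run that from a handful of locations against a handful of services,
compare the results with end-to-end load times for ordinary web pages,
and we'd already have more data than these threads have ever been able
to draw on.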

> What's more, it puts load on our servers which doesn't happen when the
> data is cached; the more load, the slower the responses to other
> important things that are hard to cache, such as free-text searching.

Again, we need data here that takes the big picture into account. What
is the real impact of handling 303s compared to, say, hitting a database
and processing the results on the server side? Given the complexity of
web apps vs the complexity of the server-side code needed to issue 303s,
I wonder if the server load argument is a red herring. Only real data
will tell us.

> The consequence of course is that for practical reasons we design the
> site not to use generic identifiers for items of legislation unless we
> really can't avoid it, and add redirections where we should technically
> be using 404s. The impracticality of 303s has meant that we've had to
> compromise in other areas of the structure of the site.
>
> This is just one data point of course, and it's possible that if we'd
> fudged the handling of the generic identifiers (eg by not worrying
> about when they should return 404s or 300s and just always doing a
> regex mapping to a guess of an equivalent document URI) we would have
> better performance from them, but that would also have been a design
> compromise forced on us because of the impracticality of 303s. (In fact
> we made this precise design compromise for the data.gov.uk linked
> data.)
>
>> 3) hard data about occurrences of different patterns/anti-patterns; we
>> need something more concrete/comprehensive than the list in the change
>> proposal document.
>
> Yes, it would be good for someone to spend a long time on the entire
> webdatacommons.org corpus in a rigorous way, rather than a couple of
> hours in an evening testing URIs based on sifting through a couple of
> the files by eye.
>
>> 4) examples of cases where the use of anti-patterns has actually
>> caused real problems for people, and I don't mean problems in
>> principle; have planes fallen out of the sky, has anyone died? Does it
>> really matter from a consumption perspective? The answer to this is
>> probably not, which may indicate a larger problem of non-adoption.
>
> I don't personally have any examples of this; that doesn't mean they
> don't exist.
>
> Anyway, back to process. The TAG could try to pull the larger community
> into contributing evidence around these issues. There's already the
> AWWWSW Wiki at
>
> http://www.w3.org/wiki/AwwswHome
>
> which gathers together lots and lots and lots of writing on the topic,
> a portion of which is based on experience and studies of existing
> sites, and which we could structure around the questions above and add
> to.
>
> Having thought about it though, I'm not sure whether any of these would
> actually succeed in furthering the argument, particularly as much of
> the data is likely to be equivocal and not lead to a "given the
> evidence, we should do X" realisation.
>
> Taking the third one above for example, studying webdatacommons.org, I
> think people who back the status quo will point at the large amount of
> data that is being produced through blogging frameworks and the like
> and claim that this shows that publishers are generally getting it
> right. People who want to see a change will argue that the data that's
> there could have been published much more easily, and wonder how much
> more there would be, if httpRange-14 weren't so hard to comprehend.

You're probably right. Guess it's down to the TAG to 'adjudicate'... :|

> Basically, I fear that we're just likely to end up arguing over the
> interpretation of any evidence we collect, which would leave us no
> further on. I don't mean to nix the idea, I'm just a little
> pessimistic.

I think we've already managed to bring a bit more *clarity* to the
discussion - thank you :)

Tom.

--
Dr. Tom Heath
Senior Research Scientist
Talis Education Ltd.
W: http://www.talisaspire.com/
W: http://tomheath.com/
Received on Friday, 30 March 2012 16:22:30 UTC