RE: New document on "Simple hit-metering for HTTP" from Paul Leach on 1996-08-06 (ietf-http-wg@w3.org from July to September 1996)

From: Paul Leach <paulle@microsoft.com>
Date: Tue, 6 Aug 1996 11:06:29 -0700
To: "'koen@win.tue.nl'" <koen@win.tue.nl>
Cc: "'mogul@pa.dec.com'" <mogul@pa.dec.com>, "'http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com'" <http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com>
Message-Id: <c=US%a=_%p=msft%l=RED-77-MSG-960806180629Z-1840@tide21.microsoft.com>
Quick poll for cache implementers (read below for context):

If you were going to implement hit-counting in some form akin to what's
in the draft, how amenable would you be to implementing both "use-count"
(counts unconditional GETs) and "get-count" (counts all accesses, even
if they don't cause a new copy of the page to be sent), instead of just
"use-count".

>----------
>From: 	koen@win.tue.nl[SMTP:koen@win.tue.nl]
>Subject: 	Re: New document on "Simple hit-metering for HTTP"
>
>
>I thought up another fix some hours after I sent my message: adopt the
>rule that a server which is only relaying, not generating, a response
>must not count it.  In the above example, the origin server itself
>would count the hit, not any upstream proxy cache.
>
>This way, when serving a request the origin server can immediately add
>1 to its total hit count.  With my earlier fix, it would have to wait
>for proxy 2 to report the hit in a future request.

Thanks. Sounds cleaner.
>
>>>3. A `hit' being an *un*conditional GET
>>>
>>>In the current (classic) meaning of the word,
>>>
>>>  1 hit-classic = 1 request on an origin server.
>>>
>>>Your draft defines a new kind of hit:
>>>
>>>  1 hit-new = 1 200/203/206 response returned to a user agent.
>>>
>>>Now, if I am an origin server which uses cache busting, and if most
>>>caches play by the rules, then for my server I will measure:
>>>
>>>  hit-new < hit-classic .
>>>
>>>Assuming that I get payed by the hit, I have absolutely no incentive
>>>to start measuring hit-news instead of hit-classics.  To stop using
>>>the cache-busting based hit-classics would be economic suicide.
>>

Actually, it just occurred to me -- you're definition of "hit-classic"
is bogus. We (at least I) do not know how sites count hits. All we know
is that when they bust the cache, they see all requests. They could
count both conditional and unconditional GETs as hits, or they could
count just unconditional ones. They could count HEADs, or not. Etc.

So I've sent out a poll to ask any content providers how they count
hits.

>>No, the  payment per hit-new would just be higher than for per
>>hit-classic.
>
>Your answer assumes a level of maturity in the payer which simply is
>not present yet.  This market is far too young, and far too dominated
>by customers who respond to the `your company will be obsolete if you
>do not get on the web *now*' hype.
>
>Look, if all hit count customers were sophisticated enough to pay more
>for `high quality' hits, advertising sites could not make more money
>by using cache busting.  So if your assumed level customer of maturity
>was indeed present, there would not be a cache busting problem to
>solve in the first place.

Huh? Until we deploy something in caches, "cache-busting" is the only
way they have to get demographic data. It seems to me that the people
collecting demographic data have been pretty clever in how they do it --
I've heard stories about creating funny URLS to track your identity
across sessions, and your trick for figuring out who the referring site
was, etc.
>
>I can't see the level of maturity you assume being present now
>anywhere except maybe in a few very big web advertisers.

Gotta start somewhere. Remember, a content provider that stops cache
busting will reduce load on its site and decrease response time to see
the content (and the ad). These will be powerful incentives.
>
>For the companies who responded to the `your company will be obsolete
>if you do not get a home page *now*' hype by getting some noname
>startup to create and maintain a homepage (usually containing at least
>200Kb of images) for them, this level of maturity will not appear for
>some time.
>
>I guess we could argue at length about present and predicted levels of
>maturity (both in the US and in Europe), but the bottom line is this:
>
> I feel very strongly that it would be a huge mistake to make a scheme
> to eliminate cache busting dependent on something, sufficient
> customer sophistication in this case, the existence of which is
> questionable at best.

First, this isn't a scheme to "eliminate" cache busting, it is only
trying to reduce it.

Second, really unsophisticated customers just won't use it -- they'll
continue cache-busting.

Moderately unsophisticated existing content providers who use it won't
even know that the kind of hits that are counted are a little different.

Within a year of deployment, more than half of the content providers
will be new. They also won't know that the kind of hits that are counted
are different.

>
>This dependence on customer sophistication can easily be removed by
>counting hits in a way which gives *higher* counts even to sites which
>now use cache-busting.  By generating a higher counts even for these
>sites, the scheme will work no matter how (un)sophisticated the
>customers are.
>
>Again, there is no need to count only one thing.  You could count both
>the `meaningful' type of hits you have defined, and my `inflated'
>hits.  That would even let you measure cache efficiency in a neat way.

At the cost of increased complexity in the cache implementation and
hence longer until it gets deployed.
So, I put the quick poll at the top of this message.
>
>>>4. Interaction with Vary
>>>
>>>I don't like the extra complexity and inefficiency introduced by the
>>>Vary counting rules in section 3. (See second-to-last sentence of
>>>Section 5.1.)
>>>
>>>I think the proposal would be better if the Vary special case were
>>>removed entirely.
>>
>>Don't you think that providers of multilingual content want to count the
>>hits of the French, German, Dutch, and English (etc.) version
>>separately?
>
>Sure!  There will also be providers who want the referer info, the
>e-mail addres, and a credit rating of the user for every hit on a
>page.
>
>The question here is where to strike the balance between
>simplicity/efficiency and the need for statistics.
>
>I feel that your vary scheme adds to much complexity.  Making it
>Etag-based would help, but it would still be a bit too complex for my
>taste.  I guess the WG as a whole will have to decide on where to
>strike the balance.

I don't feel you made a strong enough argument that keeping a counter
per cache entry is too complicated. Or even that, in order to be
(completely optionally) "cooperative", sending a HEAD (without waiting
for a response) for each cache entry before it is purged is too
complicated.

Now, maybe the spec is unclear, but that's all it requires above what's
needed to do Vary.
>
>Also, multilingual pages which use transparent content negotiation
>will not need a Vary-based hit counting scheme: you can quite
>naturally count hits on the variant URLs.
>
>[...]
>>>5. Overhead in proxy efficiency
>>>
>>>I'm wondering if the counting mechanisms in the draft won't cause an
>>>unacceptable overhead for high-performance cache implementations.  I
>>>think we definitely need the opinions of proxy cache implementers on
>>>this issue.
>>
>>The design is supposed to just require addition of a counter to a data
>>structure that needs to be around anyway, and to add a few bytes to a
>>message you needed to send anyway.
>
>I'm primarily worried about the filesystem read/write overhead needed
>to maintain the counters.  I would like to hear an implementer say
>that this is not a problem.

Mine say it isn't. They say that they have to update some info
associated with a cache entry on every hit, just to maintain LRU-type
info. So, keeping the counter is not a big deal.
>
>>  The alternative is that pages for
>>which demographic info is required aren't cached at all, which is
>>plainly MUCH worse.
>
>No, it is not necessarily worse.  It depends on how much cache-busting
>there is now, and on how much your proposal would eliminate.  We are
>comparing an overhead for *every* page served with an overhead due to
>extra *conditional* requests for *some* pages.

>This all boils down to
>the last comment in my previous message: how much cache busting is
>there anyway?

I'm not woried about it. If there are not enough customers demanding
this feature, it won't get implemented very much. But I can say that
there's enough demand that at least one vendor puts it very high on
their list. Of course, they could do something non-standard, but they
actually *really* don't want to.
Received on Tuesday, 6 August 1996 11:09:36 UTC