Re: totalItems vs void:triples from Kjetil Kjernsmo on 2014-10-14 (public-hydra@w3.org from October 2014)

From: Kjetil Kjernsmo <kjetil@kjernsmo.net>
Date: Tue, 14 Oct 2014 21:49:12 +0200
To: public-hydra@w3.org
Message-Id: <201410142149.12315.kjetil@kjernsmo.net>

On Tuesday 14. October 2014 17.18.57 Ruben Verborgh wrote:
> Reasonably accurate sounds fine indeed.

Actually, I'm -1 to that :-)

> But that of course depends on how "reasonably accurate" hydra:totalItems
> is defined. As far as it is implemented now, the best possible estimate
> is used for void:triples, and I don't see any reason to do otherwise.
> Is that best possible estimate good enough for hydra:totalItems?

I'm thinking in terms of statistics, and I also note that we do not have any 
way to express uncertainty.

"Best possible" is a very inaccurate term. :-) You could envision a system 
with a sampling algorithm, and then, you set your sample size based on the 
confidence level. If you want a high confidence level, then the cost is 
higher, because you need a larger sample. And "best possible" means you are 
quite free to choose a confidence level based on the cost you, as the server 
owner, are prepared to pay for estimating the number of triples. The 
confidence level needs to be pretty high, but as long as it is not specified, 
I think you'd be fine choosing something good enough, as you can say "well, 
it is the best I found I could defend paying for".
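
To make the cost side of that trade-off concrete, here is a minimal sketch. 
It assumes simple random sampling and the usual normal approximation for 
estimating a proportion (say, the fraction of triples matching a pattern). 
The function name and the margin parameter are hypothetical and only for 
illustration; nothing here comes from Hydra or VoID.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence: float, margin: float) -> int:
    """Worst-case sample size for estimating a proportion.

    Assumes simple random sampling and the normal approximation,
    with p = 0.5 as the most pessimistic variance (p * (1 - p) = 0.25).
    `confidence` is the two-sided confidence level, e.g. 0.95.
    `margin` is the accepted absolute error on the proportion, e.g. 0.01.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if margin <= 0.0:
        raise ValueError("margin must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * 0.25 / (margin * margin))


if __name__ == "__main__":
    # A higher confidence level requires a larger, more expensive sample.
    for level in (0.90, 0.95, 0.99):
        print(level, required_sample_size(level, margin=0.01))
```

With a margin of one percentage point, 95% confidence needs 9604 samples 
under these assumptions, and the number only grows as the confidence level 
goes up. That is the knob a server owner is turning when claiming "best 
possible".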

Exact, OTOH, means you have to count them all. Period. It may be outdated 
the next second, true, but you have to count them. IMHO. It isn't the 
changing server state that should be the distinction, it is whether you are 
allowed to use sampling to derive the triple count.

Then, the influence on the cost model and thus query execution may be quite 
substantial. Actually, I think we need to start working on cost models where 
uncertainty plays a role. Anybody want to join me in such an effort?
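
As a rough sketch of what I mean, and purely as an assumption on my part: 
suppose each count came with a standard error (zero for an exact count). A 
planner could then order triple patterns by a pessimistic upper bound 
instead of the point estimate. The class, the function and the example 
numbers below are all hypothetical, and the upper-bound policy is just one 
of many possible choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountEstimate:
    """A triple count with its uncertainty (hypothetical structure)."""
    value: float
    std_error: float = 0.0  # 0.0 means the count is exact

    def upper_bound(self, z: float = 1.96) -> float:
        # Pessimistic count: point estimate plus z standard errors.
        return self.value + z * self.std_error


def pick_first_pattern(estimates: dict[str, CountEstimate], z: float = 1.96) -> str:
    """Return the pattern with the smallest pessimistic count."""
    return min(estimates, key=lambda name: estimates[name].upper_bound(z))


if __name__ == "__main__":
    estimates = {
        "?s a :Person": CountEstimate(1000.0),            # exact count
        "?s :knows ?o": CountEstimate(800.0, 300.0),      # sampled estimate
    }
    # Ignoring uncertainty (z = 0) prefers the sampled pattern.
    print(pick_first_pattern(estimates, z=0.0))
    # Accounting for uncertainty prefers the exactly counted pattern.
    print(pick_first_pattern(estimates))
```

The point is only that the plan can flip once the uncertainty is taken into 
account, which the point estimate alone cannot tell you.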

Cheers,

Kjetil

Received on Tuesday, 14 October 2014 19:49:48 UTC