I'd prefer to use the table from the larger corpus if folks accepted it,
and the base64 improvement you quote is a fairly big deal.
I have nothing to assure us that my input data was anywhere near the same
quality (which would be why I convinced Johnny to do this experiment!
Thanks Johnny!!!), and every reason to doubt it.
-=R
On Tue, Jun 10, 2014 at 2:20 PM, Martin Thomson <martin.thomson@gmail.com>
wrote:
> On 10 June 2014 13:40, Mike Bishop <Michael.Bishop@microsoft.com> wrote:
> > Given that the difference is slight (less than 3% better), we would
> > support accepting this as confirmation that the current code is good and
> > leave things as they are, unless broader deployment shows that the Huffman
> > table is further off-base for common headers than this indicates.
>
> The straight base64 tests I've run show a compound performance
> increase of ~7.5%, which would seem to be fairly significant.
>
> I don't know about the size of the data set that Roberto used
> originally, nor do I know how good the distribution of both sets of
> test data was (I might theorize that these are too WEIRD [1], or
> maybe even too biased toward Google's workloads).
>
> I'm somewhat inclined toward a change here, mainly due to the
> robustness of the data set, unless Roberto can produce evidence that
> his input data was comparable in quality. It took me less than five
> minutes to make a change in my code, so that's not the issue.
>
> [1]
> http://schott.blogs.nytimes.com/2010/07/14/weird/?_php=true&_type=blogs&_r=0