Sketch of the Day: PCSA — Probabilistic Counting with Stochastic Averaging

Please note that m = 32 (meaning that there are 32 bitmaps) but only the first 8 are shown. This was done to keep the vertical size of the simulation to sane bounds while providing "interesting" error bounds on the algorithm.

Usage

  • Step: Clicking the step button will generate a new random value and insert it into the DV sketch;
  • Play: Clicking the play button will start the simulation running. Random values will be generated and inserted into the sketch with a delay of 'n' ms between each value. The play button will change to the "Stop" button;
  • Reset: Clicking the reset button will stop a running simulation and reset the values to their defaults;
  • Auto Scale: If checked then the step size (the number of values that are generated for each step) will be automatically changed to based on the number of values that have already been added to the sketch. This provides for a smoother visualization across the span of values;
  • Step Size: The number of values that will be generated and inserted into the sketch for each step;
  • Delay: The number of miliseconds between generating new values used when the simulation is played. This can be reduced to make the simulation play faster or increased to make it play slower;

Simulation Elements

  • Value: The last randomly generated value (between 1 and 16,777,215) that was inserted into the sketch. At most 50k values will be generated;
  • Hash Value: The Jenkins hash of the value;
  • Bit String: The binary representation of the lower 24bits of the hash value;
  • Index: The lowest b bits used to determine the index of the bitmap whose value is to be updated. (m=2b)
  • Value: The next 2L - 1 bits (where L is the bitmap width) from which the bitmap index -- the run of 0s -- is determined;
  • Bitmaps: The first 8 bitmaps. Index 0 is on the right. 1s are highlighted in red to make the fringe easier to see;
  • m: The number of buckets or bitmaps (m=2b)
  • Actual Cardinality: The cardinality of the set of all inserted values (i.e. the count of distinct values);
  • Algorithm: The distinct value (DV) algorithm;
  • Estimated Cardinality: The cardinality of the set estimated by the algorithm (see below);
  • % Error: The difference between the estimated and actual cardinality over the actual cardinality;
  • Line Graph: Plots the actual cardinality on the x-axis and the % Error on the y-axis for the each algorithm. The algorithms are described below. The blue area is the theoretical PCSA error bounds (0.78 / sqrt(m));
  • Histogram: The distribution of the PCSA and HLL register values along with the arithmetic, geometric and harmonic means of the PCSA distribution. Please note that the harmonic mean used in HLL is 1/2M whereas this is simply 1/M and is present for illustrative purposes only;

Algorithms

  • Ball-Bin: "Linear Counting" from the HLL paper (m * log(m/V) where m is the number of bitmaps and V is the number of those bitmaps that are contain all zeros);
  • PCSA-Cor: PCSA with small cardinality correction from "Near-Optimal Compression of Probabilistic Counting Sketches for Networking Applications" (m / rho * (2Z/m - 2-k*Z/m) where m is the number of bitmaps, rho is 0.77351, Z is the indicator value and k is 1.57);
  • PCSA: The PCSA estimator;
  • HLL: The HyperLogLog estimator;

Footnotes

  • Refer to the corresponding blog post for more information about Probabilistic Counting with Stochastic Averaging.
  • Certain versions of Safari will incorrectly compute the harmonic mean in the distribution above. This is being investigated.
  • We have open sourced this simulation so that you are free to go in and play with its guts to learn more about these amazing algorithms. You can increase the bitmap width (L), number of bitmaps (via b), etc.

Have fun and happy counting!

Rob Grzywinski