Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure

Usage

  • Step: Clicking the step button will generate a new random value and insert it into the DV sketch;
  • Play: Clicking the play button will start the simulation running. Random values will be generated and inserted into the sketch with a delay of 'n' ms between each value. The play button will change to the "Stop" button;
  • Reset: Clicking the reset button will stop a running simulation and reset the values to their defaults;
  • Auto Scale: If checked then the step size (the number of values that are generated for each step) will be automatically changed to based on the number of values that have already been added to the sketch. This provides for a smoother visualization across the span of values;
  • Step Size: The number of values that will be generated and inserted into the sketch for each step;
  • Delay: The number of miliseconds between generating new values used when the simulation is played. This can be reduced to make the simulation play faster or increased to make it play slower;

Simulation Elements

  • Value: The last randomly generated value (between 1 and 16,777,215) that was inserted into the sketch. At most 65,335 values will be generated;
  • Hash Value: The Jenkins hash of the value;
  • Bit String: The binary representation of the lower 24bits of the hash value;
  • Register Index: The lowest b bits used to determine the index of the register whose value is to be updated. (m=2b)
  • Register Value: The next 2r - 1 bits (where r is the register width) from which the register value -- the run of 0s -- is determined. Please note that the register value is the number of leading zeros +1 as per the HLL paper;
  • Register Values: All m=64 register values. Index 0 is in the upper-left and index 63 is in the lower-right. The register that is to be updated is highlighted in red. Please note that the value is only updated if the value from the bit string is greater than the current value;
  • m: The number of buckets or register values (m=2b)
  • Actual Cardinality: The cardinality of the set of all inserted values (i.e. the count of distinct values);
  • Algorithm: The distinct value (DV) algorithm;
  • Estimated Cardinality: The cardinality of the set estimated by the algorithm;
  • % Error: The difference between the estimated and actual cardinality over the actual cardinality;
  • Line Graph: Plots the actual cardinality on the x-axis and the % Error on the y-axis for the each algorithm. The initial ball-and-bins estimation of HLL is explicitly shown. The blue area is the theoretical HLL error bounds (1.04 / sqrt(m));
  • Histogram: The distribution of the theoretical and observed register values along with the arithmetic, geometric and harmonic means of the observed distribution. Please note that the harmonic mean used in HLL is 1/2M whereas this is simply 1/M and is present for illustrative purposes only;

Footnotes

  • Refer to the corresponding blog post for more information about HyperLogLog.
  • Certain versions of Safari will incorrectly compute the harmonic mean in the distribution above. This is being investigated.
  • We have open sourced this simulation so that you are free to go in and play with its guts to learn more about these amazing algorithms. You can increase the register width, number of registers, etc.

Have fun and happy counting!

Rob Grzywinski