- Count-Min sketch
-
The Count-Min sketch (or CM sketch) is a probabilistic sub-linear space streaming algorithm which can be used to summarize a data stream in many different ways. The algorithm was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan[1].
Contents
Algorithm
Setup
The data structure is parameterized by the constants w and d which determine the time and space needs and the probability of error of the queries. The algorithm needs a two dimensional array with width w and depth d called here count. A series of d hash functions must be randomly drawn from a pairwise independent hash function family.
For later convenience we assign and .
Update
When a new value a arrives we update as follows: , . That is, for each row we take the corresponding hash function, apply it to the newly received value and add one to the column corresponding to the hash value.
Query
The array can then be used to estimate any of several different statistics at any point. If we want to estimate, for instance, the number of times a specific value i appeared so far in the stream we would compute min jcount[j,hj(i)] (this assume all added values are positive). This estimate has the guarantee that with probability 1 − δ.
Small modifications to the data structure can be used to sketch other different stream statistics.
External links
References
- ^ Cormode, Graham; S. Muthukrishnan (2004). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29–38.
Categories:- Hashing
- Probabilistic data structures
Wikimedia Foundation. 2010.