Counting with Less using Count-Min Sketch


A count-min sketch is a data structure for estimating how often items occur in a stream of data, using far less memory than keeping exact counts. By the end of this essay, you'll know how it works, why it never undercounts, and how to size one for your error tolerance.

No hard math is required to understand the principles.

Counting when everything fits in memory

Imagine you have a server and need to enforce rate limits: no user should make more than 1000 requests per minute.

The straightforward approach is a hash map keyed on user ID. For every request, increment the map value with that user ID key. At the end of each time window, reset the map.
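As a minimal Python sketch of that exact-count approach (the limit and function names are illustrative, not from a real system):

```python
from collections import defaultdict

LIMIT = 1000  # max requests per user per window

request_counts = defaultdict(int)  # user ID -> requests seen this window

def allow_request(user_id):
    """Record one request; return False once the user exceeds the limit."""
    request_counts[user_id] += 1
    return request_counts[user_id] <= LIMIT

def reset_window():
    """Call at the end of each time window (e.g. every minute)."""
    request_counts.clear()
```

Every distinct user costs one map entry, and that per-key cost is exactly what becomes a problem at scale.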

This works great with 1,000 users. It works okay with 100,000. At scale, though, the map grows too large, and it becomes a constraint quickly if there are millions of users or you need separate counters per endpoint.

Unique user keys     Approximate memory (varies)
10,000               ~500 KB
1,000,000            ~50 MB
100,000,000          ~5 GB

What if you don't need exact counts? "Approximately 500 requests" is almost as useful as "exactly 510 requests" for deciding whether to rate limit. That's where the count-min sketch comes in.

The core idea: a grid of counters

A count-min (CM) sketch is a 2D grid of counters with d rows and w columns, all initialized to zero. Each row has its own hash function. When an item arrives, you hash it once per row to get a column index, and increment the counter at that position. The diagrams below show a small dot to visualize which grid cells are incremented; this is just to help you follow the data flow, as the actual grid holds only numbers.

To estimate an item's count, you hash it again to find the same positions, read those d counters, and take the minimum. That's the "Min" in Count-Min.

And that's the whole data structure. Let's walk through it.
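The whole thing fits in a small Python class. This is a minimal sketch, not production code; salting blake2b with the row number is just one easy way to get d roughly independent hash functions:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.grid = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.grid[row][self._index(item, row)] += 1

    def estimate(self, item):
        # The true count never exceeds any of its counters, so take the min.
        return min(self.grid[row][self._index(item, row)]
                   for row in range(self.depth))
```

Both `add` and `estimate` cost d hash evaluations, no matter how many items the stream contains.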

This type of data structure is called a sketch because it is a compact summary, similar to a thumbnail image. And a sketch is a special class of probabilistic data structure, which introduces some randomness in exchange for efficiency. We often don't need a perfectly exact answer, and just a little inexactness can save huge amounts of computation and memory.

Adding an item

Say the request from user-42 arrives. We have 3 hash functions (one per row), and each one maps the key to a column index. We increment the counter at each of those positions.

I'm showing a hash function with an output range of 8 values for readability; in practice, the range would be in the hundreds or thousands to balance memory and accuracy.

Now a request from user-91 arrives. Its hash functions usually point to different columns, but collisions can occur.

Column 5 in row 2 now reads 2, even though each key has only appeared once. This is a collision. Two different items landed in the same cell, and their counts got mixed together.

This is why we use multiple rows. A collision in one row usually doesn't mean a collision in every row.

Querying an item

To estimate how many times user-42 has appeared, we hash it again and read the counters at the same positions:

The minimum of {1, 2, 1} is 1. That's correct.

The collision in row 2 inflated one counter, but the minimum across all rows gave us the right answer. This is the key insight: collisions can only add to a counter, never subtract. So the true count is always less than or equal to every counter, and the minimum is the closest estimate.

A count-min sketch never underestimates. It can only overestimate.

By taking the minimum across d independent hash functions, you're picking the row with the least noise. For a collision to inflate all d counters, a different item would need to collide with your item in every single row. That gets very unlikely as you add more columns and rows.
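You can check that intuition with a quick simulation. Here each row's hash is modeled as an independent uniform pick from w = 8 columns (matching the toy grid above), and we ask how often a second key lands on our key's column in every row:

```python
import random

WIDTH = 8
TRIALS = 100_000

def all_rows_collide(depth):
    """Fraction of trials where another key collides with ours in every row."""
    hits = 0
    for _ in range(TRIALS):
        # In each row, both keys independently get a uniform random column.
        if all(random.randrange(WIDTH) == random.randrange(WIDTH)
               for _ in range(depth)):
            hits += 1
    return hits / TRIALS
```

With one row, roughly 1 in 8 trials collide; with three rows, it drops to about 1 in 500. Real sketches use far wider rows, so the odds shrink further.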

Controlling the error

We can tune the two dimensions to control accuracy: more columns (a larger w) mean fewer collisions within each row, and more rows (a larger d) mean more independent estimates to take the minimum over.

Give it a try. Adjust the rows, columns, and number of users to see how the CM sketch compares to the true count. The simulated users have a mix of frequencies, so you can see the difference between very active power users and casual users. Or click to run the simulation at high speed.

Did you notice how the accuracy changed with the number of rows, columns, and users? Even with hundreds of users in a 3×8 grid, high-frequency users like user-42 are estimated within ~10% of the true count, but the low-frequency users are way off. This is a core insight: a CM sketch gives good estimates for frequent items, because their true counts dominate any noise from collisions.

Or, another way to think about it: the error is absolute, not relative, which makes the percentage error for low-frequency users look huge.

In practice, you choose w and d based on the error you can tolerate. There's a handy formula for this, but we can also see the effect directly. Try adjusting the width and depth below to run a 10,000-event simulation:

Want error within 0.1% of the total count, 99.9% of the time? That's w ≈ 2,718 and d = 7: a grid of ~19,000 counters, far smaller than millions of map entries.
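That handy formula comes from the original paper: for an additive error of at most ε times the total count, with probability at least 1 − δ, choose w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. As a small helper (the function name is mine):

```python
import math

def sketch_dimensions(epsilon, delta):
    """Width and depth so the estimate's error is at most
    epsilon * total_count with probability at least 1 - delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth

# 0.1% error, 99.9% of the time:
width, depth = sketch_dimensions(0.001, 0.001)  # width=2719, depth=7
```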

Conclusion

Back to our example: we can use a count-min sketch for rate limiting and save significant memory compared to tracking exact counts. This works especially well if the API has a few heavy users and a long tail of thousands (or millions) of infrequent users.

The count-min sketch is a clever solution for processing high-volume streams where exact counts are either impossible or not worth the memory. And once you start looking for sketches, you'll find them in many applications where a brute-force count isn't feasible.

And give the original paper a read: An Improved Data Stream Summary: The Count-Min Sketch and its Applications, by Cormode and Muthukrishnan. It offers even more ways to combine sketches for more advanced use cases, like range queries.