The Undergraduate Who Broke a 40 Year Assumption About Hash Tables

The Undergraduate Who Broke a 40 Year Assumption About Hash Tables and what this means for Memory

👁257views

Andrew Krapivin, while tinkering with hash tables as a side project at Rutgers University in 2023, discovered a lookup method faster than computer scientists had considered theoretically possible, invalidating a conjecture that had stood for roughly four decades. His breakthrough emerged not from a formal research agenda but from an undergraduate's curiosity while independently reading his professor's published work.

CloudScale AI SEO - Article Summary
  • 1.
    What it is
    Hash tables with open addressing and no reordering were assumed to have a hard Θ(x) lookup cost ceiling since Andrew Yao published his conjecture in 1985. This article explains exactly how an undergraduate disproved that 40 year old assumption by building a faster hash table while tinkering with an unrelated memory pointer problem.
  • 2.
    Why it matters
    Understanding why Yao's conjecture failed matters because primary clustering was treated as a structural inevitability in open addressed hash tables, not a solvable problem, meaning this result changes what engineers can expect from one of the most widely used data structures in computing.
  • 3.
    Key takeaway
    The Θ(x) cost ceiling for open addressing without reordering was not a proven theorem but an unverified conjecture, and a Rutgers undergraduate invalidated it as a side effect of trying to shrink memory pointers.
~21 min read

1. A Side Project That Ended a Conjecture

In 2023, Andrew Krapivin was an undergraduate at Rutgers University reading a paper on his own time, for fun, that his professor Martín Farach-Colton had co-authored two years earlier. The paper was called “Tiny Pointers”, and it described a new kind of memory reference object that could replace conventional pointers with far less space overhead. Krapivin thought he could make the pointers even smaller, but to do that he needed a more efficient way to organise the data those pointers would point to, so he reached for a hash table and started tinkering.

What he discovered in the tinkering was not a better pointer. It was a faster hash table than any computer scientist had believed was achievable, and in building it he had invalidated a conjecture that Andrew Yao had published in 1985 and the field had treated as settled for the forty years since. The paper, “Optimal Bounds for Open Addressing Without Reordering”, co-authored with Farach-Colton and William Kuszmaul of Carnegie Mellon, was posted in January 2025. Cornell Tech researcher Alex Conway described it directly: “Hash tables are among the oldest data structures we have. And they’re still one of the most efficient ways to store data” — which makes the fact that a core assumption about their limits turned out to be wrong all the more striking.

What changed

Before (Yao, 1985)After (Krapivin et al., 2025)
Greedy open addressingΘ(x) worst case — considered optimalO((log x)²) — proved optimal
Non-greedy open addressingNot seriously pursuedO(1) average lookup, asymptotically constant
Production statusDeployed in every major runtimeResearch result; engineering work required to ship

x denotes how close the table is to 100% full: x = 100 means 99% full, x = 1,000 means 99.9% full.

2. The Car Park That Explains Everything

The video that brought this story to a wider audience uses a car park analogy to explain hash tables, and it is the right analogy, not just as a simplification for non-technical audiences but because the formal mathematics of open-addressed hash tables and the mathematics of parking functions are literally the same problem. When computer scientists first studied linear probing in the 1960s, they formalised the collision resolution process as drivers arriving at a car park, each with a preferred space. If the preferred space is taken, the driver moves forward until finding an empty one. The combinatorial structure that results is called a parking function, and it is precisely the structure of an open-addressed hash table with linear probing.

So: imagine a long car park with one thousand numbered spaces. Each car has a preferred space, determined by a function applied to its registration plate — the hash function. Most of the time, especially when the car park is nearly empty, cars find their preferred space free and park immediately. This is the constant-time lookup that makes hash tables attractive. The problem begins when the car park fills up. When x is 100, the car park is 99 percent full: 990 spaces occupied, 10 empty, scattered somewhere along the row. A new arrival whose preferred space is taken cannot jump to an empty spot at random — it must drive forward from its preferred position and check each space in sequence until it finds a gap. At 99 percent occupancy, it might need to check ten, twenty, fifty spaces before finding one. When x is 1,000, the car park is 99.9 percent full and the expected search for that last empty space becomes proportionally catastrophic. Kuszmaul put this concretely: “If your hash table is 99% full, it makes sense that you would have to look at around 100 different positions to find a free slot.” This is the Θ(x) relationship that Yao formalised in 1985 as a hard ceiling, which is to say: in a nearly full car park, the time to find a space scales with how full the car park is, and no parking rule can avoid this.

The key constraint — and the one that makes the problem hard — is that once a car is parked, it stays. You cannot reorganise the car park between arrivals. You cannot shuffle existing cars to create a more efficient arrangement. Every space is occupied by whoever took it first, and every new arrival has to navigate around the pattern that all previous arrivals have already created. This is the no-reordering constraint of open addressing without reordering, and it is the condition under which Yao claimed no algorithm could beat Θ(x). The conjecture shaped the direction of the field for four decades.

3. Clustering: When the Car Park Turns Against You

The mechanism that makes the Θ(x) ceiling genuinely painful, and the reason Yao’s result was taken so seriously, is primary clustering. It is not simply that a full car park takes longer to navigate. It is that a greedy car park actively makes itself harder to navigate over time, compounding the cost of every earlier poor decision.

Here is the dynamic. When a car arrives and its preferred space is taken, it drives forward and takes the first available space. This is the greedy policy: take the nearest empty slot. The problem is that this car is now sitting in a space that might have been the preferred space for the next arriving car. That car, finding its preferred space occupied, must drive past both the original car and the newcomer, and when it parks, it blocks the preferred space for a car arriving after it. Each greedy decision extends the contiguous run of occupied spaces by one. A run of ten occupied spaces does not merely slow down the ten cars whose preferred spaces fall within it: it acts as a magnet that captures every subsequent arrival whose preferred space falls anywhere in the run, forcing them all to probe past it and land beyond the far end, making the run eleven spaces, then twelve, then longer still. The longer the run, the more likely future arrivals are to hash into it, and the faster it grows. Krapivin described the consequence directly: “Most data elements are inserted really, really quickly, but some take an incredible amount of time, because essentially if the table fills up, you have no idea where to look.” The cars that take an incredible amount of time are not unlucky. They are the predictable downstream cost of every prior greedy decision.

This is why Yao concluded that Θ(x) was structural. Clustering is not a flaw in any particular insertion strategy: it is the natural consequence of greedy insertion into a fixed array when you cannot move things around. Uniform probing, which randomises the probe sequence rather than checking adjacent spaces in order, was his answer to managing clustering rather than preventing it — it reduces the chance of a probe sequence walking directly into a growing run — but even this could not break the Θ(x) relationship at high load. The car park eventually becomes so dense that no routing strategy can navigate it efficiently. Yao’s conjecture said this was the irreducible floor.

4. Elastic Hashing: Choosing Your Space More Carefully

The critical detail in Krapivin’s story is that he built his new hash table without knowing Yao’s conjecture existed. He was not trying to beat the Θ(x) ceiling. He was trying to make tiny pointers work, and the hash table he designed for that purpose happened to escape the ceiling entirely. “I did this without knowing about Yao’s conjecture,” he said. The conventional wisdom did not constrain him because he had not encountered it.

What his design does is change the parking rule. Instead of taking the first available space after your preferred one — the greedy rule that reliably grows clusters — each new arrival in Krapivin’s scheme drives further down the car park than it strictly needs to, scanning ahead for a space whose occupation would not extend an existing run or create a new bottleneck for future arrivals. Having identified that better space, the car snaps back to park there. The probe reaches further than a greedy driver would bother going, and the final position is chosen for its effect on future traffic rather than its proximity to the entrance. The team named this approach elastic hashing precisely because the probe range stretches elastically and then contracts to the chosen position.

The practical result is that clusters never fully form. Because every incoming car is parked with an awareness of the structural cost its placement imposes on subsequent arrivals, the self-reinforcing accumulation that drives the Θ(x) cost in greedy schemes never takes hold. The car park at 99 percent occupancy still has the same fraction of empty spaces, but they are distributed in a way that keeps probe sequences short rather than concentrated into a few long, growing runs that dominate the search time for everything that follows. Under the proved model, average query time remains asymptotically constant even at very high load factors, which is the result that Yao’s conjecture said was impossible.

The performance results across the two hash table variants tell the full story. For greedy open addressing without reordering, the team proved optimal worst-case query and insertion times of O((log x)²) — still much better than the Θ(x) Yao conjectured, and now proved to be the tightest achievable bound. For non-greedy open addressing, the result is more consequential still: constant average query time regardless of how full the table is, because elastic placement prevents the clustering that would otherwise force query time to scale with density. The paper provides matching upper and lower bounds throughout, meaning these are not just improvements on previously known results but the best results that are mathematically possible. As Sepehr Assadi of the University of Waterloo put it: “It’s not just that they disproved Yao’s conjecture, they also found the best possible answer to his question. We could have gone another 40 years before we knew the right answer.”

5. Tiny Pointers: The Bookkeeping That Makes It Work

Non-greedy placement sounds straightforward in the car park analogy, but it creates an immediate practical problem. If each arriving car is going to scan ahead, evaluate potential spaces, and choose a position based on its effect on future arrivals rather than its own proximity to the entrance, the car park needs a way to record which regions are well-distributed and which are beginning to cluster. Each car needs to carry a note, or read a notice board, that tells it where the good spaces still are. In a real car park this is trivially managed. In a hash table running in memory on a production system, the metadata required to support non-greedy placement is a non-trivial cost, and for decades it was assumed to be prohibitively expensive.

This is where the Tiny Pointers paper becomes the necessary precondition for the hash table result, not just the paper Krapivin happened to be reading when he had his insight.

To understand why, you need to understand what a pointer actually is and why it is expensive. In modern 64-bit computing, every piece of data lives somewhere in memory, and the address of that location requires 64 bits to express because the total addressable memory space of a 64-bit system spans 2⁶⁴ possible locations. A pointer is just that address: a 64-bit number that says “the thing you are looking for is here.” As Farach-Colton described it: “Pointers started out at eight bits. Then, as computers got bigger, pointers grew too: 16 bits were needed, then we went to 32 bits and 64 bits.” Every time the total address space expanded to accommodate more memory, pointers grew with it, because a pointer that cannot represent the full address space is a pointer that cannot find everything.

The problem is that 64 bits is 8 bytes, and when you are building a hash table that needs to store a small piece of navigational metadata per element — a note about where a better space exists, a reference back to a sub-region of the table — paying 8 bytes per pointer for that metadata can easily double or triple the memory footprint of the entire structure. For a hash table whose core value proposition is efficient memory use, that overhead is not a minor inconvenience. It makes non-greedy approaches impractical at scale.

Tiny pointers solve this by exploiting a constraint that ordinary pointers ignore: locality. If you know in advance that the thing you are pointing to lies within a bounded region of memory rather than anywhere in the full 64-bit address space, you do not need 64 bits to address it. You only need enough bits to distinguish between the possible locations within that region. If a sub-array in a hash table has 256 slots, a pointer that stays within that sub-array needs only 8 bits, not 64. The tiny pointer is not an absolute address into the full memory space — it is a relative offset within a small, known context. The formal result is that if a tiny pointer references an item in an array filled to load factor 1 − δ, the optimal tiny pointer size is O(log log log n + log(1/δ)) bits, which for typical parameters is a few bits rather than 64. The price paid is a small constant overhead in lookup time and a dereferencing table that translates the relative offset back into a full address when needed, but these costs are bounded and manageable.

In the car park analogy: instead of every car carrying a complete city map so it can navigate to any space anywhere in the city, each car carries a small slip of paper that records its position relative to the nearest signpost in its local zone. The slip is far smaller than a full map, because it only needs to describe a bounded local area rather than the entire city. When the car needs to navigate further, it reads the signpost to reorient itself, paying a small cost for the translation. The slip of paper is a tiny pointer. The signpost and the translation layer together are the dereferencing table.

What Krapivin recognised when he came back to the Tiny Pointers paper in 2023 was that this bounded-locality property was exactly what a non-greedy hash table needed. Elastic hashing divides the hash table into sub-arrays and manages placement within each sub-array using tiny pointers that encode local position information at a fraction of the cost of full 64-bit pointers. The structural metadata that makes non-greedy placement possible — the bookkeeping about which regions are becoming dense and which still have good spaces — can be maintained using a handful of bits per element rather than eight bytes per element. The clever parking choice and the cheap bookkeeping enable each other in a way that neither paper, read in isolation, would have suggested. Krapivin read them together for fun, and the combination is what produced the result.

6. Why Greedy Algorithms Lie to You About Cost

There is a principle embedded in this result that extends well beyond hash tables and is worth sitting with for anyone who builds or operates complex systems. Greedy algorithms are appealing precisely because they are simple to reason about: at each step, make the locally optimal choice, do not model the future, just take the nearest available slot, assign the next available resource, approve the next item in the queue. The cost accounting looks clean in the moment because it genuinely is clean in the moment. The problem is that local optimisation is not neutral with respect to future state. Every greedy decision reshapes the landscape that subsequent decisions have to navigate, and the degradation it introduces is typically invisible on whatever dashboard you are using to measure present performance.

Krapivin’s non-greedy approach inverts this by spending additional compute at insertion time to evaluate future position, refusing locally appealing placements that would transfer cost into later queries, and probing further to find a slot that preserves future navigability. The upfront overhead is real and measurable. The payoff is that average query time remains constant regardless of how full the table gets, because the structural degradation that greedy insertion was silently accumulating never accumulates in the first place. The lesson maps directly onto systems design, architecture decisions, technical debt, and organisational process. Greedy optimisation of the present is very often a deferred tax on the future, one that accrues interest quietly and then presents its entire invoice at once when the table gets full enough that the slack runs out.

7. So What, and When?

It is worth being direct about what this result does and does not change, and on what timescale.

The honest answer is that nobody running production infrastructure will see any difference from this paper in the near term. Farach-Colton said as much himself: most hash tables of this theoretical type rarely make the transition from research to deployment, and the engineering distance between a proved bound and a shipped implementation is substantial. Turning elastic hashing into a production data structure that outperforms mature existing implementations like Robin Hood hashing or Swiss tables — both of which have been tuned over years against real-world cache behaviour, memory allocator characteristics, and CPU branch prediction patterns — requires work that has not yet been done. There is no elastic hashing library to drop into your codebase today. The paper does not come with a pull request for Redis.

The timeline for benefit breaks into three distinct layers, operating at different speeds.

The first layer is theoretical, and it is already complete. Researchers who were working on hash table problems under the assumption that Θ(x) was the floor now know it is not. That changes which problems are worth pursuing and which approaches are worth investing in. For the subset of the field working on data structure lower bounds and cache-efficient algorithms, the map changed in January 2025 and they are already navigating by the new one. This has no direct operational impact, but it reshapes the research agenda that will eventually produce the next generation of deployable data structures.

The second layer is implementation, and it is likely five to ten years away from serious production consideration for most organisations. The path from a proved bound to a reference implementation, then to a benchmarked comparison against existing production systems, then to integration into a widely used library or runtime, then to adoption at scale, is long. For comparison, the theoretical foundations of Robin Hood hashing were established in the 1980s and it took until the 2010s to see widespread production adoption. Swiss tables, developed at Google and open-sourced through the Abseil library in 2017, took several years before becoming the default hash map in significant codebases. Elastic hashing has none of that engineering work done yet. For most engineering teams the practical question is not “when do we get this?” but “who is doing the implementation work that makes this shippable?”

The third layer is the one that arrives soonest and is already here: the removal of a false constraint from the design space of anyone building hash-table-adjacent systems from scratch. This matters most for teams working on new storage engines, new cache designs, or new embedded databases where the choice of hash table implementation is still open. For the past four decades, the theoretical ceiling on open-addressed hash tables was accepted as a design constraint, which meant that constant query time at very high load factors was simply not considered a viable target. That constraint is now gone. A team designing a new system today can set constant-time lookup at near-full occupancy as a design goal and know that it is theoretically achievable, even if the reference implementation for getting there does not yet exist.

For most organisations, the practical implication is: watch the implementation work, not the paper. The paper is the proof that the destination exists. The engineering work that makes it reachable is still ahead.

8. The Person and the Path

Andrew Krapivin grew up in Fair Lawn, New Jersey, in a family where the path from mathematical talent to serious research was already mapped: his older brother Viktor is also a Goldwater Scholar and a Rutgers alumnus. He traces his own interest in mathematics to fourth grade, when his mother enrolled him in Art of Problem Solving, a platform built for mathematically gifted students that teaches through competition-style problems rather than procedural drill. He moved through its curriculum quickly, and while still in school began programming robots to navigate mazes and play simplified soccer — an early indication that theory and implementation were never separate interests for him.

He enrolled at Rutgers University in 2020 as a double major in mathematics and computer science, taking graduate-level courses from his first year. It was there that he joined Farach-Colton’s research group, began working on data structure problems with VMware researchers, and in the fall of 2021 first encountered the Tiny Pointers paper — setting it aside without pursuing it, a detail that matters only in retrospect. He came back to it in 2023, reading it on his own time, and that rereading is where the hash table discovery began.

Farach-Colton’s description of what made Krapivin unusual is worth quoting in full. What struck him most was not the hash table result itself, but an earlier episode: “This is very rare this early on in the research career of a student. Normally, the student is given a problem by the adviser. But Andrew read one of my papers and found an open problem that I hadn’t even considered. And he came up with a great solution.” He went on to describe Krapivin as the best undergraduate he had worked with in thirty-two years at Rutgers. He received the Goldwater Scholarship in 2023 and the Churchill Scholarship in 2024 — the first Rutgers undergraduate in a decade to do so — which funded a Master of Philosophy in Advanced Computer Science at Cambridge. He is now a PhD student at Carnegie Mellon University.

The interesting thing is not that an undergraduate broke the assumption. It is that he was solving a different problem entirely, which meant he inherited none of the field’s constraints. He crossed domains — tiny pointers to hash tables to lower bounds — without the prior art of each domain telling him what was impossible. The conjecture was not wrong because no one was clever enough to find the error. It was wrong because almost no one was looking, and the person who found it was not looking either.

9. What This Means for the Infrastructure Layer

Hash tables are not an academic abstraction. They sit inside every database, every cache, every runtime, every compiler, every operating system. Redis is substantially a hash table. Python dictionaries are hash tables. The JVM, the V8 engine, and every modern key-value store rest on hash table fundamentals, which means that the theoretical limits of hash table performance are, at some remove, the theoretical limits of a large portion of the world’s production infrastructure. When a result improves those understood limits, the practical implication is not that you need to rewrite anything today. The implication is that the next generation of storage engines, caches, and lookup systems will be designed with a fundamentally different understanding of what performance guarantees are achievable, particularly at high load factors where the old theory said degradation was unavoidable.

For organisations that run infrastructure near capacity and care about lookup latency under high load, the elastic hashing result opens a design space that did not previously exist on the map. Under the proved model, constant average query time at very high load factors is not a marginal improvement over a gracefully degrading alternative. At scale, operating near the top of the load curve, it represents a qualitatively different category of guarantee.

But perhaps the more durable lesson here is about exploration versus optimisation. Most engineering organisations are structured for optimisation: they identify known bottlenecks, instrument them, and reduce them. The entire apparatus of observability, runbooks, performance reviews, and capacity planning is built for this mode. What almost no organisation does systematically is revisit the assumptions underneath the systems they are optimising. Krapivin was not optimising a hash table. He was solving a different problem — making tiny pointers smaller — and crossed into hash table territory only because he needed a better one as an intermediate step. That domain crossing, with no prior art telling him what was impossible in the new domain, is what made the result possible. The conjecture had been sitting there for four decades because the people most likely to examine it were also the people who had absorbed it as background knowledge. The person who finally examined it had not.


References