Data storage

Theoretical breakthrough at MIT could increase data storage


New work on linear probing hash tables from MIT CSAIL could lead to more efficient data storage and retrieval in computers.

A trio of researchers that includes William Kuszmaul – a computer science doctoral student at MIT – has made a discovery that could lead to more efficient data storage and retrieval in computers.

The team’s findings relate to “linear probing hash tables,” which were introduced in 1954 and are among the oldest, simplest, and fastest data structures available today. Data structures provide a means of organizing and storing data in computers, with hash tables being one of the most commonly used approaches. In a linear probing hash table, the positions in which information can be stored lie along a linear array.

Suppose, for example, that a database is designed to store the Social Security numbers of 10,000 people, suggests Kuszmaul. “We take your Social Security number, x, and then compute the hash function of x, h(x), which gives you a random number between one and 10,000.” The next step is to take that random number, h(x), go to that position in the array, and put x, the Social Security number, in that place.

If there is already something occupying that place, says Kuszmaul, “you just walk forward to the next free position and put it there.” This is where the term “linear probing” comes from, as you keep moving forward linearly until you find an open spot. To later retrieve that Social Security number, x, you simply go to the designated place, h(x), and if it isn’t there, you move forward until you either find x or come to a free position and conclude that x is not in your database.
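The insert-and-lookup procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not code from the paper; the table size of 10,000 matches the example above, and `h` is a placeholder hash function:

```python
TABLE_SIZE = 10_000
table = [None] * TABLE_SIZE

def h(x):
    # Placeholder hash function: maps a key to a slot in [0, TABLE_SIZE).
    return hash(x) % TABLE_SIZE

def insert(x):
    i = h(x)
    while table[i] is not None:    # slot occupied: probe forward
        i = (i + 1) % TABLE_SIZE   # wrap around at the end of the array
    table[i] = x

def lookup(x):
    i = h(x)
    while table[i] is not None:
        if table[i] == x:
            return True            # found x
        i = (i + 1) % TABLE_SIZE
    return False                   # reached a free slot: x is not stored
```

The probe sequence for a lookup retraces exactly the path an insertion would have taken, which is why hitting an empty slot is proof the key is absent.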

There is a somewhat different protocol for deleting an item, such as a Social Security number. If you simply left a blank space in the hash table after deleting the information, it could cause confusion when you later tried to search for something else, as the vacant spot might wrongly suggest that the item you are looking for is nowhere in the database. To avoid that problem, Kuszmaul explains, “you can go to the spot where the element was deleted and put a little marker there called a ‘tombstone,’ which indicates there used to be an element here, but it’s gone now.”
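Extending the earlier sketch, deletion with a tombstone marker could look like the following (again an illustration of the general idea, not the authors’ code; `TOMBSTONE` is a sentinel of our own choosing):

```python
TABLE_SIZE = 10_000
TOMBSTONE = object()   # sentinel marking a deleted slot
table = [None] * TABLE_SIZE

def h(x):
    return hash(x) % TABLE_SIZE   # placeholder hash function

def insert(x):
    i = h(x)
    # A tombstone counts as free space for insertion and can be reused.
    while table[i] is not None and table[i] is not TOMBSTONE:
        i = (i + 1) % TABLE_SIZE
    table[i] = x

def delete(x):
    i = h(x)
    while table[i] is not None:
        if table[i] == x:
            table[i] = TOMBSTONE  # mark rather than blank, so later
            return                # searches keep probing past this slot
        i = (i + 1) % TABLE_SIZE

def lookup(x):
    i = h(x)
    while table[i] is not None:   # a tombstone does NOT stop the search
        if table[i] == x:
            return True
        i = (i + 1) % TABLE_SIZE
    return False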

This general procedure has been followed for more than half a century. But all along, almost everyone who has used linear probing hash tables has assumed that if you let them get too full, long stretches of occupied spots would run together to form “clusters.” As a result, the time needed to find a free spot would increase dramatically – quadratically, in fact – taking so long as to be impractical. Consequently, people have been trained to operate hash tables at low capacities – a practice that can be expensive, since it affects the amount of hardware a business has to buy and maintain.
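The quadratic figure comes from Knuth’s classical analysis of linear probing (not cited in the article, but the standard source of this estimate), which puts the expected number of probes for an unsuccessful search near ½(1 + 1/(1 − α)²) at load factor α. A quick calculation shows how sharply this grows as the table fills:

```python
def expected_probes_unsuccessful(alpha):
    # Knuth's classical estimate for linear probing at load factor alpha.
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

for alpha in (0.5, 0.9, 0.99):
    print(f"load {alpha:.2f}: ~{expected_probes_unsuccessful(alpha):.1f} probes")
```

At 50% capacity the expected cost is a few probes; at 99% it is in the thousands, which is why practitioners historically kept tables sparsely filled.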

But this decades-old principle, which has long militated against high load factors, has been completely upended by the work of Kuszmaul and his colleagues, Michael Bender of Stony Brook University and Bradley Kuszmaul of Google. They found that for applications where the number of insertions and deletions stays roughly the same – and the amount of data added is roughly equal to that removed – linear probing hash tables can operate at high storage capacities without sacrificing speed.

In addition, the team devised a new strategy, called “graveyard hashing,” which involves artificially increasing the number of tombstones placed in an array until they take up about half of the free slots. These tombstones then reserve spaces that can be used for future insertions.
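The flavor of that strategy can be conveyed with a toy sketch. To be clear, this is our own loose illustration of the idea of deliberately seeding tombstones; the actual graveyard hashing algorithm in the paper involves carefully scheduled rebuilds that we do not reproduce here:

```python
TOMBSTONE = object()   # sentinel marking a reserved/deleted slot

def redistribute_tombstones(table):
    # Loose sketch of the graveyard idea: clear out the old tombstones,
    # then convert roughly half of the free slots into fresh tombstones,
    # spread evenly, so future insertions find reserved gaps nearby.
    free = [i for i, slot in enumerate(table)
            if slot is None or slot is TOMBSTONE]
    for i in free:
        table[i] = None            # wipe old tombstones
    for k, i in enumerate(free):
        if k % 2 == 0:             # every other free slot becomes a tombstone
            table[i] = TOMBSTONE
    return table
```

Because insertions may reuse tombstone slots, each seeded tombstone acts as a pre-reserved landing spot that keeps runs of occupied slots from growing unchecked.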

This approach, which runs counter to what people have customarily done, says Kuszmaul, “can lead to optimal performance in linear probing hash tables.” Or, as he and his coauthors maintain in their paper, “the well-designed use of tombstones can completely change the … landscape of how linear probing behaves.”

Kuszmaul wrote up these findings with Bender and Bradley Kuszmaul in a paper posted earlier this year that will be presented in February at the Foundations of Computer Science (FOCS) Symposium in Boulder, Colorado.

Kuszmaul’s doctoral thesis supervisor, MIT computer science professor Charles E. Leiserson (who was not involved in this research), agrees with this assessment. “These new and surprising results overturn one of the oldest misconceptions about the behavior of hash tables,” says Leiserson. “The lessons will reverberate for years among theorists and practitioners.”

As for translating their results into practice, Kuszmaul notes, “there are many considerations that go into building a hash table. Although we have advanced the story a great deal from a theoretical standpoint, we’re only just beginning to explore the experimental side of things.”

Reference: “Linear Probing Revisited: Tombstones Mark the Death of Primary Clustering” by Michael A. Bender, Bradley C. Kuszmaul and William Kuszmaul, 2 July 2021, Computer Science > Data Structures and Algorithms.
arXiv: 2107.01250