System Design
Consistent Hashing
A distributed hashing scheme that operates independently of the number of servers, allowing for minimal remapping of keys when servers are added or removed.
The Problem with Traditional Hashing
In distributed systems, we often need to determine which server should store or process a particular piece of data. The traditional approach uses a simple modulo operation:
server_index = hash(key) % number_of_servers
However, this approach has a major drawback: when the number of servers changes (a server is added or removed), almost all keys get reassigned to different servers. This causes massive data migration and can lead to performance issues.
Traditional Hashing Demo
See how keys get redistributed when servers are added or removed using traditional hashing.
Traditional hashing visualization loading...
Drag keys to rearrange the visualization
Consistent Hashing Algorithm
Consistent hashing solves this redistribution problem by arranging both servers and keys on a hash ring (or circle). Each key is assigned to the nearest server going clockwise around the ring.
With this approach, when a server is added or removed, only the keys that were assigned to that server and some keys from the next server need to be redistributed. This significantly reduces the amount of data that needs to be moved.
To improve distribution, each physical server is represented by multiple points on the ring, called virtual nodes. This ensures a more even distribution of keys among servers.
Consistent Hashing Demo
Explore how consistent hashing works by adding/removing servers and keys. Toggle virtual nodes to see their effect on distribution.
Consistent hashing visualization loading...
Drag keys to rearrange the visualization
Real-world Applications
Consistent hashing is used in many distributed systems where data needs to be efficiently partitioned across multiple servers:
- Distributed Caches - Memcached and Redis Cluster use consistent hashing to distribute cached data across multiple nodes.
- Distributed Databases - Systems like Apache Cassandra and Amazon DynamoDB use consistent hashing for data partitioning.
- Content Delivery Networks (CDNs) - Akamai and Cloudflare use consistent hashing to determine which edge server should handle a request.
- Load Balancers - HAProxy and other load balancers can use consistent hashing to maintain user session affinity.
The key advantage in all these systems is minimizing data movement when the system topology changes, allowing for seamless scaling and high availability even during server additions or failures.
Key Takeaways
-
Problem with Traditional Hashing: Using
hash(key) % n
causes massive data redistribution when servers are added or removed. - Consistent Hashing Solution: Maps both keys and servers to a ring, assigning each key to the nearest server clockwise.
- Minimal Redistribution: When a server is added/removed, only keys from that server and some from the next server need to be moved.
- Virtual Nodes: Multiple points on the ring for each physical server ensure more balanced data distribution.
- Scalability: Enables horizontal scaling with minimal disruption, allowing systems to grow or shrink dynamically.