Wednesday, November 24, 2010

Consistent Hashing

Hashing is a common method of mapping a key to a location. This is useful for many things, but most relevantly here, it can be used to map keys to servers with great effect. Simple hashing uses a `key mod N` algorithm, where N is the number of slots or servers. This distributes keys evenly across the N slots. The problem with this algorithm is that adding or removing a slot or server requires a complete rehash of all the keys, and for a huge data set it is simply not feasible to rehash and redistribute every key.
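The cost of that rehash is easy to see with a short sketch: under plain modulo placement, growing a cluster from 4 to 5 servers moves the vast majority of keys (the key set below is hypothetical; integer keys are used so the result is deterministic):

```python
def server_for(key, n_servers):
    # plain "key mod N" placement
    return key % n_servers

keys = range(1000)
before = {k: server_for(k, 4) for k in keys}
after = {k: server_for(k, 5) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved")  # prints "800 of 1000 keys moved"
```

Four out of five keys change servers, even though only one server was added.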

Consistent Hashing is a specific implementation of hashing that is well suited to many of today's web-scale load-balancing problems. It is used particularly because it solves the main shortcoming of the typical "key mod N" method of distributing keys across a series of servers: servers can be added or removed without significantly upsetting the distribution of keys, and without rehashing all keys to accommodate the change in the number of servers. When using consistent hashing, only about K/N keys need to be remapped on average, where K is the number of keys and N the number of servers.

Consistent Hashing is implemented by mapping both keys and servers onto the edge of a circle. Each server is hashed to an angle on the circle, and each key is hashed onto the same circle. A key belongs to the first server encountered when moving clockwise from the key's position; equivalently, each server owns all keys between the previous server and itself. If a server becomes unavailable, the keys it held are simply remapped to the next server clockwise, so only the keys of the failed server are affected. Similarly, when a server is added, it takes over the keys between the previous server and itself. Keys that should now be associated with the new server but were stored under the old mapping become unavailable until they are written again.
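The clockwise lookup described above can be sketched in Python (the `Ring` class and its method names are illustrative, not a real library API; production implementations also place each server at many "virtual node" positions to even out the distribution):

```python
import hashlib
from bisect import bisect_right

def _position(value: str) -> int:
    # Hash a key or server name to a point on the circle.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers=()):
        self._servers = {}   # circle position -> server name
        self._points = []    # sorted circle positions
        for s in servers:
            self.add(s)

    def add(self, server):
        self._servers[_position(server)] = server
        self._points = sorted(self._servers)

    def remove(self, server):
        del self._servers[_position(server)]
        self._points = sorted(self._servers)

    def lookup(self, key):
        # First server clockwise from the key's position,
        # wrapping around the circle if necessary.
        pos = _position(key)
        i = bisect_right(self._points, pos) % len(self._points)
        return self._servers[self._points[i]]
```

Removing a server then remaps only the keys that server owned; every other key keeps its assignment.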

As shown in Figure 1, keys 1, 2, 3 and 4 map to slots A, B and C. To find which slot a key goes in, we move clockwise around the circle from the key until we reach a slot. Here key 1 goes into slot A, key 2 into slot B, key 3 into slot C, and key 4 into slot A again. If C is removed, key 3 would then belong to slot A.

If another slot D is added, as shown in Figure 2, it will take keys 3 and 4, leaving slot A with only key 1.

Fig 1: Keys distribution with Consistent hashing
Fig 2: Keys re-distribution with Consistent hashing

Tuesday, November 23, 2010

Database Consistency

A database system is said to be in a consistent state if it satisfies all known integrity constraints. Integrity is defined as `the accuracy or correctness of data in the database`. There are two well-known types of integrity constraints:

1. Entity integrity constraints (for example, no primary key value can be null)
2. Referential integrity constraints (a field in one table that refers to another table must refer to a value that exists in the referenced table).
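Both constraint types can be demonstrated with SQLite from Python (the `dept`/`emp` schema is hypothetical; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.execute("CREATE TABLE dept (id TEXT PRIMARY KEY NOT NULL, name TEXT)")
conn.execute(
    "CREATE TABLE emp (id INTEGER PRIMARY KEY, "
    "dept_id TEXT REFERENCES dept(id))"
)
conn.execute("INSERT INTO dept VALUES ('D1', 'Engineering')")

# 1. Entity integrity: a NULL primary key is rejected.
try:
    conn.execute("INSERT INTO dept VALUES (NULL, 'Sales')")
    entity_violation_caught = False
except sqlite3.IntegrityError:
    entity_violation_caught = True

# 2. Referential integrity: dept_id must point at an existing dept row.
try:
    conn.execute("INSERT INTO emp VALUES (1, 'NO-SUCH-DEPT')")
    referential_violation_caught = False
except sqlite3.IntegrityError:
    referential_violation_caught = True

print(entity_violation_caught, referential_violation_caught)  # True True
```

In both cases the database refuses the write rather than enter an inconsistent state.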

A database is in a correct state if it is both consistent and accurately reflects the true state of affairs in the real world. A database in a correct state is always consistent, but a consistent database is not necessarily correct.

Several different types of consistency exist:

Strong consistency means that all processes connected to the database always see the same version of a value: a committed value is instantly visible to any read operation until it is changed by another write operation.

Eventual consistency is weaker and does not guarantee that each process sees the same version of a data item. Even the process that writes a value may read an old version during the inconsistency window. This behavior is usually caused by replication of the data across different nodes.

Read-your-own-writes consistency means a distributed database ensures that a process can always read its own writes. For this, the database has to route that process to nodes that already store the data it has written.

A subtype of read-your-writes consistency is session consistency, which only guarantees that a process can read its own writes within a single session. If the process starts a new session, it might see an older value during the inconsistency window.

Another variant of eventual consistency is monotonic read consistency, which assures that once a newly written value has been read, no subsequent read of that data item will return an older value. This type of consistency allows the database to replicate newly written data before clients are allowed to see the new version.
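Monotonic reads can be sketched as a toy simulation (all names here are illustrative, not a real database API): each replica holds a version number alongside its value, and the client remembers the highest version it has seen, reading only from replicas that are at least that fresh.

```python
import random

class Replica:
    def __init__(self, version=0, value=None):
        self.version = version
        self.value = value

class Client:
    def __init__(self, replicas):
        self.replicas = replicas
        self.seen = 0  # highest version this client has observed so far

    def read(self):
        # Only consider replicas at least as fresh as what we've seen,
        # so a read can never go backwards in time for this client.
        fresh = [r for r in self.replicas if r.version >= self.seen]
        replica = random.choice(fresh)
        self.seen = max(self.seen, replica.version)
        return replica.value

# One replica has the new write; a lagging replica still holds the old value.
uptodate = Replica(version=2, value="new")
lagging = Replica(version=1, value="old")
client = Client([uptodate, lagging])

client.seen = 2  # pretend the client has already observed version 2
assert all(client.read() == "new" for _ in range(10))
```

Before the client has seen version 2, either value may be returned; afterwards, the lagging replica is filtered out until it catches up.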

Most NoSQL databases can only provide eventual consistency.