Saturday, August 19, 2017

Kinetics of Large Distributed Clusters

Summary

  1. Martin Kleppmann's fatal mistake.
  2. Physicochemical kinetics does the math.
  3. The half-life of the cluster.
  4. We solve nonlinear differential equations without solving them.
  5. Nodes as a catalyst.
  6. The predictive power of graphs.
  7. 100 million years.
  8. Synergy.

In the previous article, we discussed Brewer's article and Brewer's theorem in detail. This time we will analyze Martin Kleppmann's post "The probability of data loss in large clusters".

In that post, the author attempts to model the following problem. To keep data from being lost, replication is usually used; for this discussion it does not actually matter whether erasure coding is used or not. The author fixes the probability that a single node fails and then asks: what is the probability of data loss as the number of nodes grows?

The answer is shown in this picture:

[Figure: data loss probability vs. cluster size]
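
To make the setup concrete, here is a minimal sketch of this kind of estimate (my reconstruction, not Kleppmann's code): assume every node fails independently with probability p, every partition keeps 3 replicas on distinct nodes, and the number of partitions grows with the cluster. The function name p_data_loss and the parameter values, including partitions_per_node = 256, are illustrative assumptions, not taken from the original post.

    # A minimal sketch, assuming independent node failures and random
    # replica placement; parameter values are illustrative.
    def p_data_loss(n_nodes, p_node_fail, replicas=3, partitions_per_node=256):
        """Probability that at least one partition loses all of its replicas."""
        n_partitions = n_nodes * partitions_per_node
        # Probability that all replicas of one fixed partition fail together.
        p_partition_lost = p_node_fail ** replicas
        # Treating partitions as independent is an approximation, since
        # different partitions share the same physical nodes.
        return 1 - (1 - p_partition_lost) ** n_partitions

    for n in (3, 100, 1_000, 10_000):
        print(n, p_data_loss(n, p_node_fail=0.001))

Even though p**3 is tiny for a single partition, the number of partitions grows linearly with the cluster, so the overall loss probability climbs with n; this is the effect the picture above illustrates.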

Sunday, August 13, 2017

Latency of Geo-Distributed Databases

Theorem 0. The minimum guaranteed latency for a globally highly available, strongly consistent database is 133 ms.
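
Before the formal development, a back-of-the-envelope reading of the constant may help (this is my reconstruction; the article proves the bound rigorously through its lemmas): strong consistency for a globally available database forces, in the worst case, a coordination round trip between antipodal points, i.e. a path no shorter than the Earth's full circumference, and even at the speed of light in vacuum that takes

\[
t_{\min} = \frac{C_{\text{Earth}}}{c} \approx \frac{40\,000\ \text{km}}{300\,000\ \text{km/s}} \approx 133\ \text{ms}.
\]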

[Figure: the Earth]

1 Abstract

The article introduces, step by step, the formal vocabulary and the auxiliary lemmas and theorems needed to prove the main Theorem 0. Finally, the CAL theorem is formulated as a consequence.

2 Introduction

Modern applications require intensive work with huge amounts of data, including both massive transactional processing and analytical research. In answer to this demand, a new generation of databases has appeared: NewSQL databases. These databases provide the following important characteristics: horizontal scalability, geo-availability, and strong consistency.

The NewSQL era opens new possibilities for storing and processing so-called Big Data. At the same time, an important question arises: how fast can these databases be? Improving performance and latency is a very challenging task because it touches almost all layers of database construction: from hardware questions of data center connectivity and availability to sophisticated software algorithms and architectural design.

Thus, we need to understand how far latency optimization can go and which limitations we have to deal with. This article tries to answer that challenge.