There is a long history of clustering architectures for distributed databases, driven by two primary goals. The first is scalability: when a cluster of nodes reaches its capacity to perform work, additional nodes are introduced to handle the increased load. The second is availability: the assurance that if a node fails, say during ingestion and/or querying, the remaining nodes continue to execute because state has been replicated. And yet such strategies are from an earlier age, where a node’s state (i.e. local compute, memory, storage) is synchronized amongst one or more nodes in the cluster. In other words, the first-generation scaling databases were really individual databases connected together over some synchronization protocol. Since then, many things have changed. The introduction of cloud computing (i.e. on-demand compute), as well as cloud object storage (i.e. distributed persistence), has changed how both scalability and availability can be architected. Let’s explain.
As mentioned, there is a common pattern to how first-generation database workloads are architected for scalability. Each node within a cluster works as part of a group but executes in isolation, synchronizing state amongst its peers within a quorum. A key part of database state synchronization is the division of work across the cluster. In other words, concepts like partitioning data into shards during ingestion, and then querying those shards, are major constructs in such architectures.
The procedures for adding/removing nodes in a quorum, as well as the actual sharding across a cluster, have a history of complexity. In the case of ingestion, load balancing data into partitions via sharding can be relatively straightforward. However, it certainly increases the complexity of scaling, as well as of querying. As a result, partition keys and their criteria are selected by database administrators to achieve as much balance as possible. Even if balance is achieved, query load can be so high that the primary responsible for a particular shard cannot keep up, and a secondary read replica is employed to offload it. There are several options for primary/secondary configurations within a cluster, all well beyond the scope of this document. However, it should be noted that the larger and more complex the dataset, the more complex the partitioning, clustering, and runtime support become.
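To make the sharding problem concrete, here is a minimal, purely illustrative sketch (in Python) of hash-based partitioning; the shard count, the `user_id` partition key, and the records are hypothetical, not how any particular first-generation database implements it:

```python
import hashlib

NUM_SHARDS = 8  # shard count fixed up front by the administrator

def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard via a stable hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Ingestion routes each record to the node that owns its shard.
records = [{"user_id": "u-17", "event": "login"},
           {"user_id": "u-42", "event": "purchase"}]
for rec in records:
    print(rec["event"], "-> shard", shard_for(rec["user_id"]))

# Any query that does not filter on user_id must fan out to every shard,
# and a single hot key can overload one primary -- the imbalance that
# forces read replicas and careful key selection.
```

The moment the query pattern stops matching the partition key, or one key runs hot, the neat balance above disappears, which is exactly where the primary/read-replica complexity begins.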
As one can imagine, years of managing solutions built on such thinking have been, to say the least, problematic at best and neither flexible nor agile. We Chaosians experienced such legacy architectures and wanted to do better.
Read: How BAI Communications Scaled Log Analytics to Optimize Network Performance
There are two major aspects to a second-generation distributed database. The first is leveraging distributed storage and the second is implementing a serverless architecture. Such a foundation allows for a shared-everything design (which we adopted, yet did not stop there). We like to say here at ChaosSearch that we designed and implemented the first truly third-generation distributed database. Let’s explain.
A third-generation architecture is based on second-generation concepts, but takes them much further. It builds on a serverless and shared-everything design, and adds statelessness: any node within a quorum can instantly and independently work on any aspect of a database workload at any time. Nodes don’t communicate state directly, per se, but leverage strongly consistent, distributed storage such as Amazon S3 or Google GCS as the point of synchronization, together with deterministic controls and procedures. In other words, cloud object storage provides both a location to store data and a place to synchronize state.
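To illustrate what “storage as the point of synchronization” might look like in practice, here is a minimal, hedged sketch using Amazon S3 and boto3; the bucket name and key layout are invented for the example and are not ChaosSearch’s actual format:

```python
import json
import boto3

s3 = boto3.client("s3")
STATE_BUCKET = "example-index-state"   # hypothetical bucket

def publish_segment(segment_id: str, metadata: dict) -> None:
    """A worker records a finished index segment as an object in S3."""
    s3.put_object(
        Bucket=STATE_BUCKET,
        Key=f"manifests/{segment_id}.json",
        Body=json.dumps(metadata).encode("utf-8"),
    )

def current_segments() -> list:
    """Any worker, including a brand-new one, reconstructs the cluster's
    view of the world by listing S3 -- no node-to-node state exchange."""
    resp = s3.list_objects_v2(Bucket=STATE_BUCKET, Prefix="manifests/")
    return [obj["Key"] for obj in resp.get("Contents", [])]
```

Because every worker reads the same strongly consistent listing, a node can join, die, or be replaced at any moment without anyone re-synchronizing it.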
To conceptualize this definition, a third-generation architecture should be viewed as a distributed operating system, where there is one storage component and one compute component (though each is backed by multiple physical resources). In other words, compute is part of one "logical" Central Processing Unit (CPU), where a node is sliced up into workers (i.e. processes) that are scheduled by the operating system to perform work. Each worker has access to any and all aspects of the storage component (i.e. service). And since cloud object storage is conceptually infinite storage, and scaling a cluster of nodes (via autoscaling Amazon EC2 or Google Compute Engine) is quantifiably deterministic, such architectures are, for all practical purposes, infinitely scalable. Even networks are partitioned via linear and isolated compute allocation (i.e. instance selection) to scale capacity. In other words, more compute capacity equals more network capacity.
All in all, a third-generation architecture scales simply by adding compute capacity, whether for ingestion and/or query load. At ChaosSearch, customers have scaled from 50 terabytes a day to 250 terabytes a day with a click of a mouse or a service notification. This is certainly not the case for first- or second-generation architectures, where events like Black Friday are traditionally something to plan for, test for, and worry about for weeks, if not months. Not so for third-generation.
Read: Leveraging Amazon S3 Cloud Object Storage for Analytics
There are standard conventions with regard to first-generation databases and their availability. A key aspect of availability is scalability, which the previous section covers and will not be repeated here. However, the primary/secondary (i.e. active-active / active-passive) methods of clustering should be understood. The key takeaway is that data is not available until, and only as long as, two or more nodes in the cluster have received the data, persisted it, and the quorum agrees. Such architectures generally use a shared-nothing design, with its inherent synchronization congestion and complexity. And as load increases, whether from ingestion and/or querying, failure is frequently close behind. Getting this combination of configuration, partitioning, and resource sizing right is daunting, to say the least.
To get around such failures, systems are often over-provisioned to reduce the need to materially scale in the future. Although this might decrease potential failures, it also inflates cost due to additional (usually dormant) resources and administration support. As a result, a new generation of architecture was developed with a shared-everything philosophy. This second generation of databases removes the congestion and complexity required to keep state synchronized. Such architectures often leverage a serverless design (for the most part - see caching), where read replicas can easily be added so that a primary is not overwhelmed with queries, thus reducing the overall risk of failure. In the case of active-active, there is still ingestion synchronization, but it is now limited. And with a good sharding scheme, this is a workable option for availability.
And yet, building and managing solutions built on such thinking is, to say the least, still fragile and problematic. We Chaosians experienced such legacy architectures and wanted to do better.
As previously indicated, first- and second-generation architectures are only as good as the last accessible node. As a result, ensuring nodes are up and running is critical. Configuring active-active and/or active-passive clustering to run across availability zones, and certainly across regions, with cloud providers such as AWS or GCP is essential. Failure to do so will certainly result in loss of data on ingest or query. The nightmare is the continuous design, deploy, and monitor cycle triggered by some physical or logical failure. Not to mention the cost of such a highly available architecture: typically three times or more that of a standalone deployment.
And here is where we at ChaosSearch set out to create the first truly third-generation database. A design that uses best-of-breed services like cloud object storage and autoscaled compute. An architecture that adopts shared-everything, as well as serverless and stateless, design. By doing so, availability of ingestion and queries is greatly simplified and costs are significantly reduced. Let me explain.
One of the key limiting factors in availability is ingestion. If a primary node is not up and running, ready to receive, loss of data is a certainty. In previous generations, having two or more primary nodes is a must, which is where the cost and complexity come in, not to mention the need for pipelining tools like Kafka. However, if cloud object storage is the primary entry point for ingestion, there is no need for a primary node or pipeline buffering. In other words, cloud object storage has all the capabilities (and scalability with availability) to receive data, and each of the major cloud provider offerings is naturally available across availability zones (no work or cost). And coming this year, both Amazon S3 and Google GCS will have region-level availability. What this means is that leveraging cloud object storage as the data ingestion point (i.e. the queue) for a next-generation database is the simplest and most cost-effective way to achieve both zone and region availability, full stop.
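As a rough sketch of what “cloud object storage as the entry point” means for a producer (bucket and key names are placeholders, not a prescribed layout):

```python
import gzip
import time
import boto3

s3 = boto3.client("s3")
INGEST_BUCKET = "example-ingest"   # hypothetical bucket

def ship_log_batch(lines: list) -> str:
    """Durably 'ingest' a batch of log lines with nothing but an object PUT."""
    key = f"raw/{int(time.time())}.log.gz"
    s3.put_object(Bucket=INGEST_BUCKET, Key=key,
                  Body=gzip.compress("\n".join(lines).encode("utf-8")))
    return key   # once the PUT succeeds, the data is safe -- no primary node involved
```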
The reason ChaosSearch can rely solely on cloud object storage for ingestion is that we purposely and fundamentally designed around its queuing/notification capabilities (i.e. Amazon SQS or Google Pub/Sub): once data lands (with the storage service’s strongly consistent, durable, resilient, and secure attributes), there is nothing else for the system to do, and that data is queryable within moments. In other words, this storage service naturally provides availability without all the time, cost, and complexity required in first/second-generation database architectures. And since it is dramatically less complicated than an active-active or active-passive design, and since it requires no external compute for persistence, it can certainly be argued that it is even more reliable due to its simplicity.
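For the AWS flavor of those notification capabilities, the wiring can be as simple as the following sketch (the queue ARN and bucket name are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Every new object in the ingest bucket announces itself on an SQS queue.
s3.put_bucket_notification_configuration(
    Bucket="example-ingest",
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": "arn:aws:sqs:us-east-1:123456789012:example-ingest-events",
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)
```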
Now with that said, actually indexing the ingested data so it can be queried does need compute, but there is no requirement for active-active or active-passive synchronization and the associated cost or complexity. With ChaosSearch’s serverless and stateless architecture, any node can do any work (index or query), so a simple scheduling model can be used. In other words, when data is put in cloud object storage, work is scheduled to index it. Once that data is indexed (a full representation), the index is also stored in cloud object storage, and only then is the queue notification acknowledged. If a worker dies during this time, the scheduler simply reschedules the task to another available worker. If a worker and/or node dies, work is rescheduled such that indexing remains available as long as a single worker is active. And if there are no workers at all, ingestion is still available thanks to cloud object storage. This is far more resilient than a classic, limited multi-node first/second-generation cluster.
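A minimal sketch of that acknowledge-after-index loop, assuming the S3-to-SQS wiring above; the bucket names are placeholders and `build_index` is a stand-in, not ChaosSearch internals:

```python
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-ingest-events"

def parse_s3_event(body: str):
    """Pull the bucket and key out of the S3 event notification payload."""
    record = json.loads(body)["Records"][0]
    return record["s3"]["bucket"]["name"], record["s3"]["object"]["key"]

def build_index(bucket: str, key: str) -> bytes:
    """Stand-in for the real indexing work: fetch the raw object and
    produce some index representation of it."""
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return raw  # a real system would return an index, not the raw bytes

def worker_loop() -> None:
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            bucket, key = parse_s3_event(msg["Body"])
            s3.put_object(Bucket="example-index",
                          Key=f"indexes/{key}",
                          Body=build_index(bucket, key))
            # Acknowledge only after the index is safely back in S3. If the
            # worker died before this point, the message's visibility timeout
            # would expire and another worker would pick up the same task.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```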
And finally, on third-generation query availability: as mentioned, requests can be serviced as long as at least one worker is active. A key aspect of this operating-system-style architecture is that compute leased for one purpose (say, indexing) can also be stolen for another (say, queries) and vice versa, because scheduling is asynchronous and dynamic: workers can be terminated without affecting a task’s overall availability. Workers can also be prioritized, pooled, and elastically allocated (ask about our compute farm).
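The leasing/stealing idea can be sketched conceptually (this is not our scheduler) as a single shared pool where every idle worker simply claims the most urgent task, whatever kind it is:

```python
import heapq
import itertools

QUERY, INDEX = 0, 1            # lower value = higher priority
_seq = itertools.count()       # tie-breaker so equal-priority tasks stay FIFO
_pool = []                     # one shared pool for all kinds of work

def submit(kind: int, payload: dict) -> None:
    """Queue a unit of work; queries sort ahead of pending index tasks."""
    heapq.heappush(_pool, (kind, next(_seq), payload))

def claim():
    """Any idle worker calls this. A worker that just finished an indexing
    task is effectively 'stolen' for a query if one is waiting."""
    return heapq.heappop(_pool)[2] if _pool else None
```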
In the case where availability is required across regions, there are two ways it can be supported. The first option is to send data to two or more regions (active-active). The second, and probably the most preferable, is to send to one region (active), index there, and then store the index in more than one region (passive), where an active region can in turn act as the passive region for other actives. Now, it should be noted that using native bucket replication is not recommended due to its inherent latency: both active-active and active-passive are considered hot, whereas bucket replication would be considered cold. And finally, there are unique options with active deployments where the index is not replicated but compute fails over to support hot queries. This is a variation of active-passive in which a zone in one region fails over to a zone in another region (see zone availability below) when multi-region availability is a requirement.
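As a hedged illustration of the active-passive option (region names and buckets are placeholders), the worker that writes an index segment in the active region can copy it to the passive region itself, rather than waiting on native bucket replication:

```python
import boto3

ACTIVE_BUCKET = "example-index-us-east-1"    # active region's index bucket
PASSIVE_BUCKET = "example-index-us-west-2"   # passive region's copy
passive_s3 = boto3.client("s3", region_name="us-west-2")

def replicate_segment(key: str) -> None:
    """Explicitly copy a freshly written index object across regions so the
    passive copy stays 'hot' rather than lagging behind replication."""
    passive_s3.copy_object(
        Bucket=PASSIVE_BUCKET,
        Key=key,
        CopySource={"Bucket": ACTIVE_BUCKET, "Key": key},
    )
```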
Zone availability is a bit less obvious. Cloud providers typically create three zones within a region where resources and power are isolated. A question might be asked: does one need workers in each zone if cloud object storage is, by default, available in each zone within a region? The answer is, it really depends on requirements. When it comes to ingestion availability, there is “no” difference, but in the case of queries there certainly is, and it all comes down to cost and time to query. If compute is in all zones, time to query is hot up to the last zone available within a region. If availability has been set up cross-region (active-passive), only one zone needs to be available, since the failover region can take up the slack while the failed-from region redeploys to a remaining zone. This failover to another region is "hot", and once the new zone is running, the failover region can fail back hot as well. On the surface this would seem to be the best option for availability and cost when multi-region availability is leveraged. The only drawback is the performance of compute in the "failover" region querying the “failed-from” region’s cloud object storage. Typically, deploying to a zone takes upwards of ten minutes, with the actual compute taking upwards of five, for a total of fifteen. Not too bad to save two to three times the cost for a fifteen-minute window of slower execution. And to reiterate, deploying to every zone, where compute might sit dormant, achieves no availability benefit, for a lot more cost.
Read: AWS vs GCP: Top Cloud Services Logs to Watch and Why
It has been shown that cloud object storage supports scalability and availability requirements with a simpler, more supportable architecture at a reduction in cost. Our own customers’ experience has shown that this third-generation database achieves both scale and performance at a tenth the cost of traditional first/second-generation solutions. When storage is the “enabler” for such scalability and availability, many wonderful architectural decisions can be made.
Read: The 7 Costly and Complex Challenges of Big Data Analytics