1 Introduction
Implementing disaster recovery (DR) and business continuity planning (BCP) on-premises is a tedious and expensive task that can take years to fully implement, depending on its complexity. It involves geographically separated secondary data centers, networking between those data centers, tools and processes, and the design and implementation of DR solutions. It also requires a thorough business continuity plan to execute in case of failures.
Thanks to emerging big data technologies and the cloud, you can achieve all of this together if the toolset is properly designed and the new concepts are fully utilized.
For example, managed instance groups, which combine groups of VM instances into a single unit, provide high availability and autoscaling together and can satisfy DR and BCP requirements.
Certain managed services like BigQuery and Pub/Sub satisfy DR and BCP requirements without any extra effort. Certain resources like Cloud Storage require minimal effort to implement. Cloud SQL and VM instances require a lot more effort to make them compliant with DR and BCP.
For certain resources like Dataflow and Dataproc, DR and BCP can be implemented through continuous delivery (CD) processes.
This blog outlines strategies to fully implement DR and BCP across the toolset and resources you are currently using on GCP, and those you will probably use in the near future.
2 Key terminology
Here are some key concepts that will go hand-in-hand with disaster recovery and business continuity plans.
2.1 Availability
Availability is a measure of the time that services are functioning correctly and accessible to users. Availability requirements are typically stated in terms of percent of time a service should be up and running, such as 99.99 percent.
2.2 Reliability
Reliability is a closely related concept to availability. Reliability is a measure of the probability that a service will continue to function under some load for a period of time. The level of reliability that a service can achieve is highly dependent on the availability of systems upon which it depends.
2.3 Scalability
Scalability is the ability of a service to adapt its infrastructure to the load on the system. When load decreases, some resources may be shut down. When load increases, resources can be added. Autoscalers and instance groups are often used to ensure scalability when using Compute Engine.
2.4 Durability
Durability measures the likelihood that a stored object will be retrievable in the future. Cloud Storage offers a 99.999999999 percent (eleven 9s) durability guarantee, which means it is extremely unlikely that you will lose an object stored in Cloud Storage. Simple probability dictates, however, that as the number of stored objects increases, the likelihood that at least one of them is lost also increases.
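To make that last point concrete, here is a minimal back-of-the-envelope sketch in Python (the durability figure is the one quoted above; the object counts are purely illustrative):

```python
# If each object independently survives with probability d (its durability),
# the probability of losing at least one of n objects is 1 - d**n.
d = 0.99999999999  # eleven 9s of durability
for n in (1_000, 1_000_000, 1_000_000_000):
    p_loss = 1 - d ** n
    print(f"{n:>13,} objects -> P(at least one loss) ~ {p_loss:.8%}")
```

Even at eleven 9s, a billion objects carry roughly a one percent chance that at least one of them is eventually lost.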
Although all four concepts mentioned above go together in most cases, this blog mainly focuses on availability and durability.
Here are a few more concepts of interest that may help in understanding DR and BCP.
2.5 Load Balancing
Google Cloud offers server-side load balancing so you can distribute incoming traffic across multiple virtual machine (VM) instances. Load balancing provides the following benefits:
● Scale the application
● Support heavy traffic
● Detect and automatically remove unhealthy VM instances using health checks; instances that become healthy again are automatically re-added
● Route traffic to the closest virtual machine
Google Cloud load balancing is a managed service, which means its components are redundant and highly available. If a load balancing component fails, it is restarted or replaced automatically and immediately.
2.6 Autoscaling
Compute Engine offers autoscaling to automatically add or remove VM instances from an instance group based on increases or decreases in load. Autoscaling lets your apps gracefully handle increases in traffic, and it reduces cost when the need for resources is lower. After you define the autoscaling policy, the autoscaler performs automatic scaling based on the measured load.
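As a sketch of what an autoscaling policy looks like in practice, the following Python snippet attaches an autoscaler to an existing managed instance group through the Compute Engine API. The project, zone, and resource names are hypothetical, and the snippet assumes the google-api-python-client library with default application credentials:

```python
import googleapiclient.discovery

compute = googleapiclient.discovery.build("compute", "v1")

# Scale the (hypothetical) "web-mig" group between 2 and 10 instances,
# targeting 60% average CPU utilization across the group.
compute.autoscalers().insert(
    project="my-project",  # hypothetical project ID
    zone="us-central1-a",
    body={
        "name": "web-autoscaler",
        "target": "zones/us-central1-a/instanceGroupManagers/web-mig",
        "autoscalingPolicy": {
            "minNumReplicas": 2,
            "maxNumReplicas": 10,
            "cpuUtilization": {"utilizationTarget": 0.6},
        },
    },
).execute()
```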
3 Locations, Regions and Zones
Google Cloud services are available in locations across North America, South America, Europe, Asia, and Australia. These locations are divided into regions and zones. You can choose where to locate your applications to meet your latency, availability, and durability requirements.
3.1 Locations
GCP resources are hosted in multiple locations worldwide. These locations are composed of regions and zones.
3.2 Regions
Regions are independent geographic areas that consist of zones. Locations within a region tend to have round-trip network latencies of under 1 ms at the 95th percentile.
3.3 Zones
A zone is a deployment area for Google Cloud resources within a region. Zones should be considered a single failure domain within a region. To deploy fault-tolerant applications with high availability and help protect against unexpected failures, deploy the applications across multiple zones in a region.
To protect against the loss of an entire region due to a natural disaster, you need to deploy all critical applications, and especially databases, across multiple regions. Where you cannot deploy an application into multiple regions, you need a plan for how to bring that application up in a different region.
As of March 2020, GCP has the following regions and zones within the USA.
● Region: us-central1
○ Zone: us-central1-a
○ Zone: us-central1-b
○ Zone: us-central1-c
○ Zone: us-central1-f
Location: Council Bluffs, Iowa, USA
● Region: us-west1
○ Zone: us-west1-a
○ Zone: us-west1-b
○ Zone: us-west1-c
Location: The Dalles, Oregon, USA
● Region: us-west3
○ Zone: us-west3-a
○ Zone: us-west3-b
○ Zone: us-west3-c
Location: Salt Lake City, Utah, USA
● Region: us-west2
○ Zone: us-west2-a
○ Zone: us-west2-b
○ Zone: us-west2-c
Location: Los Angeles, California, USA
● Region: us-east4
○ Zone: us-east4-a
○ Zone: us-east4-b
○ Zone: us-east4-c
Location: Ashburn, Northern Virginia, USA
● Region: us-east1
○ Zone: us-east1-b
○ Zone: us-east1-c
○ Zone: us-east1-d
Location: Moncks Corner, South Carolina, USA
As per GCP, the following is the availability of a resource when implemented within a single zone, within multiple zones (regional), or within multiple regions (global).
| Percentage Uptime | Geography |
| --- | --- |
| 99.9% | Single zone |
| 99.99% | Single region |
| 99.999% | Global |
The above availability can be translated into time as below:
| Percent Uptime | Downtime/Day | Downtime/Week | Downtime/Month |
| --- | --- | --- | --- |
| 99.00 | 14.4 minutes | 1.68 hours | 7.31 hours |
| 99.90 | 1.44 minutes | 10.08 minutes | 43.83 minutes |
| 99.99 | 8.64 seconds | 1.01 minutes | 4.38 minutes |
| 99.999 | 864 milliseconds | 6.05 seconds | 26.3 seconds |
| 99.9999 | 86.4 milliseconds | 604.8 milliseconds | 2.63 seconds |
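The table above can be reproduced with a few lines of Python (using a 30.44-day average month):

```python
# Convert an uptime percentage into allowed downtime per day, week, and month.
for uptime_pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    down = 1 - uptime_pct / 100                      # fraction of time down
    day, week, month = 24 * 3600, 7 * 24 * 3600, 30.44 * 24 * 3600
    print(f"{uptime_pct:>8}%: {down * day:9.2f} s/day,"
          f" {down * week:9.2f} s/week, {down * month / 60:8.2f} min/month")
```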
4 GCP Resource types and disaster recovery
This section will explain how DR and BCP can be implemented with various GCP resource types.
4.1 Compute Engine Instances
A single VM instance can only be created in a single zone. High availability in Compute Engine is ensured by several different mechanisms and practices.
When creating a VM instance that is critical to business continuity, use one of the following methods to protect it in disaster scenarios. The method you choose depends on many factors, including how critical the resource is to the business and the SLAs for bringing it back after a failure.
4.1.1 Hardware Redundancy and Live Migration
At the physical hardware level, the large number of physical servers in GCP provides redundancy against hardware failures. If a physical server fails, others are available to replace it. Google also provides live migration, which moves VMs to other physical servers when there is a problem with a physical server or scheduled maintenance has to occur. Live migration is also used when network or power systems are down, security patches need to be applied, or configurations need to be modified.
There are certain limitations with live migration. It is currently not available for preemptible VMs, but preemptible VMs are not designed to be highly available in the first place. VMs with GPUs attached also cannot be live migrated. These constraints may change in the future.
4.1.2 Managed Instance Groups
High availability also comes from the use of redundant VMs. Managed instance groups are the best way to create a cluster of VMs, all running the same services in the same configuration. A managed instance group uses an instance template to specify the configuration of each VM in the group. Instance templates specify machine type, boot disk image, and other VM configuration details.
If a VM in the instance group fails, another one will be created using the instance template.
Managed instance groups (MIGs) provide other features that help improve availability. A VM may be operating correctly, but the application running on the VM may not be functioning as expected. Instance groups can detect this using an application-specific health check. If an application fails the health check, the managed instance group will create a new instance. This feature is known as auto-healing.
Managed instance groups use load balancing to distribute workload across instances. If an instance is not available, traffic will be routed to other servers in the instance group. Instance groups can be configured as regional instance groups. This distributes instances across multiple zones. If there is a failure in a zone, the application can continue to run in the other zones.
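As an illustrative sketch, the following Python snippet creates a regional managed instance group from an existing instance template through the Compute Engine API. All names are hypothetical, and the snippet assumes google-api-python-client with default credentials:

```python
import googleapiclient.discovery

compute = googleapiclient.discovery.build("compute", "v1")

# Create a regional MIG that spreads three instances across three zones,
# so the service survives the loss of any single zone.
compute.regionInstanceGroupManagers().insert(
    project="my-project",  # hypothetical project ID
    region="us-central1",
    body={
        "name": "web-mig",
        "instanceTemplate": "global/instanceTemplates/web-template",
        "targetSize": 3,
        "distributionPolicy": {
            "zones": [
                {"zone": "zones/us-central1-a"},
                {"zone": "zones/us-central1-b"},
                {"zone": "zones/us-central1-c"},
            ]
        },
    },
).execute()
```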
4.1.3 Multiple Regions and Global Load Balancing
Beyond the regional instance group level, you can further ensure high availability by running the application in multiple regions and using a global load balancer to distribute workload.
This would have the added advantage of allowing users to connect to an application instance in the closest region, which could reduce latency.
You would have the option of using the HTTP(S), SSL Proxy, or TCP Proxy load balancers for global load balancing.
Always use globally persisted storage or a database like BigQuery to store critical data or information so that when you redeploy the VM instance in a different region, you can attach the storage back. This protects against data loss.
4.2 Kubernetes Engine
Kubernetes Engine is a managed Kubernetes service that is used for container orchestration. Kubernetes is designed to provide highly available containerized services.
VMs in a GKE Kubernetes cluster are members of a managed instance group, and so they have all of the high availability features described previously.
Kubernetes continually monitors the state of containers and pods. Pods are the smallest unit of deployment in Kubernetes; they usually have one container, but in some cases a pod may have two or more tightly coupled containers. If pods are not functioning correctly, they will be shut down and replaced. Kubernetes collects statistics, such as the number of desired pods and the number of available pods, which can be reported to Stackdriver.
By default, Kubernetes Engine creates a cluster in a single zone. To improve availability, you can create a regional cluster, which distributes the underlying VMs across multiple zones within a region. GKE replicates masters and nodes across the zones, so the cluster remains available in the event of a zone failure.
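As a minimal sketch, here is how a regional cluster can be created with the google-cloud-container Python client (version 2.x or later is assumed, and the project and cluster names are hypothetical). Passing a region rather than a zone as the parent location is what makes the cluster regional:

```python
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

# A region (not a zone) as the location yields a regional cluster:
# masters and nodes are replicated across the region's zones.
client.create_cluster(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "cluster": container_v1.Cluster(
            name="dr-demo-cluster",
            initial_node_count=1,  # per zone, so 3 nodes in a 3-zone region
        ),
    }
)
```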
If you are using storage for stateful applications, always use globally persisted storage or a database like BigQuery to store stateful information so that when you redeploy the Kubernetes cluster in a different region, you can attach the storage back. This protects against data loss.
4.3 App Engine and Cloud Functions
App Engine and Cloud Functions are fully managed compute services. Users of these services are not responsible for maintaining the availability of the computing resources; the Google Cloud Platform ensures the high availability of these services. Of course, individual App Engine applications and Cloud Functions may still fail and leave the application unavailable.
DR and BCP can be maintained for App Engine and Cloud Functions using CI/CD workflows. Because they are not tied to data, they can be redeployed in a matter of minutes.
4.4 BigQuery and other distributed databases
BigQuery is by default deployed in the US multi-region, so it meets our DR and BCP requirements without any extra effort.
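For example, with the google-cloud-bigquery Python client you can pin a dataset's location to the US multi-region explicitly when you create it (the project and dataset names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Create a dataset in the US multi-region; BigQuery stores and replicates
# the data across multiple US regions automatically.
dataset = bigquery.Dataset("my-project.analytics")
dataset.location = "US"
client.create_dataset(dataset)
```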
Other distributed databases like Spanner, Bigtable, and Datastore are globally distributed, and no additional effort is required if you use these resources.
4.5 Cloud SQL
Cloud SQL is a managed service for MySQL and PostgreSQL. Cloud SQL supports zonal and multi-zonal (regional) implementations.
Cloud SQL does not support multi-regional implementations at this point. Multi-regional redundancy can be achieved by building Cloud SQL instances in multiple regions and setting up replication, or by shipping backups and logs to global storage. Based on your SLAs and the criticality of the data, implement one of these methods if you need multi-regional redundancy.
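As a sketch of the replication approach, the Cloud SQL Admin API can create a read replica of an existing primary in another region; the replica can then be promoted if the primary's region fails. All names are hypothetical, and the snippet assumes google-api-python-client with default credentials:

```python
import googleapiclient.discovery

sqladmin = googleapiclient.discovery.build("sqladmin", "v1beta4")

# Create a cross-region read replica of the (hypothetical) instance
# "primary-central"; promote it if the primary's region goes down.
sqladmin.instances().insert(
    project="my-project",  # hypothetical project ID
    body={
        "name": "replica-east",
        "masterInstanceName": "primary-central",
        "region": "us-east1",
        "settings": {"tier": "db-n1-standard-1"},
    },
).execute()
```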
4.6 Cloud Storage and Persistent disks
Persistent disks are faster than Cloud Storage and provide block storage, including SSDs. Cloud Storage is good for object storage needs such as a data lake.
Cloud Storage buckets can be regional or multi-regional, and persistent disks can be zonal or regional, so both offer options for high availability.
If the stored data or information is critical and must be retained, use multi-regional storage. Use multi-regional for all production and critical data needs.
Use regional for non-production and non-critical needs.
Use Nearline or Coldline for backup and archival data that is not accessed frequently.
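A minimal sketch of both choices with the google-cloud-storage Python client (the project and bucket names are hypothetical, and bucket names must be globally unique):

```python
from google.cloud import storage

client = storage.Client(project="my-project")  # hypothetical project

# Multi-regional bucket for critical production data.
prod = client.bucket("my-prod-datalake")
prod.storage_class = "STANDARD"
client.create_bucket(prod, location="US")  # US multi-region

# Cheaper Coldline bucket for infrequently accessed backups and archives.
backup = client.bucket("my-backup-archive")
backup.storage_class = "COLDLINE"
client.create_bucket(backup, location="us-central1")  # single region
```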
4.7 Pub/Sub
Pub/Sub is a managed Google service for messaging and streaming and is available as a global service. No extra effort is needed to meet DR and BCP requirements.
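Nothing is required by default, but if your data residency policy demands it, you can constrain where Pub/Sub persists messages. A sketch with the google-cloud-pubsub client (version 2.x or later assumed; the project and topic names are hypothetical):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events")  # hypothetical

# Optionally restrict message storage to specific regions; Pub/Sub still
# replicates messages within the allowed regions.
publisher.create_topic(
    request={
        "name": topic_path,
        "message_storage_policy": {
            "allowed_persistence_regions": ["us-central1", "us-east1"]
        },
    }
)
```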
4.8 Dataflow
Dataflow resources can be created as zonal or regional. Dataflow pipelines are currently deployed through CI/CD pipelines, so in case of a regional failure, the resources can be redeployed to a different region using the continuous deployment (CD) stage.
It is important to have proper CI/CD pipelines developed and automated for all Dataflow jobs.
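As a sketch of the redeployment step: if the pipelines are packaged as Dataflow templates, a CD pipeline can launch the same template in a failover region through the Dataflow API. All paths and names are hypothetical, and the snippet assumes google-api-python-client with default credentials:

```python
import googleapiclient.discovery

dataflow = googleapiclient.discovery.build("dataflow", "v1b3")

# Launch the same (hypothetical) template in a failover region; the CD
# pipeline only needs to change `location` to move the job.
dataflow.projects().locations().templates().launch(
    projectId="my-project",  # hypothetical project ID
    location="us-east1",     # failover region
    gcsPath="gs://my-bucket/templates/etl-pipeline",
    body={
        "jobName": "etl-pipeline-dr",
        "parameters": {"input": "gs://my-bucket/input/*"},
    },
).execute()
```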
4.9 Dataproc
Dataproc resources currently support only zonal deployments. As with Dataflow, you need continuous deployment (CD) pipelines to redeploy Dataproc resources in case a particular zone or region becomes unavailable.
Use regional persistent disks or multi-regional Cloud Storage if you need to retain any data.
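A sketch of that CD step with the google-cloud-dataproc Python client (version 2.x or later assumed; all names are hypothetical): recreate the cluster in another region and point it at multi-regional Cloud Storage so no data is lost with the cluster:

```python
from google.cloud import dataproc_v1

# Dataproc is a regional API: point the client at the failover region.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-east1-dataproc.googleapis.com:443"}
)

client.create_cluster(
    request={
        "project_id": "my-project",  # hypothetical project ID
        "region": "us-east1",
        "cluster": {
            "cluster_name": "etl-cluster",
            "config": {
                # Keep staging/job data in (multi-regional) Cloud Storage,
                # not on the cluster, so the cluster stays disposable.
                "config_bucket": "my-multiregion-staging-bucket",
                "gce_cluster_config": {"zone_uri": "us-east1-b"},
            },
        },
    }
)
```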
4.10 Composer
Composer currently supports only zonal deployments. DR and BCP requirements can be met using proper, fully tested CI/CD pipelines.
5 GCP DR and BCP matrix
The following table summarizes how each resource can be made to support DR and BCP requirements when required.

| Resource | Availability scope | How to meet DR and BCP requirements |
| --- | --- | --- |
| Compute Engine instances | Zonal | Managed instance groups, regional MIGs, multi-region deployment with global load balancing |
| Kubernetes Engine | Zonal/Regional | Regional clusters; keep state in globally persisted storage |
| App Engine / Cloud Functions | Managed | Redeploy through CI/CD pipelines |
| BigQuery, Spanner, Bigtable, Datastore | Multi-regional/Global | No extra effort |
| Cloud SQL | Zonal/Regional | Cross-region replicas, or backups and log shipping to global storage |
| Cloud Storage and persistent disks | Zonal/Regional/Multi-regional | Use multi-regional for critical data; Nearline/Coldline for backups |
| Pub/Sub | Global | No extra effort |
| Dataflow | Zonal/Regional | Redeploy to another region through CD pipelines |
| Dataproc | Zonal | Redeploy through CD pipelines; keep data in multi-regional storage |
| Composer | Zonal | Redeploy through fully tested CI/CD pipelines |
6 Considerations and Assumptions
Even though this blog touches on some network aspects, it doesn't fully cover all network requirements, such as having redundant networks between on-premises and GCP, or having multiple interconnects if you are using vendor-provided interconnects.
This blog also doesn't fully cover OLTP database needs or other applications beyond the scope of the data lake, as I don't have much insight into those projects.
This blog treats GCP regions as the geographically separated data centers and assumes GCP is sufficient for our DR and BCP needs. It does not cover an entire GCP shutdown, for whatever reason; that is where a multi-cloud approach comes into the picture.
7 Pricing and other costs
As far as high availability is concerned, global or multi-regional deployments provide far superior coverage compared to regional and zonal ones, and regional provides much higher availability than zonal.
From a cost perspective, however, global or multi-regional is far more expensive than regional or zonal, and regional is more expensive than zonal. Sometimes costs can more than double.
For example, to make VM instances redundant, you need at least two instances, a managed instance group (MIG), and network changes to keep both instances in sync. The costs will more than double in this case.
For Kubernetes, by contrast, a regional implementation mainly adds replicated masters, so the cost goes up marginally, far less than twofold.
Managed services like BigQuery, Pub/Sub, and App Engine incur no additional costs.
For Cloud Storage, costs differ depending on whether you store data in the multi-regional, regional, Coldline, or Nearline tiers. Multi-regional is more expensive, while Coldline and Nearline are cheaper; sometimes the cost difference is huge.
When you design and build resources, keep pricing, SLAs, and DR requirements in mind. Some solutions may satisfy the needs and also be cheaper.
Besides pricing and other costs, there are also latencies to keep in mind. Cross-zone and cross-region round trips add a few milliseconds of latency to application response times. The different layers of an application, such as the front end, back end, and database, need to be placed properly for better performance and to reduce network costs.