
Disaster Recovery and Business Continuity Plan on Google Cloud

Srinivasa Rao • April 23, 2020

How to achieve DR and BCP on GCP and other clouds?

1 Introduction

Implementing a disaster recovery (DR) and business continuity plan (BCP) on-premises is a tedious and expensive task that can take years to fully complete, depending on its complexity. It involves having geographically separated secondary data centers, networking between the data centers, tools and processes, and developing and implementing DR solutions. It also requires a thorough business continuity plan that can be executed in case of failures.

Thanks to emerging big data technologies and the cloud, you can achieve all of this together if the toolset is properly designed and the new concepts are fully utilized.

For example, managed instance groups, which club groups of VM instances together into a single unit, provide high availability and autoscaling together and can satisfy DR and BCP requirements.

Certain managed services like BigQuery and Pub/Sub satisfy DR and BCP requirements without any extra effort. Resources like Cloud Storage require minimal effort to implement. Cloud SQL and VM instances require a lot more effort to make them compliant with DR and BCP.

For certain resources, such as Dataflow and Dataproc, DR and BCP can be implemented through continuous delivery (CD) processes.

This blog outlines different strategies to fully implement DR and BCP across the toolset and resources you are currently using, and will probably use in the near future, on GCP.

2 Key terminology
Here are some key concepts that will go hand-in-hand with disaster recovery and business continuity plans.

2.1 Availability
Availability is a measure of the time that services are functioning correctly and accessible to users. Availability requirements are typically stated in terms of percent of time a service should be up and running, such as 99.99 percent.

2.2 Reliability
Reliability is a closely related concept to availability. Reliability is a measure of the probability that a service will continue to function under some load for a period of time. The level of reliability that a service can achieve is highly dependent on the availability of systems upon which it depends.

2.3 Scalability
Scalability is the ability of a service to adapt its infrastructure to the load on the system. When load decreases, some resources may be shut down. When load increases, resources can be added. Autoscalers and instance groups are often used to ensure scalability when using Compute Engine. 

2.4 Durability
Durability is used to measure the likelihood that a stored object will be retrievable in the future. Cloud Storage offers 99.999999999 percent (eleven 9s) durability, which means it is extremely unlikely that you will lose an object stored in Cloud Storage. Simply as a matter of probability, though, as the number of objects stored increases, the likelihood that at least one of them is lost also increases.

Even though all four concepts mentioned above go together in most cases, this blog mainly focuses on availability and durability.

Here are some more concepts of interest that may help in understanding DR and BCP.

2.5 Load Balancing
Google Cloud offers server-side load balancing so you can distribute incoming traffic across multiple virtual machine (VM) instances. Load balancing provides the following benefits:
● Scaling the application
● Supporting heavy traffic
● Detecting and automatically removing unhealthy VM instances using health checks; instances that become healthy again are automatically re-added
● Routing traffic to the closest virtual machine
Google Cloud load balancing is a managed service, which means its components are redundant and highly available. If a load balancing component fails, it is restarted or replaced automatically and immediately.

2.6 Autoscaling
Compute Engine offers autoscaling to automatically add or remove VM instances from an instance group based on increases or decreases in load. Autoscaling lets your apps gracefully handle increases in traffic, and it reduces cost when the need for resources is lower. After you define the autoscaling policy, the autoscaler performs automatic scaling based on the measured load.
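As an illustration, here is a minimal Python sketch (not tied to any particular project) of attaching an autoscaling policy to an existing regional managed instance group through the Compute Engine API. The project, region, and instance-group names are placeholders, and Application Default Credentials plus the google-api-python-client library are assumed.

```python
from googleapiclient import discovery

compute = discovery.build("compute", "v1")

project = "my-project"   # placeholder project ID
region = "us-central1"

autoscaler_body = {
    "name": "web-autoscaler",
    # Placeholder URL of an existing regional managed instance group.
    "target": (
        f"https://www.googleapis.com/compute/v1/projects/{project}"
        f"/regions/{region}/instanceGroupManagers/web-mig"
    ),
    "autoscalingPolicy": {
        "minNumReplicas": 2,
        "maxNumReplicas": 10,
        # Add instances when average CPU utilization exceeds ~60%.
        "cpuUtilization": {"utilizationTarget": 0.6},
    },
}

operation = compute.regionAutoscalers().insert(
    project=project, region=region, body=autoscaler_body
).execute()
print(operation["name"], operation["status"])
```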

3 Locations, Regions and Zones
Google Cloud services are available in locations across North America, South America, Europe, Asia, and Australia. These locations are divided into regions and zones. You can choose where to locate your applications to meet your latency, availability, and durability requirements.
3.1 Locations
GCP resources are hosted in multiple locations worldwide. These locations are composed of regions and zones.

3.2 Regions
Regions are independent geographic areas that consist of zones. Locations within a region tend to have round-trip network latencies of under 1 ms at the 95th percentile.

3.3 Zones
A zone is a deployment area for Google Cloud resources within a region. Zones should be considered a single failure domain within a region. To deploy fault-tolerant applications with high availability and help protect against unexpected failures, deploy the applications across multiple zones in a region.

To protect against the loss of an entire region due to a natural disaster, you need to deploy all critical applications, and especially databases, across multiple regions. Where you cannot deploy an application to multiple regions, you need a plan for how to bring that application up in a different region.

As of March 2020, GCP has the following regions and zones within the USA (the snippet after this list shows how to enumerate them programmatically).
● Region: us-central1
○ Zone: us-central1-a
○ Zone: us-central1-b
○ Zone: us-central1-c
○ Zone: us-central1-f
Location: Council Bluffs, Iowa, USA
● Region: us-west1
○ Zone: us-west1-a
○ Zone: us-west1-b
○ Zone: us-west1-c
Location: The Dalles, Oregon, USA
● Region: us-west2
○ Zone: us-west2-a
○ Zone: us-west2-b
○ Zone: us-west2-c
Location: Los Angeles, California, USA
● Region: us-west3
○ Zone: us-west3-a
○ Zone: us-west3-b
○ Zone: us-west3-c
Location: Salt Lake City, Utah, USA
● Region: us-east1
○ Zone: us-east1-b
○ Zone: us-east1-c
○ Zone: us-east1-d
Location: Moncks Corner, South Carolina, USA
● Region: us-east4
○ Zone: us-east4-a
○ Zone: us-east4-b
○ Zone: us-east4-c
Location: Ashburn, Northern Virginia, USA
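To check the current list of regions and zones for your own project programmatically, here is a minimal sketch using the Compute Engine API via google-api-python-client; the project ID is a placeholder and Application Default Credentials are assumed.

```python
from googleapiclient import discovery

compute = discovery.build("compute", "v1")
project = "my-project"  # placeholder project ID

# Each region resource carries the URLs of its zones.
regions = compute.regions().list(project=project).execute()
for region in regions.get("items", []):
    zones = sorted(z.rsplit("/", 1)[-1] for z in region.get("zones", []))
    print(f"{region['name']}: {', '.join(zones)}")
```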

As per GCP, the following is the availability of a resource when implemented within a single zone, across multiple zones (regional), and across multiple regions (global).

Percent Uptime    Geography
99.9%             Single zone
99.99%            Single region (multiple zones)
99.999%           Global (multiple regions)

The above availability can be translated into time as below:

Percent Uptime    Downtime/Day          Downtime/Week         Downtime/Month
99.00%            14.4 minutes          1.68 hours            7.31 hours
99.90%            1.44 minutes          10.08 minutes         43.83 minutes
99.99%            8.64 seconds          1.01 minutes          4.38 minutes
99.999%           864 milliseconds      6.05 seconds          26.3 seconds
99.9999%          86.4 milliseconds     604.8 milliseconds    2.63 seconds
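The table above follows directly from the uptime percentage; a small Python helper (using a 30.44-day average month) reproduces these numbers.

```python
def downtime_seconds(availability_pct: float) -> dict:
    """Allowed downtime, in seconds, for a given availability percentage."""
    unavailable = 1 - availability_pct / 100.0
    return {
        "day": unavailable * 24 * 3600,
        "week": unavailable * 7 * 24 * 3600,
        "month": unavailable * 30.44 * 24 * 3600,  # average month length
    }

for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    d = downtime_seconds(pct)
    print(f"{pct}% -> {d['day']:.3f} s/day, {d['week']:.2f} s/week, {d['month']:.1f} s/month")
```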

4 GCP Resource types and disaster recovery
This section will explain how DR and BCP can be implemented with various GCP resource types.

4.1 Compute Engine Instances
A single VM instance can only be created in a single zone. High availability in Compute Engine is ensured by several different mechanisms and practices.

When creating VM instances, one of the following methods should be used to protect them in DR scenarios if the VM is critical for business continuity. The method you choose depends on many factors, including how critical the resource is to the business and the SLAs for bringing that resource back in case of failures.

4.1.1 Hardware Redundancy and Live Migration
At the physical hardware level, the large number of physical servers in GCP provides redundancy against hardware failures. If a physical server fails, others are available to replace it. Google also provides live migration, which moves VMs to other physical servers when there is a problem with a physical server or scheduled maintenance has to occur. Live migration is also used when network or power systems are down, security patches need to be applied, or configurations need to be modified.
There are certain limitations with live migration. Currently, live migration is not available for preemptible VMs; however, preemptible VMs are not designed to be highly available anyway. VMs with GPUs attached cannot be live migrated. Constraints on live migration may change in the future.

4.1.2 Managed Instance Groups
High availability also comes from the use of redundant VMs. Managed instance groups are the best way to create a cluster of VMs, all running the same services in the same configuration. A managed instance group uses an instance template to specify the configuration of each VM in the group. Instance templates specify machine type, boot disk image, and other VM configuration details. 

If a VM in the instance group fails, another one will be created using the instance template. 

Managed instance groups (MIGs) provide other features that help improve availability. A VM may be operating correctly, but the application running on the VM may not be functioning as expected. Instance groups can detect this using an application-specific health check. If an application fails the health check, the managed instance group will create a new instance. This feature is known as auto-healing.

Managed instance groups use load balancing to distribute workload across instances. If an instance is not available, traffic will be routed to other servers in the instance group. Instance groups can be configured as regional instance groups. This distributes instances across multiple zones. If there is a failure in a zone, the application can continue to run in the other zones.
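As a rough sketch of what this looks like in practice, the snippet below creates a regional managed instance group with auto-healing through the Compute Engine API. The project, instance template, and health check are placeholders that must already exist; this is an outline of the approach, not a drop-in script.

```python
from googleapiclient import discovery

compute = discovery.build("compute", "v1")

project = "my-project"   # placeholder project ID
region = "us-central1"
base = f"https://www.googleapis.com/compute/v1/projects/{project}"

mig_body = {
    "name": "web-mig",
    "baseInstanceName": "web",
    # Placeholder instance template describing machine type, boot disk, etc.
    "instanceTemplate": f"{base}/global/instanceTemplates/web-template",
    "targetSize": 3,  # instances are spread across the region's zones
    "autoHealingPolicies": [
        {
            # Placeholder application-level health check used for auto-healing.
            "healthCheck": f"{base}/global/healthChecks/web-health-check",
            "initialDelaySec": 300,
        }
    ],
}

operation = compute.regionInstanceGroupManagers().insert(
    project=project, region=region, body=mig_body
).execute()
print(operation["name"], operation["status"])
```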

4.1.3 Multiple Regions and Global Load Balancing 
Beyond the regional instance group level, you can further ensure high availability by running the application in multiple regions and using a global load balancer to distribute workload. 

This would have the added advantage of allowing users to connect to an application instance in the closest region, which could reduce latency. 

You would have the option of using the HTTP(S), SSL Proxy, or TCP Proxy load balancers for global load balancing.

Always use globally persisted storage or a database like BigQuery to store critical data or information, so that when you redeploy the VM instance in a different region you can attach the storage back. This will protect against loss of data.

4.2 Kubernetes Engine
Kubernetes Engine is a managed Kubernetes service that is used for container orchestration. Kubernetes is designed to provide highly available containerized services. 

VMs in a GKE Kubernetes cluster are members of a managed instance group, and so they have all of the high availability features described previously. 

Kubernetes continually monitors the state of containers and pods. Pods are the smallest unit of deployment in Kubernetes; they usually have one container, but in some cases a pod may have two or more tightly coupled containers. If pods are not functioning correctly, they will be shut down and replaced. Kubernetes collects statistics, such as the number of desired pods and the number of available pods, which can be reported to Stackdriver. 

By default, Kubernetes Engine creates a cluster in a single zone. To improve availability, you can create a regional cluster, in which GKE distributes the underlying VMs across multiple zones within a region and replicates both masters and nodes across zones. This provides continued availability in the event of a zone failure: the redundant masters keep the cluster operational and manageable even when one zone is unavailable.

If you are using any storage for stateful applications, always use globally persisted storage or a database like BigQuery to store stateful information, so that when you redeploy the Kubernetes cluster in a different region you can attach the storage back. This will protect against loss of data.
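For workloads on a regional cluster, you can also ask the scheduler to spread replicas across zones so that a single zone outage never takes down every copy. Below is a minimal sketch using the official Kubernetes Python client with a plain-dict manifest; the deployment name, image, and namespace are placeholders, and a Kubernetes version with topology spread constraints is assumed.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},  # placeholder name
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {
                "containers": [
                    {"name": "web", "image": "gcr.io/my-project/web:latest"}  # placeholder image
                ],
                # Spread replicas across zones of the regional cluster.
                "topologySpreadConstraints": [
                    {
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "ScheduleAnyway",
                        "labelSelector": {"matchLabels": {"app": "web"}},
                    }
                ],
            },
        },
    },
}

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```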

4.3 App Engine and Cloud Functions
App Engine and Cloud Functions are fully managed compute services. Users of these services are not responsible for maintaining the availability of the computing resources. The Google Cloud Platform ensures the high availability of these services. Of course, App Engine and Cloud Functions applications and functions may fail and leave the application unavailable. 

DR and BCP can be maintained using CI/CD workflows for App Engine and Cloud Functions. As they do not hold data themselves, they can be redeployed in a matter of minutes.

4.4 BigQuery and other distributed databases
BigQuery is by default deployed in the US multi-region, so it meets our DR and BCP requirements without any extra effort.
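If you prefer to be explicit about the location rather than rely on the default, a dataset can be pinned to the US multi-region when it is created. A minimal sketch with the google-cloud-bigquery client follows; the dataset ID is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

dataset = bigquery.Dataset(f"{client.project}.dr_demo")  # placeholder dataset ID
dataset.location = "US"  # US multi-region: data is replicated across regions

client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset.dataset_id} created in location {dataset.location}")
```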

Other distributed databases like Spanner, Bigtable, and Datastore support multi-regional or global deployments, so little additional effort is required beyond choosing the appropriate configuration.

4.5 Cloud SQL
Cloud SQL is a managed service for MySQL and PostgreSQL. Cloud SQL supports zonal and multi-zonal (regional) deployments.

Cloud SQL does not support multi-regional deployments at this point. Multi-regional redundancy can be achieved by building Cloud SQL instances in multiple regions and setting up replication, or by taking backups and shipping logs to multi-regional storage. Based on SLAs and criticality, one of these methods can be implemented if you need multi-regional redundancy.
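One way to implement the backup option is to export database dumps to a multi-regional bucket on a schedule, so they can be restored into an instance in another region. Below is a minimal sketch using the Cloud SQL Admin API through google-api-python-client; the instance, database, and bucket names are placeholders.

```python
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1beta4")

project = "my-project"    # placeholder project ID
instance = "prod-mysql"   # placeholder Cloud SQL instance name

export_body = {
    "exportContext": {
        "fileType": "SQL",
        # Placeholder multi-regional bucket that survives a regional outage.
        "uri": "gs://my-multiregional-bucket/backups/prod-mysql.sql.gz",
        "databases": ["appdb"],  # placeholder database name
    }
}

operation = sqladmin.instances().export(
    project=project, instance=instance, body=export_body
).execute()
print(operation["name"], operation["status"])
```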
4.6 Cloud Storage and Persistent disks
Persistent disks are faster than Cloud Storage. Cloud Storage is good for object storage needs such as a data lake. Persistent disks provide block storage and are also available as SSDs.

Cloud Storage supports regional and multi-regional buckets, while persistent disks can be zonal or regional.

If the data or information stored is critical and must be retained, use multi-regional storage. Use multi-regional for all production and critical data needs.

Use regional storage for non-production and non-critical needs.

Use Nearline or Coldline for backups and archival data that is not accessed frequently.
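A minimal sketch of these bucket choices with the google-cloud-storage client is shown below; the bucket names are placeholders and must be globally unique.

```python
from google.cloud import storage

client = storage.Client()

# Multi-regional bucket for production and critical data.
client.create_bucket("my-prod-critical-data", location="US")

# Regional bucket for non-production, non-critical data.
client.create_bucket("my-nonprod-data", location="us-central1")

# Coldline bucket for infrequently accessed backups and archives.
archive = client.bucket("my-archive-backups")
archive.storage_class = "COLDLINE"
client.create_bucket(archive, location="US")
```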

4.7 Pub/Sub
Pub/Sub is a managed Google service for streaming and is available as a global service. No extra effort is needed to meet DR and BCP requirements.

4.8 Dataflow
Dataflow resources can be created as zonal or regional. Dataflow pipelines are typically deployed through CI/CD pipelines. In case of a regional failure, the resources can be redeployed to a different region using continuous deployment (CD) pipelines.

It is important to have proper CI/CD pipelines developed and automated for all Dataflow jobs.
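If a pipeline is packaged as a Dataflow template stored in a multi-regional bucket, the CD pipeline can launch the same template in a fallback region when the primary region is unavailable. The following is a rough sketch via the Dataflow API; the project, bucket, and template paths are placeholders.

```python
from googleapiclient import discovery

dataflow = discovery.build("dataflow", "v1b3")

project = "my-project"        # placeholder project ID
fallback_region = "us-east1"  # region to fail over to

launch_body = {
    "jobName": "etl-job-dr",
    "parameters": {},  # pipeline-specific parameters go here
    "environment": {"tempLocation": "gs://my-multiregional-bucket/temp"},  # placeholder
}

response = dataflow.projects().locations().templates().launch(
    projectId=project,
    location=fallback_region,
    gcsPath="gs://my-multiregional-bucket/templates/etl-template",  # placeholder template
    body=launch_body,
).execute()
print(response["job"]["id"])
```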

4.9 Dataproc
Dataproc resources are currently zonal only. Like Dataflow, you need continuous deployment (CD) pipelines to redeploy Dataproc resources if a particular zone or region becomes unavailable.

Use multi-regional Cloud Storage (or at least regional persistent disks) if you need to retain any data.
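A minimal sketch of recreating a cluster in an alternate region with the google-cloud-dataproc client is shown below; the project, cluster name, and machine types are placeholders, and a recent version of the client library is assumed.

```python
from google.cloud import dataproc_v1

project = "my-project"        # placeholder project ID
fallback_region = "us-east1"  # region to rebuild the cluster in

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{fallback_region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project,
    "cluster_name": "etl-cluster-dr",  # placeholder cluster name
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = client.create_cluster(
    request={"project_id": project, "region": fallback_region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready
```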

4.10 Composer
Composer currently supports only zonal deployments. DR and BCP requirements can be met using properly built and fully tested CI/CD pipelines.

5 GCP DR and BCP matrix
The following table explains how each resource can be made to support DR and BCP requirements when required.

Resource or toolset | HA (Zonal, Regional, Global) | Description | DR and BCP implementation
Static IP | Regional, Global | A static external IP address is reserved for your project until you decide to release it. | Use global for load balancers; use regional for VM instances.
Virtual Machine (Instance) | Zonal | An instance is like a server on which you install and manage your own software. | Use regional persistent disks or multi-regional storage where you need to retain data across instance failures. Live migration keeps VMs running through host maintenance and failures. To fully achieve HA and DR, use managed instance groups with instances spread across multiple zones, fronted by global or regional load balancing as required.
Persistent disk | Zonal, Regional | Block storage that can be attached to VM instances or Kubernetes nodes; faster than Cloud Storage and also available as SSD for much faster access. | If the data is critical and must be retained, use regional disks and snapshot regularly to multi-regional Cloud Storage.
Cloud Storage | Regional or multi-regional; Nearline and Coldline classes for infrequent access | Used for data lake raw storage. | Use multi-regional for all production and critical data, regional for non-production and non-critical needs, and Nearline or Coldline for backups and archives that are accessed infrequently.
Cloud Interconnect | Global; interconnect attachments are regional | A highly available connection from an on-premises network to Google's network. | Always use global and redundant connections.
VPC network | Global; individual subnets are regional | VPCs are where projects live and resources are shared. | Global, with multi-regional and multi-zonal deployments within regions.
Managed instance group | Zonal, Regional | A group of VM instances that acts as a single unit. | Use regional MIGs to make Compute Engine instances highly available.
Kubernetes Engine | Zonal by default; supports regional clusters | Used for highly available and scalable microservices and applications. | Use globally persisted storage to save state for stateful workloads. Use a continuous delivery (CD) process to redeploy in another region in case of regional failures.
App Engine and Cloud Functions | Managed services | Used to run microservices and applications. | Managed services, but deployments can still fail; use a continuous delivery (CD) process to redeploy in case of regional failures.
BigQuery | Managed service; US multi-region by default | Used for analytical data storage. | The US multi-region default fulfills our DR and BCP requirements.
Cloud Dataflow | Regional | Used for ETL pipelines. | Persist pipeline metadata in Cloud Storage and BigQuery; in case of failures, rebuild the pipelines in a different region through continuous delivery (CD) pipelines.
Composer | Zonal | Used to run Airflow ETL pipelines. | Persist pipeline metadata in Cloud Storage and BigQuery; in case of failures, rebuild the environment in a different region through continuous delivery (CD) pipelines.
Dataproc | Zonal | Used to run Hadoop/Spark clusters. | In case of failures, rebuild the clusters and jobs in a different region through continuous delivery (CD) pipelines; keep data that must be retained on multi-regional Cloud Storage.
Load Balancing | Zonal, Regional, Global | A managed service that routes traffic to instances and applications. | Use global load balancing for critical applications.
Cloud SQL | Zonal by default; regional (multi-zonal) supported | Managed service for MySQL and PostgreSQL. | Achieve cross-region availability through replication to instances in other regions, or through backups and log shipping to multi-regional storage.
Pub/Sub | Global | Managed streaming and messaging service. | Global by default; fulfills DR and BCP requirements.
Bigtable | Zonal/regional; supports multi-region replication | Highly scalable NoSQL service. | Configure replication across regions to fulfill DR and BCP requirements.
Spanner | Regional and multi-region configurations | Highly scalable relational database service. | Use a multi-region configuration to fulfill DR and BCP requirements.
Datastore | Regional and multi-region locations | Highly scalable NoSQL document database. | Use a multi-region location to fulfill DR and BCP requirements.
6 Considerations and Assumptions

Even though this blog touches on some network aspects, it doesn't fully cover all network requirements, such as having redundant networks between on-premises and GCP, having multiple interconnects if you are using vendor-provided interconnects, and so on.

This blog also doesn't fully cover OLTP database needs and other applications beyond the scope of the data lake, as I don't have much insight into those projects.

This blog treats GCP regions as geographically separated data centers and assumes GCP is sufficient for our DR and BCP needs. It does not cover an outage of GCP as a whole, for whatever reason; that is where a multi-cloud approach comes into the picture.

7 Pricing and other costs
As far as high availability is concerned, global or multi-regional deployments provide far superior coverage compared to regional and zonal ones, and regional provides much higher availability than zonal.

From a cost perspective, however, global or multi-regional is far more expensive than regional and zonal, and regional is more expensive than zonal. Sometimes costs more than double.

For example, to make VM instances redundant, you need at least two instances, a managed instance group (MIG), and additional networking to keep both instances in sync. The cost will be more than double in this case.

For Kubernetes, on the other hand, a regional cluster mainly adds redundant masters, so the cost goes up marginally, far less than two times.

For managed services like BigQuery, Pub/Sub, and App Engine, there are no additional costs.

For Cloud Storage, the cost differs depending on whether you store data in the multi-regional, regional, Coldline, or Nearline tiers. Multi-regional is more expensive, while Coldline and Nearline are cheaper; sometimes the cost difference is huge.

When you design and build resources, keep pricing, SLAs, and DR requirements in mind. Some solutions may satisfy the requirements while also being cheaper.

Besides pricing and other costs, there are also latencies to keep in mind. Cross-zone and cross-region round trips add a few milliseconds of latency to application response times. Different layers of an application, such as the front end, back end, and database, need to be placed appropriately for better performance and to reduce network costs.


About the Author

The author has extensive experience in Big Data technologies and has worked in the IT industry for over 25 years in various capacities after completing his BS and MS in computer science and data science, respectively. He is a certified cloud architect and holds several certifications from Microsoft and Google. Please contact him at srao@unifieddatascience.com with any questions.