
Data lake design patterns on Google Cloud (GCP)

Srinivasa Rao • May 7, 2020

Build scalable data lakes on Google Cloud (GCP)

Unlike a traditional data warehouse, a data lake often involves a combination of multiple technologies. It is very important to understand those technologies and learn how to integrate them effectively. This blog walks through different patterns for successfully implementing a data lake on Google Cloud Platform.

Pattern I: Full Data lake stack
Pattern II: Unified Batch and Streaming model
Pattern III: Lambda streaming architecture
When we build scalable, high-performing data lakes on the cloud or on-premises, two broad groups of tools and processes play a critical role. The first group is used to build data pipelines and store the data. The second group is not directly involved in data lake design and development, but is critical to the success of any data lake implementation; it includes data governance and data operations.

The above diagrams show how different Google managed services can be used and integrated to build a full-blown, scalable data lake. You may add or remove certain tools based on your use cases, but data lake implementations generally revolve around these concepts.

Here is a brief description of each component in the above diagrams.

1. Data Sources
The data can come from multiple disparate data sources, and the data lake should be able to handle all of the incoming data. The following are some of the sources:
• OLTP systems like Oracle, SQL Server, MySQL or any RDBMS.
• Various file formats like CSV, JSON, Avro, XML, binary and so on.
• Text-based and IoT streaming data

2. Data Ingestion
Collecting and processing the incoming data from various data sources is a critical part of any successful data lake implementation. It is often the most time-consuming and resource-intensive step.

GCP offers several highly scalable managed services for developing and implementing complex data pipelines at any scale.

Cloud Dataflow
Cloud Dataflow is a fully managed Google service for building unified batch and streaming data pipelines. It is built on the open source Apache Beam project. It provides horizontal scaling and is tightly integrated with other big data components like BigQuery, Pub/Sub and Stackdriver.
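A minimal sketch of a Beam batch pipeline that could be submitted to Dataflow; the project, region, bucket and file names below are placeholders, not real resources:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket names; swap in your own before running.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read raw CSV" >> beam.io.ReadFromText("gs://my-bucket/raw/events.csv")
        | "Split columns" >> beam.Map(lambda line: line.split(","))
        | "Drop bad rows" >> beam.Filter(lambda cols: len(cols) >= 3)
        | "Re-serialize" >> beam.Map(lambda cols: ",".join(cols))
        | "Write curated" >> beam.io.WriteToText("gs://my-bucket/curated/events")
    )

The same pipeline structure can read from Pub/Sub instead of Cloud Storage to run in streaming mode, which is what makes the unified batch and streaming model attractive.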


Cloud Dataproc
Cloud Dataproc is a managed Google service for the Hadoop/Spark ecosystem. You can use Dataproc for various purposes:
• To build data pipelines using Spark, especially when you have a lot of existing Spark code to migrate from on-premises.
• To lift and shift an existing Hadoop environment from on-premises to the cloud.
• To use Hive and HBase databases as part of your use cases.
• To build machine learning and AI pipelines using Spark.

Dataproc clusters can be created on demand and can be autoscaled depending on need. You can also use preemptible VMs where you don't need production-grade SLAs; they cost a lot less compared to regular instances.
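A sketch of creating an on-demand cluster with preemptible secondary workers using the google-cloud-dataproc Python client; the project, region and cluster names are placeholders, and the request shape may differ slightly between client library versions:

from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "demo-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Secondary workers are preemptible by default, which keeps cost down.
        "secondary_worker_config": {"num_instances": 2},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)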


Cloud Composer
Cloud Composer is a fully managed workflow engine built on top of the open source Apache Airflow project. You can call Dataflow and Dataproc Spark jobs from Composer to create a controlled workflow and schedule them to run.
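A sketch of what such a Composer (Airflow) DAG might look like, submitting a PySpark job to Dataproc on a daily schedule; the operator comes from the apache-airflow-providers-google package and its parameters can vary by version, and all project, bucket and cluster names are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"

with DAG(
    dag_id="daily_ingest",
    schedule_interval="0 2 * * *",   # run every day at 02:00
    start_date=datetime(2020, 5, 1),
    catchup=False,
) as dag:
    transform = DataprocSubmitJobOperator(
        task_id="run_spark_transform",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": "demo-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},
        },
    )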


Cloud Data Fusion
Cloud Data Fusion is a newer entry in the GCP ETL/ELT stack. It is a fully managed, cloud-native data pipeline tool that allows developers to build high-performing data pipelines graphically. It also ships with a large open source library of preconfigured connectors for most common data sources. Its largely code-free pipeline model saves a lot of time when developing complex transformations and data ingestion.


3. Raw data layer
Object storage is central to any data lake implementation, and Cloud Storage serves as the raw layer. You can build a highly scalable and highly available raw layer using Cloud Storage, which also provides very high SLAs.

There are also different storage classes, such as Multi-Regional, Regional, Nearline and Coldline, which can be used for different purposes based on requirements.
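A short sketch of landing a file in the raw layer with the google-cloud-storage Python client; the bucket and object names are placeholders:

from google.cloud import storage

client = storage.Client()                  # uses application default credentials
bucket = client.bucket("my-raw-zone")      # placeholder bucket name

# Land a source extract under a date-partitioned prefix in the raw layer.
blob = bucket.blob("sales/2020/05/07/orders.csv")
blob.upload_from_filename("orders.csv")

# Quick sanity check: list what has landed for this source.
for b in client.list_blobs("my-raw-zone", prefix="sales/2020/05/07/"):
    print(b.name, b.size)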


4. Consumption layer
All of the items mentioned so far are internal to the data lake and are not exposed to external users. The consumption layer is where you store curated and processed data for end-user consumption. The end-user applications can be reports, web applications, data extracts or APIs.

The following are some of the criteria to consider when choosing a database for the consumption layer:
• Volume of the data.
• Data retrieval patterns: whether applications run analytical queries with aggregations and computations, or simply retrieve records based on filters.
• How data ingestion happens: large batches versus high-throughput writes (IoT or streaming), and so on.
• Whether the data is structured, semi-structured, quasi-structured or unstructured.
• SLAs.

Cloud Bigtable
Bigtable is a petabyte-scale managed NoSQL database that works well for both analytical and operational workloads. It is very good for high-throughput reads/writes and low-latency (sub-10 ms) database needs. Some of the use cases for Bigtable are streaming, time-series data, ad tech, fintech and so on. Bigtable provides single-row atomic transactions and eventual consistency across replicated clusters. It is a little more expensive compared to BigQuery and Datastore.
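A minimal sketch of a time-series style write and point read using the google-cloud-bigtable Python client; the instance, table and column family names are placeholders and are assumed to already exist:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")        # placeholder project
table = client.instance("events-instance").table("clickstream")

# Row keys that combine an entity id and a timestamp keep related
# time-series data adjacent, which suits Bigtable's sorted key design.
row = table.direct_row(b"user123#20200507T120000")
row.set_cell("metrics", "page", b"/home")
row.set_cell("metrics", "latency_ms", b"42")
row.commit()                                          # single-row atomic write

result = table.read_row(b"user123#20200507T120000")   # low-latency point read
print(result.cells["metrics"][b"page"][0].value)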


BigQuery
BigQuery is an important part of the Google data lake stack. It is a serverless, fully managed data warehouse that scales to petabytes of data. BigQuery is very good for analytical queries that involve a lot of aggregations and computations.

BigQuery also comes with ML, GIS and BI Engine features.
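For example, an analytical aggregation run through the google-cloud-bigquery Python client; the dataset and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()   # uses application default credentials

query = """
    SELECT customer_id, SUM(order_total) AS total_spend
    FROM `my-project.sales_mart.orders`      -- placeholder dataset/table
    WHERE order_date >= '2020-01-01'
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
"""

# client.query() starts the job; iterating the job waits for and streams results.
for row in client.query(query):
    print(row.customer_id, row.total_spend)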


Cloud Datastore (Firestore)

Cloud Datastore is a highly scalable, ACID-compliant NoSQL database that also supports indexing. It is a schemaless document store, useful where you need document-oriented NoSQL use cases with transactions.
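A small sketch of this document-style usage with the google-cloud-datastore Python client; the kind and property names are placeholders:

from google.cloud import datastore

client = datastore.Client()   # uses application default credentials

# Upsert a schemaless, document-style entity.
key = client.key("Order", "order-1001")
entity = datastore.Entity(key=key)
entity.update({"customer": "acme", "total": 125.50, "status": "shipped"})
client.put(entity)

# Query on an indexed property.
query = client.query(kind="Order")
query.add_filter("status", "=", "shipped")
for order in query.fetch(limit=10):
    print(order.key.name, order["total"])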


5. Machine Learning and Data Science
Machine learning and data science teams are among the biggest consumers of data lake data. They use this data to train their models, produce forecasts and apply the trained models to new data.

Cloud Dataprep
Cloud Dataprep is a managed service that helps you clean, explore and prepare data for analytics and machine learning. It comes with a visual graphical user interface.

Cloud Dataproc
Spark's ML libraries, available on Cloud Dataproc, can be used to build machine learning pipelines and to build and train models.
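A sketch of a simple Spark ML training pipeline that could run as a Dataproc PySpark job; the bucket paths, feature columns and label are placeholders:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-model").getOrCreate()

# Placeholder curated dataset produced by the ingestion pipelines.
df = spark.read.parquet("gs://my-bucket/curated/churn_features/")

assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_tickets"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

model = Pipeline(stages=[assembler, lr]).fit(df)
model.write().overwrite().save("gs://my-bucket/models/churn_lr")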

Cloud AutoML
Cloud AutoML provides various machine learning tools that allow you to build and train the high-quality machine learning models your business needs. Some of the AutoML products include Vision, Video Intelligence, Natural Language, Translation and Tables.


AI Hub
Google provides various plug-and-play AI components through AI Hub.


a. Data Governance 
Data governance on the cloud is a vast subject. It involves many things, such as security and IAM, data cataloging, data discovery, data lineage and auditing.

Data Catalog
Data Catalog is a fully managed metadata management service that integrates with other components such as Cloud Storage, BigQuery and Pub/Sub. You can quickly discover, understand and manage the data stored in your data lake.

You can view my blog for detailed information on data catalog.

Cloud Search
Cloud Search is an enterprise search tool that allows you to quickly, easily and securely find information.


Cloud IAM
Please refer to my data governance blog for more details.

Cloud KMS
Cloud KMS is a hosted key management service that lets you manage encryption keys in the cloud. You can create, rotate, use and destroy AES-256 encryption keys just as you would in your on-premises environments. You can also use the Cloud KMS REST API to encrypt and decrypt data.
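A minimal sketch of encrypting and decrypting a small payload with the google-cloud-kms Python client; the project, key ring and key names are placeholders, and the request style shown is from the 2.x client library:

from google.cloud import kms

client = kms.KeyManagementServiceClient()

# Placeholder project / key ring / key names.
key_name = client.crypto_key_path("my-project", "us-central1", "datalake-ring", "pii-key")

encrypted = client.encrypt(request={"name": key_name, "plaintext": b"sensitive value"})
decrypted = client.decrypt(request={"name": key_name, "ciphertext": encrypted.ciphertext})

print(decrypted.plaintext)   # b"sensitive value"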

b. Data Operations
Operations, monitoring and support are a key part of any data lake implementation. Google provides various tools to accomplish this.

Stackdriver
Google Cloud Platform offers Stackdriver, a comprehensive set of services for collecting data on the state of applications and infrastructure. Specifically, it supports three ways of collecting and receiving information. 

Please refer to my blog on cloud operations for full details.

Audit Logging
Cloud Audit Logs maintains three audit logs for each Google Cloud project, folder, and organization: Admin Activity, Data Access, and System Event. Google Cloud services write audit log entries to these logs to help you answer the question of "who did what, where, and when?" within your Google Cloud resources.
Please refer to my blog for more details.
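For instance, a sketch of reading recent Admin Activity audit entries with the google-cloud-logging Python client; the filter shown is illustrative and the project comes from application default credentials:

from google.cloud import logging

client = logging.Client()

# Admin Activity audit entries for BigQuery resources, newest first.
log_filter = (
    'logName:"cloudaudit.googleapis.com%2Factivity" '
    'AND resource.type="bigquery_resource"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)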




"About Author"

The author has extensive experience in big data technologies and has worked in the IT industry for over 25 years in various capacities after completing his BS and MS in computer science and data science, respectively. He is a certified cloud architect and holds several certifications from Microsoft and Google. Please contact him at srao@unifieddatascience.com with any questions.