
Infrastructure and Platform Architecture – Modern data lakes and databases

Srinivasa Rao • June 18, 2023
This blog puts together the infrastructure and platform architecture for a modern data lake. The following considerations were taken into account while designing the architecture:
  • Should be portable to any cloud and on-prem with minimal changes. 
  • Most of the technologies and processing will happen on Kubernetes so that it can be run on any Kubernetes cluster on any cloud or on-prem.
  • All the technologies and processes use auto-scaling features so that they allocate and use the minimum resources possible at any given time without compromising the end results.
  • It will take advantage of spot instances and cost-effective features and technologies wherever possible to minimize the cost.
  • It will use open-source technologies to save licensing costs.
  • It will auto-provision most of the technologies, such as Argo Workflows, Spark, and JupyterHub (the dev environment for ML), which minimizes the use of provider-specific managed services. This not only saves money but also keeps the platform portable to any cloud, multi-cloud, or on-prem environment.
Concept
The entire infrastructure and platform for modern data lakes and data platforms consists of three main parts at a very high level:
  • Code Repository
  • Compute
  • Object store
The main concept behind this design is “work anywhere at any scale”, at low cost and with high efficiency. The design should work on any cloud such as AWS, Azure, or GCP, as well as on-premises. The entire infrastructure is reproducible on any cloud or on-premises platform with minimal modifications to code.

Below is the design diagram showing how the different parts interact with each other. The only prerequisites for implementing this are a Kubernetes cluster and an object store.

Object store

The object store is the critical part of the design and holds all the persistent data, including the entire data lake, deep storage for databases (metadata, catalogs, and online data), and logs or any other information that needs to be persisted. Different policies will be created to archive and purge data based on retention policies and compliance.


Code Repository

The entire infrastructure, software, and data pipelines will be created from a code repository, which plays a very key role in this project. The entire “compute” layer can be created from the code repository in conjunction with the DevOps toolset.


Compute

All the compute will be created on Kubernetes using the code repository and CI/CD pipelines. It can work on any cloud or on-premises as long as a Kubernetes environment is available.


If we implement this on the cloud, we plan to run as much of the processing as possible on spot instances, which will significantly reduce the cost. The ideal target for the entire compute is 30% (or less) on regular instances and 70% (or more) on spot instances.


Technical Architecture

The following is the proposed infrastructure and platform technical architecture, which should satisfy all of the data lake project's technical needs. Most of the technologies will be built as self-serve capabilities using automation tools so that developers, data scientists, and ML engineers can use them as needed.


Data Processing and ML Training

Most of the work is done during this stage of the project, and most of the project's cost is incurred here.


Argo workflow and Argo Events

Argo is an open-source workflow engine for orchestrating tasks on Kubernetes. Argo allows you to create and run advanced workflows entirely on your Kubernetes cluster. Argo Workflows is built on top of Kubernetes, and each task is run as a separate Kubernetes pod. Many reputable organizations in the industry use Argo Workflows for ML (Machine Learning), ETL (Extract, Transform, Load), data processing, and CI/CD pipelines.

 

Argo is essentially a Kubernetes extension and is therefore installed on Kubernetes itself. Argo Workflows allows organizations to define their tasks as DAGs using YAML. It comes with a native workflow archive for auditing, cron workflows for scheduled runs, a fully featured REST API, and a list of killer features that set it apart from similar products.

Note: the above diagram is courtesy of Argo documentation.
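As a minimal sketch (not the author's exact setup), the snippet below defines a two-step DAG as an Argo Workflow custom resource and submits it with the official Kubernetes Python client. The namespace, image, and task names are illustrative assumptions.

```python
from kubernetes import client, config

# A two-task DAG: "transform" runs only after "extract" completes.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "etl-dag-"},
    "spec": {
        "entrypoint": "main",
        "templates": [
            {
                "name": "main",
                "dag": {
                    "tasks": [
                        {"name": "extract", "template": "step"},
                        {"name": "transform", "template": "step",
                         "dependencies": ["extract"]},
                    ]
                },
            },
            {
                "name": "step",
                "container": {"image": "alpine:3.19",          # placeholder image
                              "command": ["echo", "running step"]},
            },
        ],
    },
}

config.load_kube_config()                      # or load_incluster_config() inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", namespace="argo",  # assumed namespace
    plural="workflows", body=workflow)
```

In practice the same manifest is usually kept as YAML in the code repository and applied through the CI/CD pipeline rather than submitted ad hoc.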


Argo Events is an event-driven workflow automation framework for Kubernetes which helps us trigger K8s objects, Argo Workflows, serverless workloads, etc. on events from a variety of sources such as webhooks, S3, schedules, messaging queues, GCP Pub/Sub, SNS, SQS, and more.

Note: the above diagram is courtesy of Argo documentation.
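As a hedged illustration, the sketch below registers a webhook EventSource with Argo Events; each POST to the endpoint becomes an event that a Sensor (omitted here) can map to a Workflow trigger. The namespace, port, and endpoint are assumptions.

```python
from kubernetes import client, config

# Webhook EventSource: Argo Events exposes this endpoint and publishes
# an event for every POST it receives.
event_source = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "EventSource",
    "metadata": {"name": "data-arrival", "namespace": "argo-events"},
    "spec": {
        "webhook": {
            "new-data": {                 # event name a Sensor would reference
                "port": "12000",
                "endpoint": "/new-data",
                "method": "POST",
            }
        }
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", namespace="argo-events",
    plural="eventsources", body=event_source)
```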


Key Features and advantages of Argo

Open-Source: Argo is a fully open-source project at the Cloud Native Computing Foundation (CNCF). It is available free of cost for everyone to use.

Native Integrations: Argo comes with native artifact support to download, transport, and upload files during runtime. It supports artifact repositories such as S3-compatible stores (including AWS S3 and MinIO), GCS, HTTP, Git, and raw artifacts. It is also easy to integrate with Spark using SPOK (Spark on Kubernetes), and it can be used to run Kubeflow ML pipelines.

Scalability: Argo Workflows has robust retry mechanisms for high reliability and is highly scalable. It can manage thousands of pods and workflows in parallel.

Customizability: Argo is highly customizable, and it supports templating and composability to create and reuse workflows.

Powerful User Interface: Argo comes with a fully featured User Interface (UI) that is easy to use. Argo Workflows v3.0 UI also supports Argo Events and is more robust and reliable. It has embeddable widgets and a new workflow log viewer.

Cost savings: We can run a lot of the processing on spot instances where permitted, which saves a lot of cost.


Spark (SPOK)

It is recommended to use Spark for most of the processing wherever possible, including data ingestion, transformation, and ML training. Spark is a reliable and proven distributed processing engine that can process ETL pipelines very fast. It is heavily used in the industry and is the standard for any big data and data lake project. We can also leverage spot instances for Spark processing, which reduces the overall cost of the project.

 

The following is an example of how we can effectively run Spark on spot node pools. Since almost all the processing happens on executor nodes, running executors on spot node pools saves cost tremendously. Spark can also reprocess any failed tasks automatically, which makes it a reliable candidate for running workloads on spot instances.
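A minimal PySpark sketch of this idea, assuming a Spark-on-Kubernetes container image and a node pool labeled "lifecycle=spot"; the image name, label, paths, and API endpoint are placeholders rather than the post's actual values.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://KUBE_API_SERVER:443")                    # placeholder API endpoint
    .config("spark.kubernetes.container.image", "myrepo/spark:3.5.0")  # assumed image
    .config("spark.kubernetes.namespace", "spark")
    # Executors (where the heavy lifting happens) go to the spot node pool;
    # "lifecycle=spot" is an assumption about how the pools are labeled.
    .config("spark.kubernetes.executor.node.selector.lifecycle", "spot")
    # Scale executors up and down with the workload instead of holding a fixed fleet.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)

# Hypothetical job: read raw data from the object store, aggregate, write back.
df = spark.read.parquet("s3a://my-datalake/raw/events/")
(df.groupBy("event_type").count()
   .write.mode("overwrite")
   .parquet("s3a://my-datalake/curated/event_counts/"))
```

If a spot node is reclaimed, the lost executor's tasks are retried on other executors, which is what makes this pattern cost-effective without sacrificing reliability.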

ArgoCD

ArgoCD allows us to deploy custom or vendor-provided images to process incoming data and other workloads. These can also run on a spot instance node pool with retries, although some enhancements may be required to run them reliably on spot instance node pools.


Hive

Hive can be used to access data stored on any file system using SQL. It can be used for metadata cataloging as well. Hive also comes with ACID transactional features, which can be used for delta lake use cases if needed.
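A brief sketch of this pattern through Spark SQL with Hive support: an external table is declared over files already sitting in the object store, after which the data is queryable with plain SQL. The bucket, table, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark use the Hive metastore as the catalog.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
        order_id STRING,
        amount   DOUBLE,
        ts       TIMESTAMP
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 's3a://my-datalake/raw/sales/'
""")

spark.sql("MSCK REPAIR TABLE sales_raw")   # register partitions already in the store
spark.sql("SELECT dt, SUM(amount) FROM sales_raw GROUP BY dt").show()
```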


Delta lake or Hive Transactional

Delta lake allows row-level manipulation of data and supports time travel on the data. Hive transactional tables, Databricks Delta Lake, and Apache Iceberg are some of the technologies that facilitate delta lake capabilities on big data.
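A hedged sketch of row-level upserts and time travel using the open-source delta-spark package (one possible choice among the formats mentioned above); the paths and column names are made up for illustration, and the Delta jars are assumed to be on the Spark classpath.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://my-datalake/delta/customers/"                       # hypothetical location
updates = spark.read.parquet("s3a://my-datalake/raw/customer_updates/")

# Row-level upsert (MERGE) into the Delta table.
target = DeltaTable.forPath(spark, path)
(target.alias("t")
       .merge(updates.alias("u"), "t.customer_id = u.customer_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Time travel: read the table as it existed at an earlier version.
old = spark.read.format("delta").option("versionAsOf", 0).load(path)
```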


Streaming (Kafka)

Kafka or similar technologies can be used where we need to process non-file-based incoming data in real time.
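A minimal sketch of consuming a Kafka topic with Spark Structured Streaming and landing it in the object store; the broker address, topic, and paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")      # placeholder broker
    .option("subscribe", "incoming-events")               # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(col("key").cast("string"),
            col("value").cast("string"),
            col("timestamp"))
)

query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://my-datalake/raw/events/")
    .option("checkpointLocation", "s3a://my-datalake/checkpoints/events/")
    .trigger(processingTime="1 minute")                   # micro-batch every minute
    .start()
)
query.awaitTermination()
```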


JupyterHub

JupyterHub brings the power of notebooks to groups of users. It gives users access to computational environments and resources without burdening the users with installation and maintenance tasks. Users - including engineers and data scientists - can get their work done in their own workspaces on shared resources which can be managed efficiently by system administrators.

Running JupyterHub on Kubernetes makes it possible to serve a pre-configured data science environment to any user.
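A hedged excerpt of what a jupyterhub_config.py for JupyterHub on Kubernetes (KubeSpawner) could look like; the profile names, images, and spot-pool label are illustrative assumptions, not the author's actual configuration.

```python
# `c` is the configuration object JupyterHub injects when loading this file.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Offer users pre-built data science environments of different sizes.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Small (2 CPU / 8 GiB)",
        "kubespawner_override": {
            "image": "jupyter/pyspark-notebook:latest",
            "cpu_limit": 2,
            "mem_limit": "8G",
        },
    },
    {
        "display_name": "GPU training",
        "kubespawner_override": {
            "image": "myrepo/ml-notebook:latest",          # hypothetical image
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]

# Schedule single-user notebook pods onto a cheaper (e.g. spot) node pool.
c.KubeSpawner.node_selector = {"lifecycle": "spot"}
```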


Persistent data storage

Object store (S3, Azure Blob, GCS) plays a very critical role in storing all the data, which is a key part of this project. The following are some of the ways to store and process data from the object store cost-effectively (a short code sketch follows the list):

❖   Compress data effectively. This reduces storage costs, IO costs, and processing times.

❖   Retain only required data. Have an Archival process in place.

❖   Partition data effectively. This will reduce the IO costs when processing the data.

❖   Minimize direct access to raw data by incorporating presentation layers that expose only the minimal data needed.

❖   Utilize cold storage wherever possible.
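A short sketch of the first points above: compressed, partitioned Parquet written back to the object store so later jobs scan only the partitions they need. Paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.json("s3a://my-datalake/landing/clickstream/")

(raw.write
    .mode("overwrite")
    .partitionBy("dt")                           # prune IO by date partition
    .option("compression", "snappy")             # cut storage and transfer cost
    .parquet("s3a://my-datalake/raw/clickstream/"))

# Downstream readers touch only the partitions they ask for:
one_day = (spark.read.parquet("s3a://my-datalake/raw/clickstream/")
                .where("dt = '2023-06-01'"))
```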


Virtualization layer

Because accessing data directly from the object store is cumbersome, it is a good idea to build a virtualization layer on top of it. This brings the following advantages to the raw data:

❖   Data can be accessed using SQL statements.

❖   This will serve as a metadata catalog on top of the Object store.

❖   Partitioning and other optimization techniques can be used to reduce IO when consuming data from the object store.

Hive, Athena, Presto and Redshift Spectrum are some of the technologies that can be used for this purpose.
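A hedged sketch of SQL access through such a layer, here using the Trino Python client (Trino is the successor to PrestoSQL) against a Hive catalog; the host, catalog, and table names are assumptions.

```python
import trino

conn = trino.dbapi.connect(
    host="trino.data-platform.svc.cluster.local",   # hypothetical in-cluster service
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
cur.execute("""
    SELECT dt, SUM(amount) AS revenue
    FROM sales_raw
    WHERE dt >= '2023-06-01'
    GROUP BY dt
    ORDER BY dt
""")
for row in cur.fetchall():
    print(row)
```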


Database (Speed/Consumption data layer)

An OLAP and/or NoSQL database may be required to satisfy end-user data consumption needs. Databases like StarRocks, Druid, or Cassandra may be used for this purpose. The selection of the database is based on the following requirements (a connection sketch follows the list):

❖   The volume of the data stored in the database.

❖   The amount of data ingested per time interval.

❖   How the data is ingested, e.g., real-time streaming and/or batch processing.

❖   Response times needed while accessing data by end users.

❖   Expected QPS/TPS (queries/transactions per second).

❖   Data access patterns.
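A hedged sketch of querying one such consumption-layer database: StarRocks speaks the MySQL wire protocol, so a standard MySQL client works. The host, port, credentials, and table are illustrative assumptions.

```python
import pymysql

conn = pymysql.connect(
    host="starrocks-fe.data-platform.svc.cluster.local",  # hypothetical FE service
    port=9030,                      # assumed default MySQL-protocol port of the FE
    user="analyst",
    password="***",
    database="marts",
)

with conn.cursor() as cur:
    cur.execute(
        "SELECT event_type, COUNT(*) FROM events_rt "
        "WHERE event_time >= NOW() - INTERVAL 1 HOUR "
        "GROUP BY event_type"
    )
    print(cur.fetchall())
```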


Data facilitation and secure data access

This stage is lightweight and facilitates access to the data, which is stored in a persistent data store such as a database. It also provides authentication and authorization mechanisms for accessing the data.

Serverless cloud functions and/or a microservice architecture may be used to expose the REST APIs.
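A minimal sketch of this layer as a microservice: a FastAPI endpoint that checks a bearer token before returning data from the consumption database. The token check and the response are simplified placeholders, not a production authentication design.

```python
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

def authorize(authorization: str = Header(...)) -> str:
    # Placeholder: validate a JWT/OAuth token against your identity provider.
    if not authorization.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Missing bearer token")
    return authorization.removeprefix("Bearer ")

@app.get("/metrics/{event_type}")
def get_metric(event_type: str, token: str = Depends(authorize)):
    # Placeholder: in practice, query the speed/consumption database here.
    return {"event_type": event_type, "count_last_hour": 0}
```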


Orchestration and DevOps


Kubernetes, Docker, and Container services

Kubernetes, Docker, and container services play key roles in this project. These technologies facilitate most of the components mentioned above. Technologies like Spark will be built and made readily available as containers so that developers can use them to build workflows, pipelines, and so on.


The Kubernetes cluster will have different node pools (or node groups) to run different workloads. Most of the processing pipelines and model training can run on node pools (groups) with spot instances.
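A brief sketch of how a workload pod is steered to a spot node pool: a nodeSelector plus a toleration for the pool's taint. The "lifecycle=spot" label/taint and the image are assumptions about how the pools are tagged.

```python
# Fragment of a pod spec (expressed as a Python dict) targeting a spot pool.
pod_spec = {
    "nodeSelector": {"lifecycle": "spot"},
    "tolerations": [
        {
            "key": "lifecycle",
            "operator": "Equal",
            "value": "spot",
            "effect": "NoSchedule",
        }
    ],
    "containers": [
        {"name": "etl", "image": "myrepo/etl-job:latest"}   # hypothetical image
    ],
}
```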


CI/CD

Jenkins (CI) and Argo CD (CD) will be used for continuous integration and continuous deployment into Kubernetes. Terraform may be used for non-Kubernetes cloud deployments. Additional tools like Chef or Ansible may be leveraged when the need arises.


Git is used as the code repository and will be integrated with the CI/CD pipelines.
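A hedged sketch of the GitOps piece: an Argo CD Application custom resource pointing at a Git path, created with the Kubernetes Python client. The repository URL, path, and namespaces are illustrative placeholders.

```python
from kubernetes import client, config

application = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "spark-platform", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {
            "repoURL": "https://git.example.com/platform/infra.git",  # placeholder repo
            "path": "k8s/spark",
            "targetRevision": "main",
        },
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": "spark",
        },
        # Keep the cluster in sync with Git automatically.
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", namespace="argocd",
    plural="applications", body=application)
```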

"About Author"

The author has extensive experience in big data technologies and has worked in the IT industry for over 25 years in various capacities, after completing his BS and MS in computer science and data science, respectively. He is a certified cloud architect and holds several certifications from Microsoft and Google. Please contact him at srao@unifieddatascience.com with any questions.