
Unified Lambda Architecture (Realtime DB + DW + OLAP)

Srinivasa Rao • June 19, 2023

Database types

Realtime DB
The database should be able to scale and keep up with the huge volumes of data arriving from streaming sources such as Kafka and IoT devices. The SLA for ingestion latency should be milliseconds to a few seconds.

Users should also be able to query the real-time data with millisecond or sub-second response times.

Data Warehouse (Analytics)
A data warehouse is specially designed for data analytics, which involves reading large amounts of data to understand relationships and trends across the data. The data is generally stored in denormalized form using a Star or Snowflake schema.

"Data warehouse" is used here in a slightly broader sense; strictly speaking, this solution addresses a data mart, which is a subset of a data warehouse serving a particular business segment rather than the whole enterprise.

In this use case, the users not only query the real time data but also do some analytics, machine learning and reporting.

OLAP
OLAP is a kind of data structure where the data is stored in multi-dimensional cubes. The values (or measures) are stored at the intersection of the coordinates of all the dimensions.

In the cube illustration, for example:

Blue → 2017 Apr + “Cost of Goods Sold” + Accessories

Yellow → Clothing + “Gross Revenue” + 2016 Jan

To get 2015 Units Sold for Accessories, sum the corresponding 5 coordinates. Each coordinate is itself the sum of all the underlying transactions.
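The coordinate-lookup idea above can be sketched in a few lines of Python. This is a toy illustration, not a real OLAP engine: the cube is a dictionary keyed by (year, month, measure, category), with each value holding the pre-aggregated total of the underlying transactions at that coordinate; all data values are made up.

```python
# Toy OLAP cube: each key is a coordinate (year, month, measure, category),
# each value is the pre-aggregated total of all transactions at that point.
cube = {
    (2015, "Jan", "Units Sold", "Accessories"): 120,
    (2015, "Feb", "Units Sold", "Accessories"): 95,
    (2015, "Mar", "Units Sold", "Accessories"): 110,
    (2015, "Jan", "Units Sold", "Clothing"):    300,  # other category, excluded below
}

def rollup(cube, year, measure, category):
    """Sum every monthly coordinate matching the fixed dimension values."""
    return sum(value
               for (y, _month, meas, cat), value in cube.items()
               if y == year and meas == measure and cat == category)

print(rollup(cube, 2015, "Units Sold", "Accessories"))  # 325
```

Answering "2015 Units Sold for Accessories" is then just summing the coordinates where those three dimension values intersect, exactly as described above.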


Problem Statement


Any proposed solution should be able to address the following four main areas:

●     It should be able to consume the data from streaming systems like Kafka in real time with millisecond or very low second latencies. It should also process the data in batches.

●     Users should be able to query data with millisecond to very low second latencies.

●     The database should be able to scale up to hundreds of concurrent queries.

●     The users should be able to do analytics on the source data.

Why not databases like Amazon Redshift?

●     Cannot ingest streaming data at this volume.

●     Does not give millisecond response times.

●     Does not provide high concurrency.

●     Managed service might cost more.


Why not NoSQL (e.g., Cassandra, DynamoDB, HBase, MongoDB/Couchbase)?

●     NoSQL databases like Cassandra and HBase scale with real-time streaming. MongoDB/Couchbase will not scale for real-time streaming at this volume.

●     Provide millisecond response times, but only for queries on partition keys.

●     Provides High Concurrency.

●     Does not support analytics.

●     Does not support OLAP.
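The partition-key limitation above is worth spelling out. The sketch below is a toy model of a hash-partitioned store (not a real NoSQL client; all names are made up): a lookup by partition key hashes straight to one partition, while a predicate on any other column must scan every partition, which is why the millisecond guarantee holds only for key lookups.

```python
# Toy hash-partitioned store: rows are distributed across partitions by key.
NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def put(key, row):
    partitions[hash(key) % NUM_PARTITIONS][key] = row

def get_by_partition_key(key):
    # O(1): hashing the key identifies the single partition holding the row.
    return partitions[hash(key) % NUM_PARTITIONS].get(key)

def get_by_other_column(col, value):
    # Full scan: every partition must be searched -- no millisecond guarantee.
    return [row for part in partitions for row in part.values()
            if row.get(col) == value]

put("user-1", {"name": "Ada", "city": "Austin"})
put("user-2", {"name": "Grace", "city": "Boston"})
```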


Unified Lambda Architecture


This blog proposes a unified analytical solution (real-time DW + OLAP) that addresses the following:

●     Real Time data streaming

●     Real Time Analytics

●     High Concurrency

●     OLAP Analytics

●     High Availability

●     Cost effectiveness


The proposed solution will solve the current use cases and provide a strong enterprise platform, adding value to the data through more real-time reporting and analytics, and the ability to train models on both real-time data and large historical volumes.

The above diagram explains how the proposed solution works. It uses a “unified lambda architecture”, which combines the batch, stream, and speed layers. Data is ingested into detailed tables, and the rollups/cubes are synced behind the scenes based on the design and instructions provided.

 

When users access the data, the query optimizer routes each query to the appropriate underlying data (detailed, aggregated, or both) based on what the query needs. End users need not know the underlying mechanism; they simply use SQL to access the data.
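The routing decision described above can be sketched as a simple coverage check. This is a hypothetical illustration of the idea (the rollup names and column sets are made up, and real optimizers are far more sophisticated): if every column a query touches is covered by a pre-aggregated rollup, serve the query from the rollup; otherwise fall back to the detailed table.

```python
# Hypothetical rollup catalog: rollup name -> columns it covers.
ROLLUPS = {
    "daily_sales_rollup": {"order_date", "category", "revenue"},
}

def route(query_columns):
    """Pick the data source that can answer a query on the given columns."""
    needed = set(query_columns)
    for name, covered in ROLLUPS.items():
        if needed <= covered:
            return name          # serve from the smaller pre-aggregated rollup
    return "detail_table"        # needs raw detail rows

print(route(["order_date", "revenue"]))   # served from the rollup
print(route(["customer_id", "revenue"]))  # falls back to the detail table
```

The point of the architecture is that this choice happens inside the engine, so the user writes one SQL statement either way.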


Technologies


      Apache Druid

      StarRocks

      Apache Kylin

 

The following table compares the features for each technology mentioned above:

| Item | Druid | Kylin | StarRocks |
|---|---|---|---|
| Type | Time series with good OLAP support | Pure OLAP (MOLAP); uses MDX cubes | Real-time data warehouse with OLAP; uses materialized views |
| Concurrency/Throughput | High | Very high | Very high |
| Storage | Columnar/Parquet | HBase/Parquet | Columnar/Parquet |
| Indexing | Time is always the primary partition | All dimensions are equally partitioned/indexed | Primary, bitmap, and bloom-filter indexes |
| Updates/Deletes | Not supported | Not supported | Supported, provided tables are created with the proper schema |
| OLAP | Some rollups, mainly time-based | Full OLAP (MOLAP) | Aggregate and materialized views; supports MOLAP/HOLAP/ROLAP |
| Real-time ingestion | Scales to 100s of millions of rows per hour | Scales to 100s of millions of rows per hour | Scales to 100s of millions of rows per hour |
| Analytics | Supports aggregates and slice-and-dice of the data | Very fast analytics compared to other databases, because data is stored in MOLAP form with all permutations and combinations of the dimensions pre-aggregated | Can create all possible aggregations, and the aggregations needed for fast analytics can be predefined |
| Star schema/Data warehouse | Supported through built-in lookups, or via a mixed approach: the fact table in Druid with dimension lookups in another store such as Redis | Supports star schema during processing, but dimension metadata must be cached or stored elsewhere (e.g., Redis) for fast access | Fully supports star schemas and even snowflake schemas |
| Window functions/complex SQL | Not supported | Supported to a good extent | Fully supported |
| Hive/external tables | Not supported | Not supported | Supported; also supports MySQL tables |
| Table joins | Supported to a certain extent, but not recommended | Supported to a certain extent, but not recommended | Fully supported |
| Connectivity | HTTP and its own connectors | HTTP and its own connectors | MySQL client connectors and MySQL JDBC, so it connects wherever MySQL can |
| Maintenance | Moderate | High | Low to moderate |
| Vendor support | Imply | Kyligence | StarRocks |


Cubes vs. Materializers


An OLAP cube is a data structure that overcomes the limitations of relational databases by providing rapid analysis of data. Cubes can display and sum large amounts of data while also providing users with searchable access to any data points. This way, the data can be rolled up, sliced, and diced as needed to handle the widest variety of questions that are relevant to a user's area of interest.

 

Processing of an OLAP cube occurs when all the aggregations for the cube are calculated and the cube is loaded with these aggregations and data. Dimension and fact tables are read, and the data is calculated and loaded into the cube. When designing an OLAP cube, processing must be considered carefully because of the significant impact it can have in a production environment where millions of records may exist. A full process is very compute-intensive, and its duration grows with the volume of data and the aggregations being computed.

 

Adding more dimensions adds more complexity to cube processing. When real-time data comes in, it might take some time to add it to all partitions.
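The cost of adding dimensions is combinatorial: a full cube pre-aggregates one "cuboid" for every subset of the dimensions, so the number of cuboids is 2^d and each new dimension roughly doubles the processing work. A small sketch of that growth:

```python
from itertools import combinations

def cuboid_count(num_dimensions: int) -> int:
    # A full cube materializes one cuboid per subset of dimensions: 2**d total.
    return 2 ** num_dimensions

# Enumerating the subsets for 3 dimensions makes the count concrete.
dims = ["date", "product", "region"]
cuboids = [c for r in range(len(dims) + 1) for c in combinations(dims, r)]

print(len(cuboids))        # 8 cuboids for 3 dimensions
print(cuboid_count(10))    # 1024 cuboids for 10 dimensions
```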

 

The amount of cube processing can be reduced by optimizing it based on query usage and/or space-based optimization.

 

On the other hand, Materializers (materialized views) pre-compute only a small subset of the cube, which reduces the time and complexity of cube processing.
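The subset idea can be sketched as follows. This is a hypothetical illustration (the dimension names, query log, and threshold are made up): instead of building every cuboid, build only the dimension groupings that the query workload actually uses often.

```python
from itertools import combinations

dimensions = ["date", "product", "region", "customer"]

# Hypothetical query log: dimension groupings users ask for, and how often.
query_counts = {
    ("date",): 120,
    ("date", "product"): 80,
    ("region",): 5,   # rarely queried -- not worth materializing
}

# Full cube: every subset of the dimensions would be pre-aggregated.
full_cube = [c for r in range(len(dimensions) + 1)
             for c in combinations(dimensions, r)]

# Materialized-view approach: keep only groupings queried often enough.
MIN_QUERIES = 10
materialized = [g for g, n in query_counts.items() if n >= MIN_QUERIES]

print(len(full_cube), "cuboids in the full cube")
print(len(materialized), "materialized views actually built")
```

Here 2 materialized views replace a 16-cuboid full cube; rare queries on other groupings fall back to the detail data, trading some query latency for far less processing.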

About the Author

The author has extensive experience in big data technologies and has worked in the IT industry for over 25 years in various capacities after completing his BS and MS in computer science and data science, respectively. He is a certified cloud architect and holds several certifications from Microsoft and Google. Please contact him at srao@unifieddatascience.com with any questions.