
Data modeling techniques for modern data warehousing

Srinivasa Rao • April 11, 2020

Big Data presentation layer (Operational)

1. Introduction

This document presents various data modeling techniques that can be followed for the presentation layer of an operational data warehouse or data lake. The techniques apply to Google BigQuery and similar technologies such as Redshift, Snowflake, Azure SQL Data Warehouse, Couchbase, MongoDB and so on. Based on use cases and usage patterns, these techniques can be adopted for different modules and projects within your data lake sphere.

The presented data modeling techniques are a combination of denormalized (flattened), dimensional and set/relational models that should cover all of your use cases.

These data modeling techniques are intended for operational BigQuery (or similar) datasets from which end users access data through tools such as Tableau.

2. Types of Data

Business master data

This data contains all of your business master data and will be stored in a separate dataset. This data will be shared among all other projects/datasets.

This helps you avoid duplicating master data, which reduces management overhead. It also provides a single source of truth so that different projects don't show different values for the same data.

As this data is very critical, we follow the Type 2 slowly changing dimension approach, which is explained later in this document.

There are a lot of MDM tools available to manage master data more formally, but for moderate use cases you can store it in the database you are already using.

ETL or System (meta)data

This metadata is neither business metadata nor technical metadata; it is data that supports your ETL/ELT workflows and various other systems. Keeping it in tables helps reduce hard-coding within the ETL/ELT pipelines and other systems. Examples include external API endpoints, FTP/SFTP site information and their access credentials, and other reference data. When that information changes, you update the corresponding tables and do not need to modify the jobs and other places where it is referenced.

This data will not be used by any end users or reports.
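
As a minimal sketch of such a reference table, assuming BigQuery-style SQL (the dataset and table name etl_metadata.external_endpoints and every column here are illustrative, not a prescribed layout):

CREATE TABLE etl_metadata.external_endpoints (
  endpoint_name STRING NOT NULL,   -- logical name referenced by pipelines instead of hard-coded values
  endpoint_type STRING,            -- e.g. 'SFTP', 'API'
  host STRING,
  port INT64,
  credential_ref STRING,           -- credential, or a reference to where it is stored
  is_active BOOL,
  updated_at TIMESTAMP
);

When an endpoint or credential location changes, only the corresponding row is updated; the pipelines that read it stay untouched.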

Transactional data

This is the main project-specific data, holding all data related to the various projects, including historical data as required by the business retention requirements.

Different datasets will be created for different projects. Generally, this data, along with business master data, is consumed by end-user applications such as Tableau reports.

Metadata and Cataloging

This is an entirely different kind of metadata, and to avoid confusion it is handled separately. It is not stored in any tables and is not used by end users or by any systems/applications.

All tables need to carry the following labels so that they can be integrated with a metadata catalog tool, allowing users to search and view the data easily.
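
As a hedged illustration of how such labels can be attached in BigQuery (the table name transactional_ds.claims and the label keys and values are assumptions, not the actual label list):

ALTER TABLE transactional_ds.claims
SET OPTIONS (
  -- example labels a catalog tool could index; keys and values are illustrative
  labels = [("data-domain", "claims"), ("data-owner", "analytics"), ("environment", "prod")]
);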

3. Slowly changing dimensions

Handling slowly changing dimensions is key to any successful operational data lake or data warehouse implementation. It is a little harder to implement in traditional databases like Oracle and SQL Server, but because most modern databases like BigQuery allow complex types, handling slowly changing dimension types has become easier in modern Big Data databases.

3.1 Traditional data warehouse implementations of SCDs

In traditional data warehouse implementations, there are three types of slowly changing dimensions:

Type 1 SCDs - Overwriting

In a Type 1 SCD the new data overwrites the existing data. Thus, the existing data is lost as it is not stored anywhere else. This is the default type of dimension we create. We do not need to specify any additional information to create a Type 1 SCD.

Type 2 SCDs - Creating another dimension record

A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record. Each record contains the effective time and expiration time to identify the time period between which the record was active.

Type 3 SCDs - Creating a current value field

A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.

3.2 Big Data Implementation of SCDs

In modern data lake implementations, we can club Type 2 and Type 3 together into one approach, which we can call "retaining historical values".

So there are only two types of SCDs in Big Data implementations:

Overwriting

In this type, the new data overwrites the existing data. Thus, the existing data is lost as it is not stored anywhere else. This is the default type of dimension we create. We do not need to specify any additional information to create this type of dimension.

Retain historical values

In this type, we retain historical changes using the complex data types available in Big Data technologies. The history is stored in a semi-structured form, and we have full control over the extent to which historical changes are retained. JSON, MAP and STRUCT are some example storage options.

This gives full freedom over how much history to retain and which associated values to capture, such as when a value was changed and how it was changed. The retention can also differ from dimension to dimension.

How to handle Type 2 and Type 3 together

Type 2 and Type 3 (in the Big Data world, "retaining historical values") can be handled in BigQuery and similar technologies using complex data types. Wherever we need to retain historical data for a dimension, we include the following column in the dimension table and manage it through the data flow jobs.

SCD_History ARRAY<STRUCT<updatedOn DATETIME or TIMESTAMP, columnName STRING, oldValue STRING, newValue STRING>>

The above syntax is for BigQuery. You will need to find a similar data type for other databases such as Redshift, Snowflake, Couchbase and so on. For example, in Cassandra you can use the MAP data type, and in MongoDB you can use nested JSON. Where complex data types are not allowed, you can use STRING or TEXT types and store the history in JSON format.
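
As an illustration of the BigQuery case, a dimension table carrying this history column might be defined roughly as follows (the dataset name master_ds is an assumption; the columns follow the Supplier example below):

CREATE TABLE master_ds.supplier_dim (
  Supplier_Key   INT64 NOT NULL,
  Supplier_Code  STRING,
  Supplier_Name  STRING,
  Supplier_State STRING,
  -- one entry per historical change to a tracked column
  SCD_History ARRAY<STRUCT<
    updatedOn  TIMESTAMP,
    columnName STRING,
    oldValue   STRING,
    newValue   STRING
  >>
);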

For example, a row in a Supplier dimension that tracks a change of state:

Supplier_Key:   123
Supplier_Code:  ABC
Supplier_Name:  Acme Supply Co
Supplier_State: IL
SCD_History:    [{updatedOn: 22-Dec-2004, columnName: Supplier_State, oldValue: CA, newValue: IL}]
You can use the UNNEST operator available in BigQuery SQL to flatten the history column when you need to report on or search values within this complex column type.
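
A minimal sketch of such a query, assuming the supplier_dim table outlined above:

SELECT
  s.Supplier_Code,
  s.Supplier_Name,
  h.updatedOn,
  h.oldValue,
  h.newValue
FROM master_ds.supplier_dim AS s,
     UNNEST(s.SCD_History) AS h     -- one output row per history entry
WHERE h.columnName = 'Supplier_State';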

This also gives users the flexibility to choose between the current value and the value at a given point in time.

It is important to note that this method is intended for slowly changing dimensions, meaning the changes (or updates) happen very infrequently over time.

The following are some key points to keep in mind while adding SCD-specific columns:
  • Whether the old or historical values are used by end-user applications or only for auditing.
  • Which columns we need to track for changes over time.

4. Data modeling techniques

The following data modeling techniques can be used to implement various use cases within your data lake (in the presentation layer, not the raw layer). Which technique to use is decided by requirements such as how the data is accessed by end users and how the data gets into the database, for example BigQuery.

Different use cases, projects or modules within our projects can follow one or more of the below modeling techniques to satisfy the requirements.

4.1 Denormalized or flattened models

In this modeling technique, we denormalize and flatten the fact with its dimensional data. Choose this approach carefully: it works well for very simple use cases but does not satisfy most complex data warehouse needs.
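
As a hedged sketch of what such a model looks like (an orders example with purely hypothetical names), the dimension attributes are folded straight into the fact table:

CREATE TABLE sales_ds.orders_flat (
  order_id       STRING NOT NULL,
  order_date     DATE,
  quantity       INT64,
  order_amount   NUMERIC,
  -- customer dimension attributes flattened into the fact row
  customer_id    STRING,
  customer_name  STRING,
  customer_state STRING,
  -- product dimension attributes flattened into the fact row
  product_code     STRING,
  product_name     STRING,
  product_category STRING
);

A report then reads everything it needs from this one table, with no joins.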

Advantages of using flattened denormalized model

Simple to query, which avoids joins.

As all the data is stored in one table, it is easy to query and there is no need to join many tables.

Performs well with the latest Big Data technologies, including BigQuery.

Most Big Data technologies, which are distributed in nature, prefer this kind of data model and perform very well with it.

Disadvantages of using flattened denormalized model

Here are some of the reasons why this technique does not work for most use cases:

Where dimensional data changes

When the dimensional data changes, we need to update millions of rows within a big denormalized table, which is very time consuming. We cannot use this technique when dimensional data changes constantly.

Where there are too many attributes

Some dimensions are simple, like a State dimension with just a state code and name, but most dimensions have many attributes. A Patient dimension, for example, has attributes such as id, name, date of birth, address, insurance and so on. It is difficult to tuck all of that information into a single denormalized, flattened table.

Dimensional data is nested and has complex hierarchies.

Sometimes dimensional data is nested; for example, a Region dimension may have regions, countries within regions, states within countries and cities within states. It is very difficult to maintain all of that data in one single wide table.

Where we need to keep track of changes

Most of the time we need to handle historical changes to dimension data, which is difficult to do if we go with the denormalized, flattened approach.

Users need to see dimensional values in drop-down lists

Most of the time, when users use front-end applications, including reporting tools like Tableau, they need to see dimensional values in drop-downs to filter and/or slice and dice the data. Getting all distinct values for a dimension from millions or billions of rows in a flattened table is a heavy process.

4.2 Star Schema (modified)

BigQuery, Snowflake and some other data warehouse databases allow joins among tables, so a modified star schema approach may be a good data modeling choice for data lake implementations. Where joins are not allowed, for example in Cassandra, this model may require multiple queries to achieve the same result.

Here are some changes we can make to the traditional star schema approach.

Club together Type 2 and Type 3 slowly changing dimensions when we need to track historical changes. This is explained above under slowly changing dimensions.

Club together similar dimensions into one big denormalized dimension table.

For example, instead of keeping state, country and region dimensions separately, one dimension table, which can be called "Geography" or "Location" and holds region, country and state information together, will perform better and is a good approach.
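
A rough sketch of such a consolidated dimension (all names here are illustrative):

CREATE TABLE master_ds.geography_dim (
  geography_key INT64 NOT NULL,
  region_code   STRING,
  region_name   STRING,
  country_code  STRING,
  country_name  STRING,
  state_code    STRING,
  state_name    STRING
);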

Grouping together multiple facts into one big fact table

Similarly, we can group multiple fact tables into a single table or a smaller number of denormalized fact tables, so that users don't need to reference multiple fact tables in their queries and can instead use a single flattened fact table, which speeds up query performance.

The whole idea is to use a denormalized/flattened schema model as much as possible, which reduces the number of tables and thus the joins within queries. At the same time, the modified star schema approach addresses the shortcomings mentioned earlier with the fully denormalized, flattened schema approach.

Other data modeling techniques that might suit certain operational data lake needs can be considered based on the use case and requirements.

4.3 Hybrid Star/Denormalized

In this case we use denormalized schema models that tie facts and certain dimensions together. Dimensions that are simple and will never change go along with the fact table, while complex or constantly changing dimensions are kept separate.

Sometimes the part of a dimension that is used by end-user applications or reports becomes part of the fact table, while the other parts of the dimension stay in separate tables.

4.4 Nested dimensions within fact

You can even keep nested dimensions that will not change over time as part of the fact table, using complex data types such as ARRAY<STRUCT<...>> in BigQuery, JSON in MongoDB, MAP in Cassandra, and similar columns in other NoSQL databases.
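
For instance, in BigQuery a stable nested dimension could sit inside the fact table as an ARRAY<STRUCT<...>> column; the claim and diagnosis names below are assumptions for illustration only:

CREATE TABLE claims_ds.claim_fact (
  claim_id      STRING NOT NULL,
  claim_date    DATE,
  billed_amount NUMERIC,
  -- nested, rarely changing dimension stored alongside the fact
  diagnosis_codes ARRAY<STRUCT<
    code        STRING,
    description STRING
  >>
);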

4.5 Pattern (Query) specific data model

This is another data modeling technique, used with NoSQL technologies like Cassandra, HBase, Dynamo and so on, where joins among tables are not allowed.

The table is designed based on query usage patterns. This model serves data very fast but is complex to maintain, and it introduces a lot of redundancy.

"About Author"

The author has extensive experience in Big Data technologies and has worked in the IT industry for over 25 years in various capacities after completing his BS and MS in computer science and data science, respectively. He is a certified cloud architect and holds several certifications from Microsoft and Google. Please contact him at srao@unifieddatascience.com with any questions.