When building a scalable, high-performing data lake, whether in the cloud or on-premises, two broad groups of tools and processes play a critical role. The first group is used to build data pipelines and store the data. The second group is not directly involved in data lake design and development, but it is critical to the success of any data lake implementation; it covers areas such as data governance and data operations.
The above diagrams show how different Azure managed services can be used and integrated to build a full-blown, scalable data lake. You may add or remove certain tools depending on your use cases, but a data lake implementation mainly revolves around these concepts.
Here is a brief description of each component in the above diagrams.
1. Data Sources
Data can come from multiple disparate sources, and the data lake should be able to handle all of it. The following are some of the sources:
• OLTP systems like Oracle, SQL Server, MySQL or any RDBMS.
• Various file formats like CSV, JSON, Avro, XML, binary and so on.
• Text-based and IoT streaming data.
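As a rough illustration of handling disparate sources, the sketch below normalizes CSV and JSON payloads into one common record shape before they move further down the pipeline. It uses only the Python standard library. The function names (parse_csv, parse_json, normalize), the source labels and the sample fields are hypothetical, chosen for this example, and are not part of any Azure API.

```python
import csv
import io
import json


def parse_csv(text):
    """Parse CSV text with a header row into a list of dict records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def parse_json(text):
    """Parse a JSON document holding one object or a list of objects."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


# Hypothetical registry: format name -> parser function.
PARSERS = {"csv": parse_csv, "json": parse_json}


def normalize(payload, fmt, source):
    """Return parsed records, each tagged with the source it came from."""
    try:
        parser = PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return [{"source": source, "data": record} for record in parser(payload)]


# Two made-up payloads from two made-up sources.
csv_records = normalize("id,name\n1,pump\n2,valve\n", "csv", "erp_export")
json_records = normalize('{"id": 3, "name": "sensor"}', "json", "iot_feed")
```

Note that CSV values arrive as strings while JSON values keep their types, so a real pipeline would typically add a schema or type-casting step after this one.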
2. Data Ingestion
Collecting and processing incoming data from various data sources is a critical part of any successful data lake implementation. It is usually the most time-consuming and resource-intensive step.
Azure offers several highly scalable managed services for developing and running complex data pipelines at any scale.
Azure Data Factory
Azure Data Factory is Microsoft's fully managed data integration service for building and orchestrating data pipelines. It scales out horizontally and integrates tightly with other big data components such as SQL Data Warehouse, Azure Cosmos DB, Azure Table Storage and Event Hubs.
Azure HDInsight
Azure HDInsight is a managed Azure service for the Hadoop/Spark ecosystem. You can use HDInsight for various purposes:
• To build data pipelines using Spark, especially when you already have a lot of Spark code to migrate from on-premises.
• To lift and shift an existing Hadoop environment from on-premises to the cloud.
• To use Hive and HBase databases as part of your use cases.
• To build machine learning and AI pipelines using Spark.
HDInsight clusters can be created on demand and autoscaled as needed. Where you don't need production-scale SLAs, you can also use preemptible (spot) VMs, which cost far less than regular instances.
3. Raw data layer
Object storage is central to any data lake implementation. Azure Blob Storage and Azure Data Lake Storage serve as the raw layer. With Azure Blob Storage, which also provides very high SLAs, you can build a highly scalable and highly available raw layer.
Azure Storage supports three types of blobs: block, append and page blobs.
https://docs.microsoft.com/en-us/azure/storage/blobs/
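The three blob types suit different access patterns: block blobs hold discrete files, append blobs are optimized for append-only data such as logs, and page blobs support random reads and writes (for example, virtual machine disks). The sketch below turns that rule of thumb into a small lookup. The workload labels ("file", "log", "disk") and the choose_blob_type helper are hypothetical names made up for this example; this is not an Azure SDK call.

```python
from enum import Enum


class BlobType(Enum):
    BLOCK = "block"
    APPEND = "append"
    PAGE = "page"


# Hypothetical workload labels mapped to the blob type that usually fits them.
WORKLOAD_TO_BLOB = {
    "file": BlobType.BLOCK,   # discrete files such as CSV, JSON or Avro
    "log": BlobType.APPEND,   # data that only ever grows at the end
    "disk": BlobType.PAGE,    # random read/write access, e.g. VM disks
}


def choose_blob_type(workload):
    """Return the suggested blob type; unknown labels fall back to block blobs."""
    return WORKLOAD_TO_BLOB.get(workload, BlobType.BLOCK)


raw_layer_choice = choose_blob_type("file")
audit_log_choice = choose_blob_type("log")
```

For a data lake raw layer that mostly lands files from source systems, block blobs are the usual fit.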
4. Consumption layer
All the items mentioned so far are internal to the data lake and are not exposed to external users. The consumption layer is where you store curated and processed data for end-user consumption. The end-user applications can be reports, web applications, data extracts or APIs.
The following are some of the criteria for choosing a database for the consumption layer:
• The data retrieval pattern: whether applications run analytical queries with aggregations and computations, or simply retrieve data based on some filtering.
• How data ingestion happens: in large batches, or as high-throughput writes (IoT or streaming), and so on.
• Whether the data is structured, semi-structured, quasi-structured or unstructured.
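To make the criteria concrete, here is a deliberately simplistic sketch that encodes the first two of them (retrieval pattern and ingestion style) as a rule of thumb over the three stores described in this section. The Workload class, the suggest_store function and the rules themselves are assumptions made for illustration, not official Azure guidance; data structure, cost, latency and consistency requirements are left out to keep the example short and would all matter in a real decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    analytical_queries: bool  # aggregations/computations vs. filtered lookups
    streaming_writes: bool    # high-throughput IoT/streaming vs. large batches


def suggest_store(workload):
    """Suggest a consumption-layer store using assumed, simplified rules."""
    if workload.analytical_queries:
        # Heavy aggregations and computations point to the data warehouse.
        return "Azure Synapse Analytics"
    if workload.streaming_writes:
        # Filtered lookups combined with high-throughput writes.
        return "Azure Cosmos DB"
    # Filtered lookups over data loaded in batches.
    return "Azure Table Storage"


reporting = suggest_store(Workload(analytical_queries=True, streaming_writes=False))
telemetry = suggest_store(Workload(analytical_queries=False, streaming_writes=True))
```

Treat the result as a starting point for discussion rather than a final answer.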
Azure SQL Data Warehouse (Synapse Analytics)
Azure SQL Data Warehouse, since renamed Azure Synapse Analytics, is a managed analytical service that brings together enterprise data warehousing and big data analytics. It is an important part of the Azure data lake stack: a fully managed data warehouse that can scale to petabytes of data. It is very good for analytical queries that use a lot of aggregations and computations.
Azure Cosmos DB
Azure Cosmos DB is a managed NoSQL database on the Azure cloud that provides low latency, high availability and scalability. It allows you to migrate MongoDB, Cassandra and other NoSQL workloads to the cloud.
Azure Table Storage
Azure Table Storage is a NoSQL key-attribute store that scales to petabytes of data and provides a schemaless, semi-structured data model. It is also a fully managed Azure service.
5. Machine Learning and Data Science
Machine learning and data science teams are among the biggest consumers of data lake data. They use this data to train their models, and then apply the trained models to new data to produce forecasts.
Azure Data Explorer
Azure Data Explorer is a fully managed Azure data analytics service for real-time analysis on large volumes of data streaming from applications, websites, IoT devices, and more. Ask questions and iteratively explore data on the fly to improve products, enhance customer experiences, monitor devices, and boost operations. Quickly identify patterns, anomalies, and trends in your data. Explore new questions and get answers in minutes. Run as many queries as you need, thanks to the optimized cost structure.
Azure HDInsight
Please see the Azure HDInsight description under Data Ingestion above.
Azure Machine Learning
Azure Machine Learning is a comprehensive ML/AI service in which you can build and deploy machine learning models quickly and easily. It also has very good support for open-source frameworks and languages, including MLflow, Kubeflow, ONNX, PyTorch, TensorFlow, Python, and R.
a. Data Governance
Data governance in the cloud is a vast subject. It involves many things, such as security and IAM, data cataloging, data discovery, data lineage and auditing.
Azure Data Catalog
Azure Data Catalog is a fully managed metadata management service that integrates with other components such as Azure Blob Storage, SQL Data Warehouse and Azure Table Storage. With it, you can quickly discover, understand and manage the data stored in your data lake.
You can view my blog for detailed information on data catalog.
Azure Search
Azure Search is an enterprise search service that allows you to find information quickly, easily and securely.
Azure Active Directory
Please refer to my data governance blog for more details.
Azure Key Vault
Azure Key Vault is a hosted key management service (KMS) that lets you manage encryption keys in the cloud. You can use it to encrypt keys and secrets with keys that are protected by hardware security modules (HSMs).
b. Data Operations
Operations, monitoring and support are a key part of any data lake implementation. Azure provides various tools for this.
Azure Monitor
Azure offers Azure Monitor, a comprehensive set of services for collecting data on the state of applications and infrastructure. Specifically, it supports three ways of collecting and receiving information.
Please refer to my cloud operations blog for full details.