When building a scalable, high-performing data lake, whether in the cloud or on-premises, two broad groups of tools and processes play a critical role. The first group is directly involved in building data pipelines and storing the data. The second group, which includes data governance and data operations, is not directly involved in data lake design and development but is critical to the success of any data lake implementation.
The diagrams above show how different Amazon managed services can be used and integrated to build a full-blown, scalable data lake. You may add or remove certain tools based on your use cases, but a data lake implementation generally revolves around these concepts.
Here is a brief description of each component in the diagrams above.
1. Data Sources
Data can come from multiple disparate sources, and the data lake should be able to handle all of the incoming data. The following are some of those sources:
• OLTP systems such as Oracle, SQL Server, MySQL, or any other RDBMS.
• Various file formats such as CSV, JSON, AVRO, XML, binary, and so on.
• Text-based and IoT streaming data
2. Data Ingestion
Collecting and processing incoming data from various sources is a critical part of any successful data lake implementation. It is also the most time-consuming and resource-intensive step.
AWS offers several highly scalable managed services for developing and implementing complicated data pipelines of any scale.
AWS Data Pipeline
AWS Data Pipeline is a fully managed Amazon service for building data-driven workflows that move and transform data on a schedule. It scales horizontally and is tightly integrated with other big data components such as Amazon Redshift, Amazon DynamoDB, Amazon S3, and Amazon EMR.
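As a rough sketch of what this looks like in code, the boto3 calls below create and activate an on-demand pipeline with a single shell-command activity. The pipeline name, S3 paths, and worker group are hypothetical placeholders, not values from a real deployment.

```python
import boto3

# Minimal sketch: create and activate an on-demand pipeline (hypothetical names).
dp = boto3.client("datapipeline")

pipeline = dp.create_pipeline(name="demo-pipeline", uniqueId="demo-pipeline-001")
pipeline_id = pipeline["pipelineId"]

# A default object plus one shell-command activity that copies raw files.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default",
         "fields": [{"key": "scheduleType", "stringValue": "ondemand"}]},
        {"id": "CopyStep", "name": "CopyStep",
         "fields": [
             {"key": "type", "stringValue": "ShellCommandActivity"},
             {"key": "command",
              "stringValue": "aws s3 cp s3://src-bucket/ s3://dest-bucket/ --recursive"},
             {"key": "workerGroup", "stringValue": "default-workers"},
         ]},
    ],
)
dp.activate_pipeline(pipelineId=pipeline_id)
```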
AWS Glue
AWS Glue is a fully managed ETL service that enables engineers to build data pipelines for analytics quickly from its management console. You can build data pipelines using its graphical user interface (GUI) with a few clicks. It automatically discovers your data and catalogs it using the AWS Glue Data Catalog.
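For instance, a minimal Glue job script might look like the sketch below. The database name, table name, and output path are hypothetical placeholders, assuming a crawler has already registered the source table in the Data Catalog.

```python
# Minimal Glue ETL job sketch (PySpark).
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table that a Glue crawler registered in the Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"
)

# Write the data back to S3 as Parquet for downstream analytics.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/orders/"},
    format="parquet",
)
```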
AWS EMR
AWS EMR is a managed Amazon cloud service for the Hadoop/Spark ecosystem. You can use AWS EMR for various purposes:
• To build data pipelines using Spark, especially when you already have a lot of Spark code to carry over from an on-premises environment.
• To lift and shift an existing Hadoop environment from on-premises to the cloud.
• To use Hive and HBase databases as part of your use cases.
• To build machine learning and AI pipelines using Spark.
AWS EMR clusters can be created on demand and can also be auto-scaled depending on need. You can also use Spot Instances where you don't need production-scale SLAs; they cost a lot less compared to regular On-Demand Instances.
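The boto3 sketch below launches a small Spark cluster with Spot core nodes. The cluster name, instance types, counts, and release label are hypothetical; the two IAM roles are the EMR defaults, assuming they exist in your account.

```python
import boto3

# Minimal sketch: launch a small Spark cluster with Spot core nodes.
emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="demo-spark-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Spot capacity on the core nodes keeps costs down for
            # workloads without production-scale SLAs.
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2,
             "Market": "SPOT"},
        ],
        # Keep the cluster up after launch for interactive use.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```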
3. Raw data layer
Object storage is central to any data lake implementation, and AWS S3 serves as the raw layer. You can build a highly scalable, highly available raw layer on AWS S3, which is designed for very high durability and availability.
It also comes with various storage classes, such as S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 One Zone-IA, and S3 Glacier, which address different use cases and SLAs.
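A common pattern is to age raw data into cheaper storage classes automatically with a lifecycle policy. The sketch below is a minimal example; the bucket name, prefix, and day thresholds are hypothetical.

```python
import boto3

# Minimal sketch: transition aging raw data to cheaper storage classes.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    # Infrequent-access tier after 30 days.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # Archive to Glacier after 90 days.
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```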
4. Consumption layer
All of the components mentioned so far are internal to the data lake and are not exposed to external users. The consumption layer is where you store curated and processed data for end-user consumption. End-user applications can be reports, web applications, data extracts, or APIs.
The following are some of the criteria to consider when choosing a database for the consumption layer:
• The data retrieval patterns: whether applications run analytical queries with aggregations and computations, or simply retrieve rows based on filters.
• How data ingestion happens: whether it arrives in large batches or as high-throughput writes (IoT or streaming), and so on.
• Whether the data is structured, semi-structured, quasi-structured, or unstructured.
Amazon DynamoDB
Amazon DynamoDB is a fully managed, distributed NoSQL database for applications that need consistent, single-digit-millisecond latency at any scale. It supports key-value and document data models, offers a flexible schema, and can be used for web, e-commerce, streaming, gaming, and IoT use cases.
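The boto3 sketch below shows the basic write-and-read pattern against a hypothetical "orders" table whose partition key is "order_id"; the table and its attributes are placeholders.

```python
import boto3

# Minimal sketch: single-item writes and key lookups.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")

# Write an item; the flexible schema lets each item carry its own attributes.
table.put_item(Item={"order_id": "1001", "status": "SHIPPED", "total": 42})

# Read it back with a consistent, low-latency key lookup.
item = table.get_item(Key={"order_id": "1001"})["Item"]
print(item["status"])
```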
Amazon Redshift
Amazon Redshift is a fast, fully managed analytical data warehouse service that scales to petabytes of data. It provides a standard SQL interface that lets organizations use their existing business intelligence and reporting tools. Redshift is a columnar database distributed over multiple nodes, which allows queries to be processed in parallel across those nodes.
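One way to run such analytical queries programmatically is the Redshift Data API, sketched below; the cluster identifier, database, user, and table names are hypothetical.

```python
import boto3

# Minimal sketch: run an aggregation via the Redshift Data API,
# without managing a JDBC/ODBC connection.
rsd = boto3.client("redshift-data")

resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="warehouse",
    DbUser="report_user",
    Sql="SELECT region, SUM(total) FROM orders GROUP BY region;",
)
# Statement runs asynchronously; poll describe_statement with this id.
print(resp["Id"])
```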
Amazon DocumentDB
Amazon DocumentDB is a fully managed document-oriented database service that supports JSON data workloads and is MongoDB compatible. It is fast, highly available, and scales to huge amounts of data.
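Because DocumentDB is MongoDB compatible, standard MongoDB drivers work against it. The sketch below uses pymongo; the endpoint, credentials, and collection names are hypothetical placeholders.

```python
from pymongo import MongoClient

# Minimal sketch: connect with a standard MongoDB driver.
# retryWrites=false follows the DocumentDB connection guidance.
client = MongoClient(
    "mongodb://app_user:app_password@my-docdb-cluster.cluster-xxxx"
    ".us-east-1.docdb.amazonaws.com:27017/?tls=true&retryWrites=false"
)

orders = client["shop"]["orders"]
orders.insert_one({"order_id": "1001", "items": [{"sku": "A1", "qty": 2}]})
print(orders.find_one({"order_id": "1001"}))
```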
5. Machine Learning and Data Science
Machine Learning and Data Science teams are among the biggest consumers of data lake data. They use it to train their models, produce forecasts, and apply the trained models to new data.
AWS Glue
Please refer to the description above.
AWS EMR
Please refer to the description above.
Amazon SageMaker
Amazon SageMaker can be used to quickly build, train, and deploy machine learning models at scale, or to build custom models with support for all the popular open-source frameworks.
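The SageMaker Python SDK sketch below trains a model against curated data in the lake and deploys it as a real-time endpoint. The container image URI, IAM role, and S3 paths are hypothetical placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator

# Minimal sketch: train with a custom container, then deploy.
session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-data-lake/models/",
    sagemaker_session=session,
)

# Train against curated data in the lake, then host a real-time endpoint.
estimator.fit({"train": "s3://my-data-lake/curated/training/"})
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```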
Amazon ML and AI
Amazon has a huge set of robust and scalable artificial intelligence and machine learning tools. It also provides pre-trained AI services for computer vision, language, recommendations, and forecasting.
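As an example of a pre-trained language service, the sketch below calls Amazon Comprehend for sentiment analysis; no model training or hosting is required, and the input text is just an illustration.

```python
import boto3

# Minimal sketch: sentiment analysis with a pre-trained AI service.
comprehend = boto3.client("comprehend")

resp = comprehend.detect_sentiment(
    Text="The new dashboard loads quickly and the data is accurate.",
    LanguageCode="en",
)
print(resp["Sentiment"], resp["SentimentScore"])
```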
a. Data Governance
Data governance in the cloud is a vast subject. It involves many things, such as security and IAM, data cataloging, data discovery, data lineage, and auditing.
AWS Glue Data Catalog
The AWS Glue Data Catalog is a fully managed metadata repository that integrates with other components such as data pipelines and Amazon S3. With it, you can quickly discover, understand, and manage the data stored in your data lake.
You can view my blog for detailed information on the data catalog.
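For a feel of how the catalog is queried programmatically, the boto3 sketch below lists the tables registered under a hypothetical database along with their S3 locations.

```python
import boto3

# Minimal sketch: list catalog tables and where their data lives in S3.
glue = boto3.client("glue")

resp = glue.get_tables(DatabaseName="raw_db")
for table in resp["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```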
Amazon CloudSearch
Amazon CloudSearch is an enterprise search service that lets you quickly, easily, and securely find information.
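CloudSearch queries are sent to a domain-specific endpoint, as in the minimal sketch below; the endpoint URL and query are hypothetical placeholders.

```python
import boto3

# Minimal sketch: run a simple search against a CloudSearch domain.
search_client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-demo-domain-xxxx.us-east-1.cloudsearch.amazonaws.com",
)

resp = search_client.search(query="customer churn report", queryParser="simple", size=5)
for hit in resp["hits"]["hit"]:
    print(hit["id"])
```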
AWS IAM
Please refer to my data governance blog for more details.
AWS KMS
AWS KMS is a hosted key management service that lets us manage encryption keys in the cloud. We can create, rotate, use, and destroy AES-256 encryption keys just as we would in our on-premises environments. We can also use the KMS API to encrypt and decrypt data.
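The boto3 sketch below encrypts and decrypts a small payload with a KMS key; the key alias and plaintext are hypothetical.

```python
import boto3

# Minimal sketch: envelope-free encryption of a small payload with KMS.
kms = boto3.client("kms")

ciphertext = kms.encrypt(
    KeyId="alias/data-lake-key",
    Plaintext=b"database-password",
)["CiphertextBlob"]

# KMS resolves the key from metadata embedded in the ciphertext.
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
print(plaintext)  # b'database-password'
```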
b. Data Operations
Operations, monitoring, and support are a key part of any data lake implementation. AWS provides various tools to accomplish this.
AWS CloudTrail
AWS CloudTrail records API calls and account activity across your AWS services, giving you an audit history of who did what, where, and when within your AWS resources.
Please refer to my cloud operations blog for full details.
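The boto3 sketch below pulls recent console sign-in events from CloudTrail as a small auditing example; the event name is one of CloudTrail's standard events, and the result limit is arbitrary.

```python
import boto3

# Minimal audit sketch: who signed in to the console recently?
cloudtrail = boto3.client("cloudtrail")

resp = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "ConsoleLogin"}],
    MaxResults=10,
)
for event in resp["Events"]:
    print(event["EventTime"], event.get("Username"), event["EventName"])
```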
AWS CloudWatch
AWS CloudWatch collects metrics, logs, and events from your AWS resources and applications. You can centralize log data with CloudWatch Logs, set alarms on key metrics, and build dashboards to monitor the health of your data lake pipelines.
Please refer to my blog for more details.
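To close with a concrete example, the sketch below publishes a custom pipeline metric and raises an alarm when failures exceed a threshold; the namespace, metric name, and threshold are hypothetical.

```python
import boto3

# Minimal monitoring sketch: custom metric plus an alarm on it.
cloudwatch = boto3.client("cloudwatch")

# Emit a custom metric from a data pipeline run.
cloudwatch.put_metric_data(
    Namespace="DataLake/Pipelines",
    MetricData=[{"MetricName": "FailedRecords", "Value": 3, "Unit": "Count"}],
)

# Alarm when failed records exceed the threshold over five minutes.
cloudwatch.put_metric_alarm(
    AlarmName="pipeline-failed-records",
    Namespace="DataLake/Pipelines",
    MetricName="FailedRecords",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
)
```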