When building a scalable, high-performing data lake in the cloud or on-premises, two broad groups of tools and processes play a critical role. The first group is involved in building data pipelines and storing the data. The second group is not directly involved in data lake design and development, but it is just as critical to the success of any data lake implementation; it covers areas such as data governance and data operations.
The above diagrams show how different Google managed services can be used and integrated to build a full-blown, scalable data lake. You may add or remove certain tools based on your use cases, but a data lake implementation largely revolves around these concepts.
Here is a brief description of each component in the above diagrams.
1. Data Sources
Data can come from multiple disparate sources, and the data lake should be able to handle all of the incoming data. The following are some of those sources:
• OLTP systems like Oracle, SQL Server, MySQL or any other RDBMS.
• Various file formats like CSV, JSON, Avro, XML, binary and so on.
• Text-based and IoT streaming data
2. Data Ingestion
Collecting and processing the incoming data from various data sources is a critical part of any successful data lake implementation. It is usually the most time-consuming and resource-intensive step.
GCP has various highly scalable managed services for developing and implementing complicated data pipelines of any scale.
Cloud Dataflow
Cloud Dataflow is a fully managed Google service for building unified batch and streaming data pipelines. It is built on the open source Apache Beam project. It provides horizontal scaling and is tightly integrated with other big data components like BigQuery, Pub/Sub and Stackdriver.
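As an illustration, here is a minimal Apache Beam pipeline in Python. The project, bucket and table names are placeholders, and the schema is an assumption for the sketch; the same pipeline runs locally with the DirectRunner or on Dataflow by switching the runner option.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, bucket and table names for illustration.
options = PipelineOptions(
    runner="DataflowRunner",          # switch to "DirectRunner" to test locally
    project="my-project",
    region="us-central1",
    temp_location="gs://my-datalake-raw/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadRawCsv" >> beam.io.ReadFromText("gs://my-datalake-raw/orders/orders-*.csv")
        | "ParseLine" >> beam.Map(lambda line: dict(zip(["order_id", "amount"], line.split(","))))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:sales.orders",
            schema="order_id:STRING,amount:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```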
Cloud Dataproc
Cloud Dataproc is a managed Google service for the Hadoop/Spark ecosystem. You can use Dataproc for various purposes:
• To build data pipelines using Spark, especially when you have a lot of code already written in Spark and are migrating from on-premises.
• To lift and shift an existing Hadoop environment from on-premises to the cloud.
• If you want to use Hive and HBase databases as part of your use cases.
• To build machine learning and AI pipelines using Spark.
Dataproc clusters can be created on demand and can also be autoscaled depending on the need. You can also use preemptible VMs where you don't need production-scale SLAs; they cost a lot less compared to regular instances.
Cloud Composer
Cloud Composer is a fully managed workflow engine built on top of the open source Apache Airflow project. You can call Dataflow and Dataproc Spark jobs from Composer, chain them into a controlled workflow and schedule them to run.
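As a rough sketch of what a Composer workflow looks like, the following Airflow DAG (Airflow 2.x import paths assumed) chains two placeholder tasks; in a real pipeline these bash commands would be replaced with the Dataflow and Dataproc operators. The DAG id, schedule and commands are illustrative assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative workflow: the bash commands stand in for real Dataflow and
# Dataproc job submissions.
with DAG(
    dag_id="daily_lake_refresh",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw",
        bash_command="echo 'launch Dataflow ingestion job'",
    )
    transform = BashOperator(
        task_id="transform_curated",
        bash_command="echo 'submit Dataproc Spark job'",
    )

    ingest >> transform  # run the transformation only after ingestion succeeds
```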
Cloud Data Fusion
Cloud Data Fusion is a newer entry in the GCP ETL/ELT stack. It is a fully managed, cloud-native data pipeline tool that lets developers build high-performing data pipelines graphically. It also comes with a large open source library of preconfigured connectors for most common data sources. It is essentially a code-free data pipeline model, so you save a lot of time developing complex transformations and data ingestion.
3. Raw data layer
Object storage is central to any data lake implementation, and Cloud Storage serves as the raw layer. You can build a highly scalable and highly available data lake raw layer using Cloud Storage, which also provides very high SLAs.
There are also different Cloud Storage classes, such as multi-regional, regional, nearline and coldline, which can be used for different purposes based on your requirements.
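For example, landing a raw file into the Cloud Storage raw layer from Python takes only a few lines with the google-cloud-storage client library; the bucket and object names below are placeholders.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-datalake-raw")          # placeholder bucket name

# Land a raw file as-is under a date-partitioned prefix.
blob = bucket.blob("sales/2024/01/15/orders.csv")
blob.upload_from_filename("orders.csv")

print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```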
4. Consumption layer
All the components mentioned so far are internal to the data lake and are not exposed to external users. The consumption layer is where you store curated and processed data for end-user consumption. The end-user applications can be reports, web applications, data extracts or APIs.
The following are some of the criteria for choosing a database for the consumption layer:
• Volume of the data.
• The data retrieval patterns, for example whether applications run analytical queries with aggregations and computations, or simply retrieve rows based on some filtering.
• How data ingestion happens, whether in large batches or as high-throughput writes (IoT or streaming) and so on.
• Whether the data is structured, semi-structured, quasi-structured or unstructured.
• SLAs.
Cloud Bigtable
Bigtable is a petabyte-scale managed NoSQL database service that is good for both analytical and operational workloads. It is very well suited to high-throughput reads/writes and low-latency (sub-10 ms) database needs. Some of the use cases for Bigtable are streaming, time series data, ad tech, fintech and so on. Bigtable provides eventual consistency across replicated clusters and atomic single-row transactions. It is a little more expensive compared to BigQuery and Datastore.
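To give a feel for the data model, here is a minimal sketch using the google-cloud-bigtable client; the instance, table, column family and row key design are assumptions for illustration.

```python
from google.cloud import bigtable

# Placeholder instance and table; the table and its "metrics" column family
# are assumed to already exist.
client = bigtable.Client(project="my-project")
instance = client.instance("datalake-instance")
table = instance.table("device_events")

# Write one cell; row keys are usually designed around access patterns,
# e.g. device id plus timestamp for time series data.
row = table.direct_row(b"device-42#20240115T1200")
row.set_cell("metrics", b"temperature", b"21.5")
row.commit()

# Point read of the same row.
result = table.read_row(b"device-42#20240115T1200")
print(result.cells["metrics"][b"temperature"][0].value)
```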
BigQuery
BigQuery is an important part of the Google data lake stack. It is a serverless, fully managed data warehouse that can scale to petabytes of data. BigQuery is very good for analytical queries that use a lot of aggregations and computations.
BigQuery also comes with ML, GIS and BI Engine features.
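Running an analytical query from Python is straightforward with the google-cloud-bigquery client; the project, dataset and column names below are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative aggregation over a curated table in the consumption layer.
query = """
    SELECT customer_id, SUM(amount) AS total_spend
    FROM `my-project.sales.orders`
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
"""

# client.query() returns a job; iterating over it waits for and yields rows.
for row in client.query(query):
    print(row.customer_id, row.total_spend)
```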
Cloud Datastore (Firestore)
Cloud Datastore is a highly scalable NoSQL document database that is ACID compliant and also supports indexing. It is a schemaless document store, useful for NoSQL document database use cases that need transactions.
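A minimal sketch with the google-cloud-datastore client follows; the kind and property names are made up for illustration.

```python
from google.cloud import datastore

client = datastore.Client()

# Create an entity of an illustrative "Customer" kind.
key = client.key("Customer", "cust-1001")
entity = datastore.Entity(key=key)
entity.update({"name": "Acme Corp", "tier": "gold", "active": True})
client.put(entity)

# Read it back by key.
fetched = client.get(key)
print(fetched["name"], fetched["tier"])
```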
5. Machine Learning and Data Science
Machine learning and data science teams are among the biggest consumers of data lake data. They use this data to train their models, produce forecasts and apply the trained models to new data.
Cloud Dataprep
Cloud Dataprep is a managed service that helps you clean, explore and prepare data for data analytics and machine learning. It comes with a visual graphical user interface.
Cloud Dataproc
Spark ML, which is part of Cloud Dataproc, can be used to build machine learning pipelines and to build and train models.
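As a sketch of a Spark ML pipeline you might submit to a Dataproc cluster, the following trains a simple regression model; the input path, feature columns and label column are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("demand-forecast").getOrCreate()

# Illustrative training data read from the curated zone of the lake.
df = spark.read.parquet("gs://my-datalake-curated/sales_features/")

assembler = VectorAssembler(
    inputCols=["price", "promo_flag", "day_of_week"],  # assumed feature columns
    outputCol="features",
)
lr = LinearRegression(featuresCol="features", labelCol="units_sold")

# Fit the two-stage pipeline and persist the trained model back to the lake.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.write().overwrite().save("gs://my-datalake-models/demand_lr")
```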
Cloud AutoML
Cloud AutoML provides a suite of machine learning tools that let you build and train the high-quality machine learning models the business requires. Some of the AutoML tools include Vision, Video Intelligence, Natural Language, Translation and Tables.
AI Hub
Google provides various plug-and-play AI components through AI Hub.
a. Data Governance
Data governance on the cloud is a vast subject. It involves many things, such as security and IAM, data cataloging, data discovery, data lineage and auditing.
Data Catalog
Data Catalog is a fully managed metadata management service that integrates with other components such as Cloud Storage, BigQuery and Pub/Sub. It lets you quickly discover, understand and manage the data stored in your data lake.
You can view my blog for detailed information on Data Catalog.
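As an illustration of the discovery side, the google-cloud-datacatalog client lets you search assets by metadata; the project id and search query below are placeholders.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Limit the search scope to one project (placeholder id).
scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.append("my-project")

# Find assets whose schema contains a customer_id column.
for result in client.search_catalog(scope=scope, query="column:customer_id"):
    print(result.linked_resource)
```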
Cloud Search
Cloud Search is an enterprise search tool that allows you to quickly, easily and securely find information.
Cloud IAM
Please refer to my data governance blog for more details.
Cloud KMS
Cloud KMS is a hosted key management service that lets you manage encryption keys in the cloud. You can create, rotate, use and destroy AES-256 encryption keys just as you would in your on-premises environment. You can also use the Cloud KMS REST API to encrypt and decrypt data.
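A small sketch of symmetric encrypt/decrypt calls with the google-cloud-kms client follows; the project, key ring and key names are assumptions and must already exist.

```python
from google.cloud import kms

client = kms.KeyManagementServiceClient()

# Placeholder key resource; the key ring and key must already exist.
key_name = client.crypto_key_path(
    "my-project", "us-central1", "datalake-keyring", "pii-key"
)

plaintext = b"sensitive customer record"
encrypted = client.encrypt(request={"name": key_name, "plaintext": plaintext})

decrypted = client.decrypt(request={"name": key_name, "ciphertext": encrypted.ciphertext})
assert decrypted.plaintext == plaintext
```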
b. Data Operations
Operations, monitoring and support are a key part of any data lake implementation. Google provides various tools to accomplish this.
Stackdriver
Google Cloud Platform offers Stackdriver, a comprehensive set of services for collecting data on the state of applications and infrastructure. Specifically, it supports three ways of collecting and receiving information.
Please refer to my blog cloud operations for full details.
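For instance, writing a custom log entry from a pipeline component takes only a few lines with the google-cloud-logging client; the log name and payload fields are illustrative choices.

```python
from google.cloud import logging

client = logging.Client()
logger = client.logger("datalake-ingestion")   # illustrative log name

# Structured entries make it easy to filter and alert on pipeline events.
logger.log_struct(
    {"event": "ingestion_complete", "source": "orders_feed", "rows_loaded": 15320},
    severity="INFO",
)
```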
Audit Logging
Cloud Audit Logs maintains three audit logs for each Google Cloud project, folder and organization: Admin Activity, Data Access and System Event. Google Cloud services write audit log entries to these logs to help you answer the question of "who did what, where, and when?" within your Google Cloud resources.
Please refer to my blog for more details.