This blog covers how to collect, store, search, and discover metadata across different data lake work streams on various clouds. The metadata falls into four categories:
Technical Metadata (system-generated, auto-ingested)
Business Metadata (user-provided or inferred)
Data Pipeline Metadata (system-generated and user-provided)
Data Lineage Metadata (automated through reusable components and system-level defaults, for auditing only)
Any solution we use to manage the metadata above should provide the following:
➔A unified view into all the data no matter where it is stored
➔Integration with analytical tools
➔A way to automatically build all metadata and keep it in sync with our data as it evolves
➔Data governance enforced through IAM controls
➔A proper API for integrating with other tools such as data pipelines
➔Easy search across all of the metadata
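To make these requirements concrete, here is a minimal in-memory sketch of a catalog that keeps the four metadata categories on each dataset and supports a simple keyword search. It is only an illustration and not how any of the cloud services below are implemented; the names `CatalogEntry` and `InMemoryCatalog`, and the sample bucket paths and owners, are made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One dataset, carrying the four kinds of metadata listed above."""
    name: str
    location: str  # hypothetical storage path, on any cloud
    technical: Dict[str, str] = field(default_factory=dict)  # system-generated
    business: Dict[str, str] = field(default_factory=dict)   # user-provided or inferred
    pipeline: Dict[str, str] = field(default_factory=dict)   # system and user provided
    lineage: List[str] = field(default_factory=list)         # upstream dataset names


class InMemoryCatalog:
    """A single view over all entries, regardless of where the data lives."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def upsert(self, entry: CatalogEntry) -> None:
        # Re-registering a dataset overwrites the old entry, which keeps
        # the catalog in sync as the data evolves.
        self._entries[entry.name] = entry

    def search(self, term: str) -> List[str]:
        # Case-insensitive substring match over names, locations,
        # and every metadata key and value.
        needle = term.lower()
        hits = []
        for entry in self._entries.values():
            haystack = [entry.name, entry.location]
            for section in (entry.technical, entry.business, entry.pipeline):
                haystack.extend(section.keys())
                haystack.extend(section.values())
            if any(needle in text.lower() for text in haystack):
                hits.append(entry.name)
        return sorted(hits)


catalog = InMemoryCatalog()
catalog.upsert(CatalogEntry(
    name="orders_raw",
    location="s3://example-bucket/orders/",
    technical={"format": "parquet"},
    business={"owner": "sales-analytics"},
    pipeline={"schedule": "daily"},
))
catalog.upsert(CatalogEntry(
    name="orders_clean",
    location="gs://example-bucket/orders_clean/",
    technical={"format": "avro"},
    business={"owner": "sales-analytics"},
    lineage=["orders_raw"],
))

print(catalog.search("parquet"))  # ['orders_raw']
print(catalog.search("sales"))    # ['orders_clean', 'orders_raw']
```

A managed catalog adds what this sketch leaves out: automatic ingestion of technical metadata, access control, and an API that other tools can call.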
Google Cloud Data Catalog is a fully managed, scalable metadata management service that lets organizations quickly discover, manage, and understand all their data in Google Cloud. It offers an easy-to-use search interface for data discovery, a flexible cataloging system for capturing both technical and business metadata, and a security and compliance foundation built on integrations with Cloud Data Loss Prevention (DLP) and Cloud Identity and Access Management (IAM).
AWS Glue Data Catalog collects metadata from data sources such as Amazon RDS, Amazon S3, Amazon Redshift, and Amazon DynamoDB, and lets users search and discover data through the AWS-provided UI or through APIs.
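As a rough sketch of the API route, the helper below searches the Glue Data Catalog by keyword and follows `NextToken` pagination. It assumes the client passed in behaves like the boto3 Glue client, whose `search_tables` call accepts `SearchText` and returns a `TableList`. The helper name `search_glue_tables` and the stub client are made up here so the example runs without AWS credentials.

```python
from typing import Any, Dict, List


def search_glue_tables(glue_client: Any, text: str) -> List[str]:
    """Return "database.table" names matching text, following NextToken pages."""
    names: List[str] = []
    kwargs: Dict[str, Any] = {"SearchText": text}
    while True:
        page = glue_client.search_tables(**kwargs)
        for table in page.get("TableList", []):
            names.append(f"{table.get('DatabaseName', '')}.{table['Name']}")
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


class StubGlueClient:
    """Hypothetical stand-in for a real Glue client; serves two fixed pages."""

    def search_tables(self, **kwargs: Any) -> Dict[str, Any]:
        if "NextToken" not in kwargs:
            return {
                "TableList": [{"DatabaseName": "sales", "Name": "orders_raw"}],
                "NextToken": "page-2",
            }
        return {"TableList": [{"DatabaseName": "sales", "Name": "orders_clean"}]}


found = search_glue_tables(StubGlueClient(), "orders")
print(found)  # ['sales.orders_raw', 'sales.orders_clean']
```

Against a real account, the stub would be replaced with `boto3.client("glue")`, assuming boto3 is installed and credentials are configured.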
Azure Data Catalog is a fully managed cloud service. It can collect metadata from different data sources and allows users to search and discover data.
Note: The images above are courtesy of the respective cloud providers and are taken from their documentation.