Definition

A centralised repository that stores large volumes of raw data in its native format — structured, semi-structured, and unstructured — until it is needed for analysis. Unlike data warehouses, which store data in predefined schemas, data lakes use a schema-on-read approach that provides flexibility for diverse analytical workloads including machine learning, real-time analytics, and ad hoc exploration. Data lakes are a significant technology intangible asset, with value derived from the breadth and depth of data they contain and the analytical capabilities they enable.
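The schema-on-read idea can be sketched in a few lines: raw records land in the lake exactly as they arrive, and a schema is imposed only when someone reads them. This is a minimal illustration using hypothetical JSON event records, not any particular lake engine.

```python
import json

# Hypothetical raw events, stored as-received with no upfront schema:
# shapes and types vary from record to record.
raw_lake = [
    '{"user": "a1", "amount": 9.99, "channel": "web"}',
    '{"user": "b2", "amount": "12.50"}',   # amount arrived as a string
    '{"user": "c3", "clicks": 4}',         # a different shape entirely
]

def read_with_schema(raw_records):
    """Schema-on-read: coerce types and fill gaps only at query time."""
    for line in raw_records:
        rec = json.loads(line)
        yield {
            "user": rec.get("user"),
            "amount": float(rec["amount"]) if "amount" in rec else None,
        }

rows = list(read_with_schema(raw_lake))
```

A warehouse would instead reject or transform the second and third records at load time (schema-on-write); here the raw data is preserved and each analytical workload can apply its own reading of it.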

Complementary Terms

Concepts that frequently appear alongside Data Lake in practice.

Data Warehouse

A centralised repository of structured, processed data optimised for analytical querying and business intelligence reporting. Data warehouses use a schema-on-write approach, meaning data is cleaned, transformed, and organised into predefined structures before loading.

Training Data

The dataset used to train a machine learning model, comprising examples from which the model learns patterns, relationships, and decision boundaries. High-quality, proprietary training data is a significant competitive advantage and intangible asset, particularly in regulated industries where data scarcity creates barriers to entry.

Synthetic Data

Artificially generated data that mimics the statistical properties of real-world datasets, used to train machine learning models when actual data is scarce, sensitive, or expensive to obtain. Synthetic data enables AI development in privacy-constrained domains such as healthcare and finance, while reducing data acquisition costs and regulatory exposure.
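A minimal sketch of the core idea, assuming a simple case where "statistical properties" means the mean and standard deviation of a single numeric field: fit those moments on the real (sensitive) values, then sample synthetic values from the fitted distribution. Real synthetic-data tooling models far richer structure; the variable names and sample values here are illustrative only.

```python
import random
import statistics

# Hypothetical sensitive values that cannot be shared directly.
real_amounts = [12.0, 15.5, 9.9, 20.1, 14.3, 11.7]

def synthesise(real, n, seed=42):
    """Draw n synthetic values from a normal distribution fitted
    to the real data's mean and standard deviation."""
    mu = statistics.mean(real)
    sigma = statistics.stdev(real)
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [rng.gauss(mu, sigma) for _ in range(n)]

fake = synthesise(real_amounts, 1000)
```

The synthetic sample preserves the aggregate shape of the real data while containing none of the original records, which is what makes it usable in privacy-constrained domains.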

Data Clean Room

A secure, privacy-preserving technology environment that enables multiple parties to combine and analyse their datasets without any party gaining access to the others' raw data. Data clean rooms use cryptographic techniques, aggregation rules, and access controls to enable collaborative analytics while maintaining data privacy compliance.

Customer Data Platform (CDP)

A software system that creates a unified, persistent customer database accessible to other systems by collecting and integrating customer data from multiple sources — including CRM, website analytics, email, social media, transactions, and customer service interactions. CDPs resolve customer identities across channels and devices to build comprehensive individual profiles, enabling personalised marketing, customer journey orchestration, and advanced segmentation.
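The identity-resolution step described above can be illustrated with a union-find sketch: events that share any identifier (here, hypothetical `email`, `device_id`, and `crm_id` fields) are merged into one profile. Real CDPs use probabilistic matching on top of deterministic rules like this.

```python
def resolve_identities(events):
    """Group events into unified profiles: any two events sharing an
    identifier end up in the same profile (union-find over IDs)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    id_keys = ("email", "device_id", "crm_id")  # assumed identifier fields
    for ev in events:
        ids = [v for k, v in ev.items() if k in id_keys]
        for other in ids[1:]:
            union(ids[0], other)

    profiles = {}
    for ev in events:
        ids = [v for k, v in ev.items() if k in id_keys]
        profiles.setdefault(find(ids[0]), []).append(ev)
    return list(profiles.values())

events = [
    {"email": "a@x.com", "device_id": "d1"},   # links email to device
    {"device_id": "d1", "page": "/pricing"},   # anonymous web event
    {"email": "b@y.com"},                      # unrelated customer
]
profiles = resolve_identities(events)
```

The anonymous web event is stitched onto the first customer's profile because it shares `d1`, so the result is two profiles rather than three fragments.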

Master Data Management (MDM)

The processes, governance, policies, and technology used to ensure that an organisation's critical shared data entities — such as customers, products, suppliers, and accounts — are accurate, consistent, and controlled across all systems and business units. MDM creates a single trusted source of master data, reducing duplication, resolving conflicts, and enabling reliable reporting and analytics.
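One common MDM conflict-resolution mechanism is a survivorship rule: when duplicate records for the same entity disagree, field values survive from the most trusted source. This is a minimal sketch with hypothetical source names and fields, not a full MDM implementation.

```python
def golden_record(duplicates, precedence=("crm", "erp", "web")):
    """Merge duplicate records field-by-field, trusting sources in a
    fixed precedence order (a simple survivorship rule)."""
    merged = {}
    # Most trusted source first; its non-empty values win.
    ranked = sorted(duplicates, key=lambda r: precedence.index(r["source"]))
    for rec in ranked:
        for field, value in rec.items():
            if field != "source" and value and field not in merged:
                merged[field] = value
    return merged

duplicates = [
    {"source": "web", "email": "old@example.com", "phone": "0123"},
    {"source": "crm", "email": "new@example.com"},
]
master = golden_record(duplicates)
```

The CRM's email wins because CRM outranks web in the precedence list, while the phone number survives from the web record because no higher-ranked source supplies one.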

Data Quality Score

A quantitative measure of data fitness for its intended use, typically assessed across dimensions including accuracy, completeness, consistency, timeliness, uniqueness, and validity. Data quality scores enable organisations to monitor and improve the reliability of their data assets, prioritise remediation efforts, and establish trust in analytical outputs.
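Three of the dimensions above can be made concrete with a small scorer: completeness as the share of filled cells, uniqueness as the share of distinct IDs, and validity as the share of records whose `amount` field is numeric. The field names, validity rule, and equal default weights are illustrative assumptions, not a standard.

```python
def quality_score(records, weights=None):
    """Score a dataset on completeness, uniqueness, and validity,
    each in [0, 1], plus a weighted overall score."""
    weights = weights or {"completeness": 1, "uniqueness": 1, "validity": 1}

    total_cells = sum(len(r) for r in records)
    filled = sum(1 for r in records for v in r.values() if v not in (None, ""))
    completeness = filled / total_cells if total_cells else 0.0

    ids = [r.get("id") for r in records]
    uniqueness = len(set(ids)) / len(ids) if ids else 0.0

    # Validity rule assumed for illustration: "amount" must be numeric.
    valid = sum(1 for r in records if isinstance(r.get("amount"), (int, float)))
    validity = valid / len(records) if records else 0.0

    dims = {"completeness": completeness,
            "uniqueness": uniqueness,
            "validity": validity}
    overall = sum(dims[d] * weights[d] for d in dims) / sum(weights.values())
    return dims, overall

sample = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": "x"},      # invalid amount
    {"id": 2, "amount": None},     # duplicate id, missing amount
]
dims, overall = quality_score(sample)
```

Per-dimension scores make remediation actionable: here uniqueness and validity, not completeness, are what drag the overall score down.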

Third-Party Data

Data collected by entities that do not have a direct relationship with the individuals whose data is being gathered, typically aggregated from multiple sources and sold to other organisations for marketing, analytics, or enrichment purposes. The value and availability of third-party data have declined sharply due to privacy regulations (GDPR, CCPA), browser restrictions on third-party cookies, and growing consumer demand for data transparency.
