Organizations are awash with data. So, they build data lakes to harness the massive amounts of data coming from a variety of sources. Historically, a data warehouse was used for storing and analyzing structured data – but we now live in a world where unstructured and semi-structured data are increasingly prevalent. Given the magnitude of data generation, cloud service providers like Amazon Web Services (AWS) and Microsoft Azure are accommodating this new paradigm – delivering scale to the data lake.
Data lakes provide a centralized and scalable repository for all types of data, from multiple diverse sources. They allow organizations to more efficiently gather, prepare, and use their vast amounts of data for analytics to ultimately make informed, fact-based business decisions.
Dgtl Infra provides an in-depth overview of data lakes, which are enabling organizations to store, analyze, and utilize data from a variety of sources. These data repositories are empowering organizations to understand, refine, and analyze petabytes – and even exabytes – of information, which is constantly being generated.
As Google’s former CEO, Eric Schmidt, described the data landscape:
“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”
This quote is from 2010 – over a decade ago. It is estimated that by 2025, the world will generate 463 exabytes of data in a single day. The data being generated is rising exponentially in what is a virtual tsunami of data.
What is a Data Lake?
A data lake is a centralized, managed, and scalable repository that stores unstructured, semi-structured, and structured data from multiple diverse sources in a flat, non-hierarchical manner. This repository stores data in its native format and can process any variety of it, without practical limits on file size or data volume.
Data lakes are meant to store and manage data until a user decides to utilize it for analysis, analytics, or other use cases.
To use an analogy, a natural lake collects water from various sources such as precipitation, streams, runoff, and groundwater. Similarly, data is ingested into a data lake from multiple and diverse sources, including: business applications, e-mail, images, text files, social media, clickstreams, Internet of Things (IoT) devices, sensors, and actuators.
Data Lake Diagram
Below is a high-level diagram that displays the i) various sources from where data is onboarded into a data lake and ii) use cases for which data is read from the data lake.
Importantly, this data can be ingested from any environment: on-premises, cloud, or edge computing.
READ MORE: On-Premise to Cloud Migration – a Journey to AWS and Azure
Why do Companies Use Data Lakes?
Companies use data lakes in an attempt to capture the promise of big data, which often remains unrealized in their business. Within small-, medium-, and large-sized companies, data is often siloed across multiple systems and departments, making it challenging for users to find and access all the data that they need.
Additionally, query performance is often limited by concurrency and scalability issues, while complex architectures result in broken data pipelines, performance degradation, error-prone data movement, and security and governance risks.
Data lakes are designed to manage large volumes of big data – think petabytes (1 million gigabytes) and exabytes (1 billion gigabytes).
READ MORE: What is Data Gravity? AWS, Azure Pull Data to the Cloud
Benefits of a Data Lake
Critically, companies can move raw data, through batch and stream processing, into a data lake without transforming it. In turn, data lakes provide the following benefits to organizations:
- Total Cost of Ownership (TCO): data lakes provide an inexpensive means to store enormous amounts of data
- Scalability: data lakes can scale easily, which is becoming increasingly important as data is being generated at a staggering rate every minute, hour, and day
- Data Management: data lakes break down data silos by combining data sets from different systems into a single repository. In turn, this can simplify the process of finding relevant cross-organizational data
- Analytics: data lakes speed up the process of preparing data for analytics uses such as dashboards, visualizations, big data processing, real-time analytics, and machine learning (ML)
- Machine Learning (ML) and Artificial Intelligence (AI): data lakes act as a vehicle for advanced analytics, such as machine learning model training and data science tools, which enable artificial intelligence against the data stored in a data lake
- Governance: governance involves tracking the lineage of data, given that data is being ingested from different sources, as well as enforcing governance policies, such as data accessibility and anonymization
Use Case Example for a Data Lake – E-Commerce
E-commerce companies generate tremendous amounts of data, whether it be on products or customers, making these businesses a relevant example of how data lakes can be used.
Procurement, sales, logistics, and inventory activities of an e-commerce company produce a combination of unstructured, semi-structured, and structured data. By employing a data lake, an e-commerce company could store all of its data from these disparate sources, until it is ready to analyze and use the data to:
- Predict demand based on consumer buying patterns at given times of the year, such as specific holidays or seasons
- Plan product assortment and mix in an optimum manner
- Provide personalized product recommendations to customers
- Identify segments of customers who would be interested in new product launches based on past buying behavior
- Outline possible steps to improve order delivery and fulfillment times
What is a Data Lake vs Data Warehouse vs Database?
A data lake, data warehouse, and database are all repositories of data. However, each of these data repositories differs in meaningful ways, making data lakes, data warehouses, and databases optimized for different uses.
Key differences between a data lake, data warehouse, and database are:
- Data Lake: stores current and historical data from one or more systems in its raw format (unstructured, semi-structured, structured), with minimal or no processing. A data lake is designed to store massive amounts of data and allows for easy analysis of the data
- Data Warehouse: stores current and historical data from one or more systems in a pre-defined and fixed schema – meaning how the data is organized is set in advance. This, in combination with a plan for processing the data, allows for easy analysis of the data but makes a data warehouse less flexible than a data lake
- Database: stores live, real-time data required to power an application, in tables with rows and columns. A database has a pre-defined schema – meaning how the data is organized. Since a database contains data in a standardized and predictable format, this data is known as structured data
Overall, a data lake and data warehouse are used for analytics and reporting (e.g., monthly sales reports), whereas a database is designed for operations and transactions. Additionally, data lake and data warehouse data is refreshed periodically, whereas database data is current and detailed.
Comparing a data lake directly to a data warehouse, the most significant differences are their support for data types and schema flexibility. Data lakes support all data types, including unstructured, whereas data warehouses primarily store structured data. Additionally, data lakes impose no pre-defined schema, allowing them to ingest data in a variety of formats, whereas data warehouses have a pre-defined and fixed schema.
Extract, Transform, Load (ETL) / Schema-on-Read
When data is read from a data lake for a given use case, the data is extracted and transformed at that point – rather than before storage, as in a traditional extract, transform, load (ETL) pipeline – which results in the flexibility of schema-on-read.
Rather than requiring data to be inputted in a pre-defined structure (like a data warehouse), the schema-on-read process of a data lake creates structure as data is being searched, allowing users to ask new and different questions at any time, without having to re-architect a schema – as would be required in a relational database.
More specifically, this process builds a schema at read time, rather than write time, and does not require pre-defined knowledge about the data it is processing. In turn, schema-on-read enables different users to run a variety of queries, regardless of changes in format of the data being inputted into the data lake.
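To make schema-on-read concrete, below is a minimal Python sketch (with hypothetical file and field names) that reads raw JSON events and derives their structure only at query time:

```python
import json
import pandas as pd

# Each line is a raw JSON event landed in the lake with no declared schema.
# ("events.jsonl", "device", and "duration_ms" are hypothetical names.)
with open("events.jsonl") as f:
    records = [json.loads(line) for line in f]

# The schema is derived here, at read time, from whatever fields exist --
# new fields appearing in future events require no re-architected schema.
df = pd.json_normalize(records)
print(df.groupby("device")["duration_ms"].mean())
```

Because the columns are inferred from whatever fields the events actually contain, new attributes added upstream simply appear as new columns on the next read.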
Data Types are Driving How Data is Stored
Machine sensors, Internet of Things (IoT) devices, social media, e-mail, text messages, images, videos, and weblogs are amongst the varied sources that are presently generating incredible amounts of data.
Often referred to as big data, most of this data is either unstructured or semi-structured and represents a vast trove of untapped patterns, trends, and insights waiting to be transformed and fed into machine learning (ML) programs and data analytics algorithms.
For example, social media platforms such as Facebook, Twitter, Instagram, and LinkedIn generate massive amounts of structured and unstructured data. Activity related to events and actions such as posts, likes, shares, followers, and groups represent structured data, whereas videos, images, and audio files uploaded and shared represent unstructured data.
Traditional data warehouse systems, because of their inherent schema-oriented storage, cannot cope with data that is unstructured or semi-structured. As such, data lakes, which can store unstructured, semi-structured, and structured data, are becoming increasingly important.
Cloud-Based Data Lake Solutions
Historically, data lake solutions were built mainly on-premises. However, the economies of scale and cheap object storage that the cloud provides have resulted in cloud-based data lake solutions emerging as a cost-effective and inherently scalable option.
As such, existing data lake implementations are being migrated to the cloud and new data lake deployments are being planned on the cloud, from the ground up.
Cloud service providers (CSPs) deliver several integrated and interoperable services that facilitate:
- Gathering, ingesting, cataloging, and governing data
- Creating, securing, and managing a data lake
- Analyzing and querying data or utilizing it for any given consumption use case
Typically, CSPs provide data lake solutions that are bundled as pay-as-you-go services, which offer:
- On-demand access to business users, data scientists, or data analysts
- Interoperability with analytical tools and applications through endpoints, interfaces, and application programming interfaces (APIs)
- Security safeguards and access controls to regulate any access by end users, services, applications, or APIs
READ MORE: Top 10 Cloud Service Providers Globally in 2023
As discussed next, cloud service providers (CSPs), including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, deliver object storage, which is one of the fundamental components of a data lake.
Amazon Web Services (AWS) – Data Lake
Amazon Web Services (AWS) offers an integrated set of services for customers to implement a data lake storage and analytics solution.
For object storage, AWS utilizes Amazon S3 (Simple Storage Service). S3 is a robust and scalable object storage service that is the foundation of hosting data lakes in AWS.
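As an illustration, below is a minimal sketch of landing a raw file in S3 using the AWS SDK for Python (boto3); the bucket name and object key are hypothetical:

```python
import boto3

# Land a raw file in the S3 bucket that backs the data lake.
# "example-data-lake" and the object key are hypothetical names.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="clickstream-2024-01-01.json",
    Bucket="example-data-lake",
    Key="raw/clickstream/2024-01-01.json",
)
```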
Another data lake solution that is extensively used is AWS Glue, which is a data integration service that runs extract, transform, load (ETL) jobs to load data into a data lake. For example, these ETL jobs can be run as soon as new data becomes available in Amazon S3.
AWS Glue offers the following key capabilities when creating a data lake:
- Ability to create crawlers which can discover key data sources by connecting to compatible data stores such as Amazon Relational Database Service (RDS), database instances on Amazon Elastic Compute Cloud (Amazon EC2), Amazon DynamoDB, and Amazon S3 objects – see the crawler sketch after this list
- Stores metadata about the data being ingested into a data lake through the AWS Glue Data Catalog, enabling easy discovery of that data by end users
- Allows end users to create extract, transform, load (ETL) jobs for cleansing and transforming data
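Putting the crawler capability into code, the following is a minimal boto3 sketch; the crawler name, IAM role ARN, catalog database, and S3 path are all hypothetical placeholders:

```python
import boto3

# Create a crawler that scans the lake's raw zone and writes table
# definitions into the AWS Glue Data Catalog (all names are placeholders).
glue = boto3.client("glue")
glue.create_crawler(
    Name="raw-zone-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    DatabaseName="example_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/"}]},
)
glue.start_crawler(Name="raw-zone-crawler")
```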
AWS Lake Formation
AWS Lake Formation is a fully managed service that enables automation of the entire process of creating a data lake. The service orchestrates the discovery, ingestion, cleansing, and cataloging of data. By using AWS Lake Formation, an organization can reduce the time it takes to set up an operational data lake.
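As a rough illustration of these building blocks, the boto3 sketch below registers an S3 location with Lake Formation and grants a user access to a cataloged database; all ARNs and names are hypothetical:

```python
import boto3

# A minimal Lake Formation sketch; the bucket, user ARN, and catalog
# database below are hypothetical placeholders.
lf = boto3.client("lakeformation")

# Register an S3 location so Lake Formation can manage access to it
lf.register_resource(
    ResourceArn="arn:aws:s3:::example-data-lake/raw",
    UseServiceLinkedRole=True,
)

# Grant a data analyst read-level access to a cataloged database
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/analyst"},
    Resource={"Database": {"Name": "example_catalog_db"}},
    Permissions=["DESCRIBE"],
)
```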
Amazon Athena, Amazon EMR, and Amazon SageMaker
AWS provides the following services for working with data that has already been loaded into a data lake:
- Amazon Athena and Amazon EMR are services for data querying and analysis – see the query sketch after this list
- Amazon SageMaker, a cloud machine learning (ML) platform, makes it simple to build and deploy machine learning models with inputs from data sets stored in AWS data lakes
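For instance, a query can be submitted to Amazon Athena via boto3 as in the sketch below; the SQL, catalog database, and output location are hypothetical:

```python
import boto3

# Run a SQL query directly against data cataloged in the lake; the
# database, table, and output bucket are hypothetical.
athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT device, COUNT(*) FROM clickstream GROUP BY device",
    QueryExecutionContext={"Database": "example_catalog_db"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```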
READ MORE: Amazon Web Services (AWS) Regions and Availability Zones
Microsoft Azure – Data Lake
Microsoft Azure has several services which enable organizations to quickly set up data lakes and allow end users, such as data scientists and business analysts, to access and analyze the data. Uniquely, Azure provides multi-modal storage, meaning its data lake services support both:
- Object storage, which Microsoft Azure refers to as blob storage
- File shares, known as Azure Files, which are accessible via Server Message Block (SMB) protocol, Network File System (NFS) protocol, and Azure Files REST API
Azure Data Lake Storage Gen2 is Microsoft’s set of capabilities that make Azure Storage the foundation for building enterprise data lakes on Azure. Specifically, these capabilities provide file system semantics, file-level security, and scale.
Overall, Azure Data Lake Storage Gen2 is designed to store massive amounts of data, meaning multiple petabytes of information, while sustaining hundreds of gigabits per second (Gbps) of throughput.
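As an illustration, below is a minimal sketch that uploads a raw file into an Azure Data Lake Storage Gen2 file system using the azure-storage-file-datalake Python SDK; the account URL, credential, file system, and path are hypothetical:

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Upload a raw file into an ADLS Gen2 file system; the account URL,
# credential, file system, and path are hypothetical placeholders.
service = DataLakeServiceClient(
    account_url="https://exampleaccount.dfs.core.windows.net",
    credential="<storage-account-key>",
)
file_system = service.get_file_system_client("raw")
file_client = file_system.get_file_client("clickstream/2024-01-01.json")
with open("2024-01-01.json", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```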
Additionally, Microsoft Azure provides data lake analytical capabilities through the following two managed services:
- Azure Synapse Analytics: service that joins data integration, enterprise data warehousing, and big data analytics
- Azure Databricks: Apache Spark-based big data analytics service designed for data analysts, data engineers, data scientists, and machine learning engineers
READ MORE: Microsoft Azure’s Regions and Availability Zones
Google Cloud – Data Lake
Google Cloud Platform (GCP) provides several services for creating cost-efficient, scalable, and reliable data lakes. For example, Google Cloud Storage is a managed service for storing and accessing any amount of unstructured data.
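For example, landing an object in a Cloud Storage bucket takes only a few lines with the google-cloud-storage Python SDK; the bucket and object names below are hypothetical:

```python
from google.cloud import storage

# Land an object in the Cloud Storage bucket that backs the lake;
# bucket and object names are hypothetical.
client = storage.Client()
bucket = client.bucket("example-data-lake")
blob = bucket.blob("raw/clickstream/2024-01-01.json")
blob.upload_from_filename("2024-01-01.json")
```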
Overall, Google Cloud’s data lake services provide the following capabilities:
- Securely and quickly transfer data from diverse sources into a data lake
- Facilitate easy discovery of data by end users through the cataloging of data
- Ensure interoperability with native, third-party, and open-source analytical services and engines
Google Cloud provides multiple methods to transfer data into a data lake, including:
- Dedicated Interconnect: direct physical connections between an organization’s on-premises network and Google’s network in a colocation facility
- Partner Interconnect: connectivity between an organization’s on-premises network and their Virtual Private Cloud (VPC) network through a network service provider. These connections are commonly used when an organization’s data center is in a physical location that is unable to reach a Dedicated Interconnect colocation facility
- Cloud Data Fusion: fully managed data integration service that helps users build and manage extract, transform, load (ETL) data pipelines
- Transfer Appliance: storage device with 7 terabytes to 300 terabytes of capacity that enables users to physically transfer and ship their data to Google for upload
- Pub/Sub: service used for streaming analytics and data integration pipelines to ingest and distribute data – see the publishing sketch after this list
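As an illustration of the streaming path, the sketch below publishes a single event to a Pub/Sub topic using the google-cloud-pubsub Python SDK; the project and topic names are hypothetical:

```python
from google.cloud import pubsub_v1

# Publish a streaming event destined for the data lake; the project
# and topic names are hypothetical.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "clickstream-events")
future = publisher.publish(topic_path, b'{"event": "page_view", "device": "mobile"}')
print(future.result())  # message ID once the publish succeeds
```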
Finally, Google Cloud provides tools such as Dataproc, Dataflow, and BigQuery for querying and analyzing data, as well as for building machine learning models, in both batch and stream modes.
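For example, a batch query can be issued from the google-cloud-bigquery Python SDK, as in the hedged sketch below; the project, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

# Query data in place; the project, dataset, and table are hypothetical.
client = bigquery.Client()
query = """
    SELECT device, COUNT(*) AS views
    FROM `example-project.example_dataset.clickstream`
    GROUP BY device
"""
for row in client.query(query).result():
    print(row.device, row.views)
```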