Data gravity is shifting to the cloud, as providers like Amazon Web Services (AWS) and Microsoft Azure pull ever-increasing volumes of data into their platforms. Meanwhile, as shown through Digital Realty’s Data Gravity Index, colocation data center providers are facilitating this data migration by acting as an on-ramp to the cloud.
Data gravity is the concept that data has mass, and the larger that mass grows, the greater its gravitational pull, but the harder it is to move. Therefore, large data volumes should be used but not moved. Instead, applications and services will move closer to the data through forces of attraction.
Dgtl Infra provides an in-depth overview of data gravity, including its meaning, the problems it presents, and its transition through hybrid cloud environments. Additionally, we review how data gravity is an opportunity for cloud service providers (CSPs) like Amazon Web Services (AWS) and Microsoft Azure. Finally, Dgtl Infra references Digital Realty’s Data Gravity Index, which measures the growing intensity and gravitational force of enterprise data creation.
What is Data Gravity?
Data gravity is the concept that, like a planet, data has mass, and the larger that mass becomes, the greater its gravitational pull, and the more likely that applications and services are attracted to it.
Historically, data was created in, or backhauled to, centralized locations for processing. However, today, data is being created and processed everywhere – in office buildings, homes, cars, as well as on every smartphone and smart device.
As enterprises create ever more data, they aggregate, store, and exchange this data, attracting progressively more applications and services to analyze and process it. This “attraction” occurs because these applications and services require higher bandwidth and/or lower latency access to the data.
Therefore, as data accumulates in size, instead of pushing data over networks towards applications and services, “gravity” begins pulling applications and services to the data. This process repeats, which produces a compounding effect, meaning that as the scale of data grows, it becomes “heavier” and increasingly difficult to replicate and relocate.
Ultimately, the “weight” of this data being created and stored generates a “force” that results in an inability to move the data, hence the term data gravity.
Data Gravity Problem
Data gravity presents a fundamental problem for enterprises: the inability to move data at scale. Consequently, data gravity impedes enterprise workflow performance, heightens security & regulatory concerns, and increases costs.
These challenges arise because it is too slow and costly to backhaul data from various global locations to an on-premises or cloud data center. Said differently, while it may be cheap to create data, it is expensive to move that data back and forth, because of transit costs and data egress fees.
For example, a developer or enterprise may want to start moving data around their organization. At gigabytes and terabytes of data, the transfer process is affordable and painless, taking only minutes to hours.
[Figure: Units of Data]
However, once these developers and enterprises reach the scale of petabytes of data, which a number of organizations have stored in the cloud, moving the data becomes both time- and cost-prohibitive.
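To make the difference in scale concrete, here is a minimal sketch of idealized transfer time at each unit of data. The 1 Gbps link speed, the full-utilization default, and the helper names `transfer_seconds` and `humanize` are illustrative assumptions rather than figures from any provider; real transfers are slower because of protocol overhead and shared links.

```python
# Idealized network transfer time for different data volumes.
# Assumptions (not from any provider): decimal units, a dedicated 1 Gbps
# link, and 100% sustained utilization unless `efficiency` says otherwise.

GB = 10**9   # bytes
TB = 10**12  # bytes
PB = 10**15  # bytes


def transfer_seconds(size_bytes: float, link_bps: float, efficiency: float = 1.0) -> float:
    """Seconds needed to push `size_bytes` over a link of `link_bps` bits per second."""
    if link_bps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_bps must be positive and efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bps * efficiency)


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that fits (days, hours, minutes, seconds)."""
    for unit, length in (("days", 86_400), ("hours", 3_600), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    link = 1e9  # hypothetical 1 Gbps line
    for label, size in (("1 GB", GB), ("1 TB", TB), ("1 PB", PB)):
        print(f"{label}: {humanize(transfer_seconds(size, link))}")
```

Under these assumptions, a gigabyte moves in seconds and a terabyte in a couple of hours, while a single petabyte occupies the link for roughly three months.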
Overall, data gravity’s “problem” means that data proximity is critical. As a result, enterprises need to architect their digital infrastructure based on factors including where business units are located and where end users of data are situated. Additionally, local data storage regulations add further considerations around data privacy and compliance, which are collectively known as data localization.
Data gravity is a particularly acute problem because enterprises create and store their growing data mass in hybrid cloud environments. Hybrid cloud is where enterprises deploy applications, which utilize compute and storage, in a combination of different environments, including on-premises, private cloud, and public cloud.
Data Gravity Examples
Below are two examples, one from Delta Air Lines and one from the artificial intelligence (AI) domain, of how data gravity affects hybrid cloud environments:
- Delta Air Lines: the airline is migrating 90% of its applications and databases to cloud environments by 2024 in order to improve customer experience (e.g., the booking process) and efficiency across its flight operations. However, because of the data gravity, security, complexity, and cost of its legacy applications, Delta chose to implement a hybrid cloud architecture built on IBM’s Red Hat OpenShift platform. As a result, Delta will move the vast majority of its data into the public cloud but will still maintain certain sensitive data, such as credit card information, on-premises, creating data gravity in both environments
- Artificial Intelligence (AI): enterprises that put data to use for artificial intelligence purposes have different requirements about where their data needs to reside. Certain enterprises do not and cannot have this data reside in the public cloud. At the same time, data gravity makes this data very difficult to move. Therefore, artificial intelligence computation capability often needs to be “pulled” to where the enterprise’s data resides – bringing applications and services to the data
As highlighted in the examples above, data gravity is being created in two separate environments – public cloud and on-premises – with applications and services being attracted to each location. Additionally, enterprises will utilize colocation data centers to provide connectivity services, which enable them to connect their IT infrastructure for the purposes of exchanging traffic and accessing cloud services.
Through a physical connection, known as a cross connect, enterprises and cloud service providers can securely exchange data in multiple global metros to efficiently influence their data gravity.
Data Gravity in the Cloud
Enterprises are migrating their IT infrastructure to the cloud, given its benefits in scalability, pay-per-use, and speed & agility, but also because of the impact of data gravity. As enterprises store more data in the cloud, they begin attracting more applications and services, which need proximity to that data in the cloud.
Data Gravity Shifting to the Cloud
Seagate Technology, a data storage company, projects that by 2025, ~70% of an enterprise’s data will be stored in a combination of the cloud and the edge.
As shown in the chart above, the cloud represents the largest component of this data gravity shift, driven by large enterprises (e.g., banks, insurance companies, manufacturers) and government entities migrating from on-premises to cloud environments.
Once enterprises move data from on-premises locations to cloud environments, a number of new cloud services become available. Two primary examples of cloud services that relate to data gravity are data warehouses and data lakes, which are defined as follows:
- Data Warehouse: central repository of information that is specifically designed for data analytics. Data warehouses hold relational data from transactional systems, operational databases, and line of business applications
- Data Lake: central repository that allows all data to be stored, including structured, semi-structured, and unstructured information
By utilizing data warehouses and data lakes, enterprises begin actively working within their chosen cloud environment. In turn, this leads to sharing, collaboration, and establishing relationships with the data, both within an enterprise, and – equally important – between different organizations. Therefore, both intracompany and intercompany dependence on the data occurs, which leads to stickiness, retention, and data gravity in the cloud.
Data Gravity in AWS
Amazon Web Services (AWS), the largest cloud service provider globally, showcases the importance of data gravity through its S3 storage service, which currently holds more than 200 trillion objects, equivalent to almost 29,000 objects for each resident on planet Earth.
AWS has formed a virtuous circle, whereby applications and services are created to analyze and process data on the platform, which, in turn, creates more data that gets analyzed by other applications and services.
As previewed above, AWS offers the two primary cloud services that relate to data gravity: Amazon Redshift, its data warehouse service, and a data lake offering. Both of these services allow enterprises to store massive amounts of data within AWS.
Amazon Redshift – Data Warehouse
Amazon Redshift is AWS’ data warehouse service, which uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. The service allows customers to analyze data from terabytes to petabytes and run complex analytical queries.
AWS’ data lake service is a central repository that allows customers to store all of their structured and unstructured data at any scale. This includes data from Internet of Things (IoT) devices, websites, mobile applications, social media, and corporate applications.
In a data lake architecture, enterprises can move petabytes to exabytes of data into AWS. Given this immense scale, the data becomes the center of gravity, and all users, applications, and services are pulled in to gain access to it.
But how do enterprises move petabytes and exabytes of data from their on-premises locations and/or colocation data centers to AWS?
AWS has created three physical appliances to more rapidly and cost-effectively ingest enormous amounts of data to AWS, through offline data transfer, as opposed to network data transfer. These physical appliances are known as AWS Snowball (42 terabytes usable), AWS Snowball Edge (80 terabytes usable), and AWS Snowmobile (up to 100 petabytes). To clarify, the data is moved physically: rather than traveling over a network, the appliance holding the data is shipped, or picked up and hauled, to an AWS data center.
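As a rough illustration of what these capacities mean in practice, the sketch below estimates how many appliances a migration would need. The usable capacities are the figures quoted above; the 1 petabyte migration size, the decimal units, and the function name `appliances_needed` are hypothetical choices made for the example.

```python
# How many offline-transfer appliances a migration would need.
# Usable capacities are the figures quoted in the text; the 1 PB migration
# size is a hypothetical example. All sizes are expressed in terabytes.

TB = 1
PB = 1_000 * TB  # decimal units (assumption)

USABLE_CAPACITY_TB = {
    "AWS Snowball": 42 * TB,
    "AWS Snowball Edge": 80 * TB,
    "AWS Snowmobile": 100 * PB,
}


def appliances_needed(data_tb: int, usable_tb: int) -> int:
    """Ceiling division: a partially filled appliance still counts as one."""
    if data_tb < 0 or usable_tb <= 0:
        raise ValueError("data_tb must be non-negative and usable_tb must be positive")
    return -(-data_tb // usable_tb)


if __name__ == "__main__":
    migration_tb = 1 * PB  # hypothetical migration size
    for name, capacity_tb in USABLE_CAPACITY_TB.items():
        count = appliances_needed(migration_tb, capacity_tb)
        print(f"{name}: {count} appliance(s) for {migration_tb} TB")
```

The calculation ignores practical factors such as how many devices can be ordered or loaded in parallel, so it is a lower bound on logistics rather than a plan.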
These physical appliances are the most tangible example of AWS attempting to move data gravity to the cloud and, in turn, accelerate cloud adoption by established enterprises.
AWS Snowmobile – Data Gravity – Case Study
AWS Snowmobile is a data transfer service in which each Snowmobile, a 45-foot-long shipping container hauled by a semi-trailer truck, can move up to 100 petabytes of data to AWS.
[Video: AWS Snowmobile overview]
DigitalGlobe, a provider of high-resolution Earth imagery, was the first customer to use the AWS Snowmobile data transfer solution to solve its data gravity problem.
Over the past 20 years, DigitalGlobe has amassed more than 100 petabytes of data, with this volume growing at a rate of 10 petabytes every year. The company used the AWS Snowmobile solution to move petabytes of its archive data from its own on-premises IT infrastructure to the AWS cloud.
Over a conventional 10 gigabit per second (Gbps) network line, DigitalGlobe’s multi-petabyte data transfer would have taken months and would have been cost prohibitive. With AWS Snowmobile, however, DigitalGlobe was able to deliver petabytes of data in weeks, while saving on costs.
As a reference point, an AWS Snowmobile job costs $0.005/GB per month, based on the amount of provisioned Snowmobile storage capacity.
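A minimal worked example of that rate follows. It assumes decimal units (1 petabyte = 1,000,000 gigabytes) and covers only the per-gigabyte monthly charge quoted above; any other fees are outside the sketch, and the function name is hypothetical.

```python
# Monthly Snowmobile charge implied by the quoted rate.
# Assumptions: decimal units and no fees beyond the per-GB monthly rate.

SNOWMOBILE_RATE_USD_PER_GB_MONTH = 0.005  # rate quoted in the text
GB_PER_PB = 1_000_000                     # decimal units (assumption)


def snowmobile_monthly_cost(provisioned_pb: float) -> float:
    """Monthly charge in USD for `provisioned_pb` petabytes of provisioned capacity."""
    if provisioned_pb < 0:
        raise ValueError("provisioned_pb must be non-negative")
    return provisioned_pb * GB_PER_PB * SNOWMOBILE_RATE_USD_PER_GB_MONTH


if __name__ == "__main__":
    for pb in (10, 50, 100):
        print(f"{pb} PB provisioned: ${snowmobile_monthly_cost(pb):,.0f} per month")
```

On these assumptions, a fully provisioned 100 petabyte Snowmobile works out to about $500,000 per month.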
Data Gravity in Azure
Microsoft Azure, the second largest cloud service provider globally, views data gravity as a key performance indicator (KPI). The company describes the Data Estate as the multiple locations where an enterprise’s data is stored, both virtually and geographically, including operational databases, data warehouses, and data lakes. Bringing the Data Estate of an enterprise onto Microsoft Azure creates a flywheel effect in the cloud, which results in a gravitational pull for more data, applications, and services.
Beyond data warehouses and data lakes, Microsoft Azure highlights its Microsoft Purview service as a driver of data gravity. To this end, on a recent earnings call, Microsoft Corporation’s Chief Executive Officer, Satya Nadella, stated that “regulatory requirements” and “governance on data” are examples of reasons why “Azure Purview becomes, again, a pretty big driver of that data gravity to the cloud from a governance perspective”.
Microsoft Purview is a combination of the former Azure Purview and Microsoft 365 Compliance portfolio. The data governance service helps enterprises manage their on-premises, multi-cloud, and software-as-a-service (SaaS) data.
For example, Microsoft Purview creates a holistic map of an enterprise’s data, with automated data discovery and sensitive data classification. Built on top of this map are applications and services that create environments for data discovery, access management, and insights about an enterprise’s data. Data gravity becomes apparent as more of these applications and services are “pulled” towards Microsoft Purview.
Microsoft Azure has created three physical appliances to expeditiously and cost-effectively ingest massive amounts of data to Azure, through offline data transfer, as opposed to network data transfer.
As shown above, these physical appliances are known as Data Box Disk (35 terabytes usable), Data Box (80 terabytes usable), and Data Box Heavy (800 terabytes usable).
Microsoft Azure prepares and ships these devices directly to customers. Once the customer copies its data onto the Data Box appliance, it can ship the device back to an Azure data center within the Azure region it desires.
Data Gravity Index
The Data Gravity Index is a global forecast created by Digital Realty, a data center operator, that measures the growing intensity and gravitational force of enterprise data creation. Digital Realty uses the index to help enterprises find optimal locations (i.e., data centers) to store their data, a notion that becomes increasingly important as this data mass and activity continue to scale.
Data Gravity Index Score
The Data Gravity Index Score measures the intensity and gravitational force of enterprise data growth for Global 2000 Enterprises across 53 metros and 23 industries globally. This score, as measured in gigabytes per second, provides a relative proxy for determining data creation, aggregation, and processing.
Globally, Data Gravity Intensity, as measured in gigabytes per second, is expected to grow at a 139% compound annual growth rate (CAGR) through 2024. Regionally, the Data Gravity Intensity tool forecasts the following growth rates through 2024:
Data Gravity Intensity
|Region|Compound Annual Growth Rate (CAGR)|
|Europe, the Middle East and Africa|133%|
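Because these rates compound, a 139% CAGR means far more than a doubling each year. The sketch below shows the arithmetic. The forecast’s base year is not stated here, so the number of years is left as a parameter, and the four-year horizon in the example is purely hypothetical, as are the function names.

```python
# Compounding arithmetic behind a CAGR figure.
# The growth rates come from the text; the horizon is a hypothetical input.


def growth_multiple(cagr: float, years: float) -> float:
    """Total growth multiple after `years` at a compound annual rate `cagr` (1.39 = 139%)."""
    if cagr <= -1:
        raise ValueError("cagr must be greater than -100%")
    return (1 + cagr) ** years


def implied_cagr(start: float, end: float, years: float) -> float:
    """Inverse calculation: the CAGR that takes `start` to `end` over `years`."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end, and years must all be positive")
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    horizon = 4  # hypothetical number of years
    for region, rate in (("Global", 1.39), ("EMEA", 1.33)):
        print(f"{region}: {growth_multiple(rate, horizon):.1f}x over {horizon} years")
```

At 139% a year, intensity multiplies by 2.39 annually, which compounds to more than 5.7 times after two years.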
Additionally, Digital Realty identifies certain metros as having significant data flows or “gravitational force” between each other, which is one of the key drivers for interconnection bandwidth. The gravitational attraction between metros that house significant volumes of data (e.g., London and Amsterdam) creates a higher Data Gravity Intensity.
Data Gravity Index Formula
Digital Realty developed the following Data Gravity Index Formula to measure, quantify, and determine the creation, aggregation, and private exchange of enterprise data globally:
The components of this Data Gravity Index Formula can be defined as follows:
- Data Mass: volume of data that is accumulated, meaning stored or held, in a metro over a period of time
- Data Activity: amount of data that is in motion, through creation and interactions, in a metro over a period of time
- Bandwidth: average amount of bandwidth available in a metro or between two metros
- Latency: average latency between all metros and a single metro, or between two specific metros
Ultimately, bandwidth is a multiplier to data gravity, with higher bandwidth meaning a metro has more potential, while latency is an inhibitor to data gravity, with higher latency meaning a metro has less potential.
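The interplay of these four components can be sketched in code. This assumes the form in which the formula is commonly cited, Data Mass × Data Activity × Bandwidth ÷ Latency², and the squared latency term in particular should be treated as an assumption, since only the multiplier and inhibitor roles are described in the text. The function name, units, and sample values are hypothetical.

```python
# Sketch of a data gravity intensity calculation.
# Assumption: intensity = mass * activity * bandwidth / latency ** 2.
# The latency exponent is exposed as a parameter because the squared term
# is an assumption, not something stated in the text.


def data_gravity_intensity(
    data_mass: float,
    data_activity: float,
    bandwidth: float,
    latency: float,
    latency_exponent: float = 2.0,
) -> float:
    """Relative intensity score: bandwidth multiplies it, latency inhibits it."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    return data_mass * data_activity * bandwidth / latency ** latency_exponent


if __name__ == "__main__":
    # Hypothetical metro values in arbitrary but consistent units.
    base = data_gravity_intensity(data_mass=2, data_activity=3, bandwidth=4, latency=2)
    more_bandwidth = data_gravity_intensity(2, 3, 8, 2)
    more_latency = data_gravity_intensity(2, 3, 4, 4)
    print(f"baseline: {base}")
    print(f"bandwidth doubled: {more_bandwidth}")
    print(f"latency doubled: {more_latency}")
```

With a squared latency term, doubling bandwidth doubles the score while doubling latency cuts it to a quarter, which is consistent with latency acting as the stronger inhibitor.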
Overall, in order to use enterprise data with higher throughput and lower latency, applications and services need to be in close proximity to the data.