Data center monitoring uses a mix of tools, systems, and software to keep a close eye on the environmental, power, IT hardware, and security aspects of a computing facility. Simply assuming that a data center will be reliable and perform well is optimistic. In reality, achieving and maintaining 99.999% uptime, known as “five nines” and equivalent to roughly five minutes of downtime per year, demands a methodical, process-oriented approach.

Data center monitoring involves tracking the operational aspects, including uptime, performance, and environmental conditions of a data center, using tools and sensors. It monitors factors like temperature, humidity, and power usage, and issues alerts when these elements deviate from set thresholds.

Curious what exactly gets monitored in a data center and why it’s so crucial? Keep reading to explore the different components that the data center industry keeps an eye on, from temperature and airflow to power and security threats. We also explore the tools and systems that make data center monitoring possible, including sensors, communication protocols, and management software.

What is Data Center Monitoring?

Data center monitoring is the process of overseeing and managing the uptime, reliability, performance, environmental conditions, and security of a data center. This process utilizes a variety of tools, systems, and sensors to continuously collect and track data concerning the health of servers, storage devices, networking equipment, as well as power supply, cooling systems, and security measures.


This monitoring covers specific factors such as temperature, humidity, airflow, power consumption, the efficiency of cooling systems, server performance, storage utilization, network traffic levels, and access control systems. Moreover, data center monitoring systems are designed to issue alerts and notifications if measurements exceed or drop below predetermined thresholds.

By monitoring these elements, everyone with a stake in the facility can detect and address potential issues early on. That includes data center operators, IT managers, network administrators, security professionals, system engineers, facility managers, finance and accounting teams, corporate executives, and data center customers. This proactive approach helps avoid downtime, performance degradation, data loss, and security vulnerabilities.

Components Monitored in a Data Center


Monitoring a data center involves keeping track of the following key components:

  • IT Hardware:
    • Servers: Performance metrics like processing (CPU, GPU, memory), disk usage, and temperature
    • Storage Devices: Utilization levels, read/write speeds, and health status
    • Networking Equipment: Routers, switches, and firewalls for throughput, latency, and error rates
  • Environmental Conditions:
    • Temperature: Maintain a cool environment to prevent overheating
    • Humidity: Control moisture to avoid equipment damage
    • Airflow: Monitor for proper circulation and cooling efficiency
  • Power and Cooling:
    • Uninterruptible Power Supplies (UPS): Battery health, power output
    • Power Distribution Units (PDUs): Power delivery to racks and servers
    • Backup Generators: Fuel levels, operational status, and efficiency
    • Cooling Systems: Air conditioners, air handlers, effectiveness of airflow
  • Physical Security:
    • Surveillance Cameras: Monitor activity within the facility and for unauthorized access
    • Access Control Systems: Require the use of keycards or biometrics to grant access
    • Door Access: Track entry and exits, as well as lock status

Monitoring in a data center can range from a macro view of the entire floor space to a micro perspective focusing on individual racks, cabinets, and even specific outlets on a PDU, providing comprehensive oversight.

Importance of Data Center Monitoring

Data center monitoring is a 24/7 task, crucial for real-time tracking of uptime, reliability, performance, security, and more.


Here are the most important reasons for data center monitoring:

  1. Uptime and Reliability: Continuous monitoring helps in proactively identifying and resolving potential issues before they escalate into system outages or downtime. By closely tracking hardware failures, software crashes, and network connectivity issues through health polls, threshold-based alerts, and notifications, organizations can ensure their data centers remain highly available and reliable. This vigilance is crucial for avoiding service interruptions that could impact business operations and for adhering to service level agreement (SLA) objectives
  2. Performance Optimization: Data center monitoring tools offer valuable insights into the performance of servers, storage systems, networking equipment, and applications. These insights help pinpoint bottlenecks and inefficiencies, allowing for optimized resource allocation, effective load balancing, and timely infrastructure upgrades – with the ultimate goal of enhancing system performance
  3. Cost Management: Data center monitoring tools help in spotting underutilized resources and inefficiencies, enabling the reallocation or downsizing of resources to reduce operational expenses. Identifying “zombie” servers – those that consume power without performing any useful tasks – allows for their decommissioning in favor of more energy-efficient hardware, thus reducing power consumption and energy costs
  4. Environmental Conditions: Monitoring the environmental conditions within a data center, like temperature, humidity, and airflow, is crucial to prevent hardware damage and failure due to excessive heat or moisture. Such monitoring not only prolongs hardware lifespan but also helps operators keep conditions within their target ranges and meet efficiency goals, such as a specific Power Usage Effectiveness (PUE) level. For example, eliminating overcooling and addressing hotspots are common ways of improving energy efficiency
  5. Capacity Planning and Scalability: Data center monitoring yields insights into resource utilization, system limitations, and future needs, which are crucial for effective capacity planning and scalability. By analyzing trends and usage patterns, organizations can strategically scale compute, storage, and networking resources to meet demand, ensuring the data center supports business growth without excess provisioning or creating capacity constraints
  6. Security and Compliance: Monitoring access and activity within a data center is crucial for detecting unauthorized access, data breaches, and cybersecurity threats. It also plays a significant role in maintaining compliance with industry regulations and standards, offering detailed audit trails, logs, and records of data access and system modifications

Types of Data Center Monitoring

Data center monitoring encompasses several key areas: environmental monitoring, power monitoring, hardware monitoring, and security monitoring. Below is a detailed examination of each:

1. Environmental Monitoring

Environmental monitoring tracks the physical conditions within a data center, including temperature, humidity, airflow, water leakage, smoke, and vibration. It involves deploying numerous sensors throughout the facility to continuously monitor these parameters. Alerts and notifications are issued if any measurements exceed or drop below predetermined thresholds. This process is essential for preventing hardware damage, maintaining optimal operating conditions, and avoiding unexpected outages.


The components of data center environmental monitoring include:

Temperature

Data center monitoring tools compare temperature against predefined operational thresholds by measuring both the ambient environment and specific equipment temperatures. These thresholds matter because high temperatures can create hotspots around racks, potentially damaging the servers housed within. Conversely, low temperatures can increase the risk of condensation, posing additional hazards.

To mitigate these risks, temperature sensors in data centers are used to track changes in ambient temperature. ASHRAE, an organization focused on building systems and energy efficiency, recommends keeping operational hardware within the temperature range of 64°F (18°C) to 81°F (27°C). This range helps prevent performance degradation or complete shutdown of the hardware.

For effective monitoring, the strategic placement of temperature sensors throughout the data center is critical. These sensors are placed in various locations, including racks and air conditioning vents, to assess the cooling system’s efficiency and to facilitate the early detection of temperature-related issues. Specifically, this placement strategy involves:

  • In front of racks (at the cold aisle) to measure intake temperatures
  • Behind racks (at the hot aisle) to measure exhaust temperatures
  • Near air conditioning units to monitor supply air temperature
  • At various points in the computer room to assess the overall ambient temperature
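
As a rough illustration of how threshold-based temperature alerts might work, the sketch below checks cold-aisle intake readings against the 18°C to 27°C range cited above. The sensor names and the `check_intake` helper are hypothetical and not taken from any particular monitoring product.

```python
# Recommended intake range cited in the text (ASHRAE): 18°C to 27°C.
LOW_C, HIGH_C = 18.0, 27.0


def check_intake(readings_c):
    """Return {sensor: "low" | "ok" | "high"} for intake temperatures in °C."""
    status = {}
    for sensor, temp in readings_c.items():
        if temp < LOW_C:
            status[sensor] = "low"
        elif temp > HIGH_C:
            status[sensor] = "high"
        else:
            status[sensor] = "ok"
    return status


# Hypothetical cold-aisle sensor readings (names are made up).
readings = {"rack-a1-front": 22.5, "rack-a2-front": 29.1, "rack-b1-front": 16.8}

# Keep only the sensors that would trigger an alert.
alerts = {s: v for s, v in check_intake(readings).items() if v != "ok"}
print(alerts)
```

A real system would typically also apply a delay or hysteresis so that a single noisy reading does not trigger an alert.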

Humidity

Humidity monitoring in data centers is crucial for maintaining optimal operating conditions and preventing equipment failure. This process is enabled by hygrometers, a type of sensor, which measure the air’s moisture content.

ASHRAE suggests that the relative humidity in data centers should be around 60%, with acceptable levels ranging from 20% to 80%. Keeping relative humidity within these limits avoids the risks associated with extreme conditions: relative humidity above 90% can lead to condensation and corrosion, while levels below 10% increase the likelihood of electrostatic discharge.

Maintaining relative humidity within these recommended parameters is vital for preventing hardware damage, minimizing the risk of equipment failure, and avoiding the significant costs associated with downtime.
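
Condensation forms when a surface is colder than the dew point of the surrounding air, so some operators track dew point alongside relative humidity. The sketch below estimates dew point with the Magnus approximation. This formula and its constants are a common choice and an assumption here, not something the article prescribes, and the function names are illustrative.

```python
import math

# Magnus approximation constants (a commonly used parameterization over
# liquid water). These are an assumption, not values from the article.
A, B = 17.62, 243.12  # B is in °C


def dew_point_c(temp_c, rh_percent):
    """Approximate dew point (°C) from air temperature and relative humidity."""
    gamma = math.log(rh_percent / 100.0) + A * temp_c / (B + temp_c)
    return B * gamma / (A - gamma)


def condensation_risk(surface_temp_c, air_temp_c, rh_percent):
    """True if a surface is at or below the dew point of the surrounding air."""
    return surface_temp_c <= dew_point_c(air_temp_c, rh_percent)


# Air at 25°C and 60% relative humidity has a dew point of about 16.7°C.
print(round(dew_point_c(25.0, 60.0), 1))
print(condensation_risk(15.0, 25.0, 60.0))  # a 15°C pipe surface is at risk
```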

Airflow

Airflow within data centers is monitored through the use of sensors and devices that measure air velocity, temperature, and pressure. This monitoring is crucial for delivering optimal cooling across data center equipment, thereby reducing the formation of hotspots.


These sensors are placed in numerous locations across the data center to allow for comprehensive monitoring. Key locations include containment areas, air transfer points, plenums, points of HVAC supply and exit, ceiling spaces, and directly within racks and cabinets. For instance, differential air pressure sensors are commonly installed at the top and bottom of racks, between aisles, between raised floor perforated tiles, and within vents and air plenums.

Airflow measurement in data centers typically employs cubic feet per minute (CFM) to quantify the volume of air circulated and meters per second (m/s) to gauge the velocity of air movement. The primary goal of monitoring airflow is to fulfill the cooling needs of IT equipment’s internal components, thus averting overheating, the development of hotspots, and the risk of thermal runaway.
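
The two units are related through the area the air passes through: volumetric flow equals average velocity multiplied by open area. The sketch below shows the conversion, assuming a hypothetical opening with 0.25 m² of free area. The helper name and example values are illustrative only.

```python
# 1 cubic meter = 35.3147 cubic feet, and there are 60 seconds per minute.
M3_PER_S_TO_CFM = 35.3147 * 60


def airflow_cfm(velocity_m_s, open_area_m2):
    """Volumetric flow in CFM from average air velocity and open area."""
    return velocity_m_s * open_area_m2 * M3_PER_S_TO_CFM


# Hypothetical example: 2 m/s average velocity through 0.25 m^2 of open area.
print(round(airflow_cfm(2.0, 0.25)))  # about 1059 CFM
```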

Water Leakage

Water leakage in data centers is typically monitored by installing moisture sensors in key areas, such as below raised floors and underneath pipes. This placement allows for the early detection of leaks within piping systems or flooding incidents, both of which can severely damage hardware assets and lead to outages. Various factors, including air conditioning leaks, condensation, burst pipes, or local plumbing failures, can cause water leakage in data centers.

Smoke

Smoke detectors in data centers typically utilize either optical or ionization technology to detect smoke particles in the air, triggering alarms when the concentration of smoke exceeds predetermined thresholds. The presence of smoke in a data center poses significant risks, including potential harm to staff working in the facility and physical damage to hardware due to its corrosive and conductive nature. Additionally, smoke can clog air filters, reduce airflow, and contaminate cooling systems, which ultimately results in operational disruptions and downtime.

Vibration

Vibration is typically monitored using accelerometers or vibration sensors, which measure both the frequency and amplitude of vibrations. In data centers, the equipment most commonly monitored for vibrations includes racks, cabinets, hard disk drives (HDDs), cooling systems (such as fans, compressors, and pumps), power supply units (PSUs), and backup generators.

For example, sustained vibration can damage hard disk drives (HDDs) over time, leading to data loss or corruption and to the added cost of replacing damaged drives. It is therefore important to closely monitor HDD makes and models that are known to be more prone to vibration-related failure.

2. Power Monitoring

Power monitoring is the tracking and analysis of electrical power usage, distribution, and efficiency to help identify areas of high energy consumption and potential imbalances, allowing for optimized power usage and proactive measures against outages. It uses power meters and sensors to track power consumption of individual IT devices, specific power components, entire racks and cabinets, computer rooms, and the overall data center facility meter.


Common power components monitored in data centers include:

  • Uninterruptible Power Supplies (UPS), including their battery systems
  • Power Distribution Units (PDUs)
  • Backup Generators
  • Automatic Transfer Switches (ATS)
  • Branch Circuits
  • Busways
  • Remote Power Panels (RPPs)

Figure: Automatic Transfer Switches (ATS) in Google's Singapore data center. Source: Google.

Uninterruptible Power Supply (UPS) Monitoring

Uninterruptible Power Supply (UPS) monitoring involves continuously tracking the performance, battery health, and load capacity of UPS systems to ensure consistent power delivery and identify potential utility power issues. Individual UPS units often come equipped with internal monitoring systems.

UPS monitoring systems are designed to track and report on various operational parameters, such as the:

  • Input voltage level coming from the utility grid
  • Output voltage level being supplied to the connected equipment
  • Voltage of the battery bank (string voltage)
  • Remaining charge level in the UPS batteries
  • Amount of time the UPS can run on battery power (estimated runtime)
  • Internal temperature of the UPS

This monitoring is crucial for maintaining the reliability and effectiveness of UPS systems, which provide emergency power to devices when the main power source fails and offer some protection against significant voltage drops.

Power Distribution Unit (PDU) Monitoring

Power Distribution Unit (PDU) monitoring is the real-time tracking and analysis of electrical power distribution and consumption metrics at the outlet level, enabling early detection of potential power issues and improved resource allocation. Within a data center, PDUs are commonly deployed in two specific configurations: Rack PDUs and Floor PDUs, both of which can be monitored.

PDU monitoring systems are designed to track and report on various operational parameters, such as the:

  • Input voltage level coming from the power source, typically the utility grid or a UPS system
  • Output voltage level being supplied to the connected equipment
  • Current draw of individual devices on each outlet in amps
  • Total current draw of all devices, racks, and cabinets connected to the PDU in amps
  • Power consumption of the equipment connected to the PDU in watts
  • Energy consumed over time by the equipment in kilowatt-hours (kWh)
  • Status of each outlet (on/off)

This monitoring is important for efficient energy management and preventing downtime due to electrical failures.
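
To show how the PDU parameters listed above relate to each other, here is a minimal sketch that derives total current, power, and energy from per-outlet readings. The 230 V supply, the 0.95 power factor, the outlet currents, and the helper names are all assumed values for illustration.

```python
def outlet_power_w(voltage_v, current_a, power_factor=1.0):
    """Real power in watts for a single-phase outlet."""
    return voltage_v * current_a * power_factor


def energy_kwh(power_samples_w, interval_s):
    """Energy in kWh from evenly spaced power samples (simple rectangle sum)."""
    return sum(power_samples_w) * interval_s / 3600.0 / 1000.0


# Hypothetical rack PDU: 230 V supply, per-outlet current draw in amps.
outlet_amps = {1: 1.2, 2: 0.8, 3: 2.0}
total_amps = sum(outlet_amps.values())
total_watts = sum(outlet_power_w(230.0, a, 0.95) for a in outlet_amps.values())

# One hour of 5-minute samples at a constant load.
kwh = energy_kwh([total_watts] * 12, interval_s=300)
print(round(total_amps, 1), round(total_watts, 1), round(kwh, 3))
```

Many metered PDUs report watts and kWh directly, in which case this arithmetic is mainly useful as a sanity check on the reported figures.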

Power Quality Monitoring (PQM)

Power Quality Monitoring (PQM) is the process of continuously analyzing the health and stability of the electrical supply to safeguard equipment from power anomalies and improve system reliability.


PQM relies on hardware and software components to specifically track quality parameters such as:

  • Voltage: Electric potential difference between two points
  • Current: Flow of electric charge through a conductor
  • Frequency: Number of complete cycles per second in an alternating current
  • Power Factor: Efficiency of power usage; ratio of real power to apparent power in a circuit
  • Reactive Power: Power that oscillates between source and load, not used for work
  • Harmonics: Distortions in the waveform of the AC power supply
  • Transients: Sudden, short-duration voltage spikes or dips that deviate from the normal voltage levels
  • Ground Current: Current that flows through the ground in electrical systems

Monitoring these power quality factors helps facilitate efficient and reliable power delivery, ensuring the power supply remains within acceptable thresholds for data center equipment.
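
Several of these parameters are linked by simple arithmetic. The sketch below computes power factor and reactive power from real and apparent power using the power triangle, which holds for sinusoidal waveforms without significant harmonics. It also checks a voltage reading against a tolerance band. The ±10% band and all example values are assumptions, not figures from a standard.

```python
import math


def power_factor(real_kw, apparent_kva):
    """Ratio of real power to apparent power."""
    return real_kw / apparent_kva


def reactive_kvar(real_kw, apparent_kva):
    """Reactive power from the power triangle: S^2 = P^2 + Q^2."""
    return math.sqrt(apparent_kva**2 - real_kw**2)


def within_band(value, nominal, tolerance_pct):
    """True if value is within +/- tolerance_pct of nominal."""
    return abs(value - nominal) <= nominal * tolerance_pct / 100.0


# Hypothetical feed: 8 kW of real power at 10 kVA of apparent power.
print(power_factor(8.0, 10.0))   # 0.8
print(reactive_kvar(8.0, 10.0))  # 6.0

# Hypothetical 230 V nominal supply with an assumed +/-10% band.
print(within_band(241.0, 230.0, 10.0), within_band(255.0, 230.0, 10.0))
```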

3. Hardware Monitoring

Hardware monitoring tracks the status and health of IT equipment within the data center, including servers, storage systems, and networking equipment. It monitors for hardware faults, software errors, capacity constraints, and performance issues, allowing for early detection and quick troubleshooting to prevent downtime.


Manufacturers of IT equipment commonly embed sensors and controllers into their products. These components enable direct monitoring of a wide range of metrics from the equipment itself. Key metrics include power consumption, temperature levels, airflow, and resource utilization, covering CPU, memory, and I/O (Input/Output) operations.

Server Monitoring

Server monitoring is the process of continuously observing a server’s system resources like CPU usage, memory utilization, Input/Output (I/O) operations, network traffic, and application performance to ensure optimal operation and uptime. It specifically oversees the health and performance of both physical servers and virtual machines (VMs), generating alerts and notifications for issues like hardware failure or resource saturation.

Key benefits of server monitoring include:

  • Proactive Problem Identification: Identify potential issues (e.g., performance bottlenecks, disk space exhaustion) before they cause downtime or impact user experience
  • Improve Resource Utilization: Enhance efficiency by identifying overburdened or underutilized servers, facilitating resource allocation adjustments to optimize performance, which leads to cost savings
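
As a small example of proactive problem identification, the sketch below flags disk space exhaustion before it becomes an outage. It uses Python's standard `shutil.disk_usage`. The 80% and 90% thresholds and the `disk_alert` helper are assumptions chosen for illustration.

```python
import shutil


def disk_alert(used_bytes, total_bytes, warn_pct=80.0, crit_pct=90.0):
    """Classify disk usage as "ok", "warning", or "critical"."""
    pct = 100.0 * used_bytes / total_bytes
    if pct >= crit_pct:
        return "critical"
    if pct >= warn_pct:
        return "warning"
    return "ok"


# shutil.disk_usage returns a named tuple with total, used, and free bytes.
usage = shutil.disk_usage("/")
print(disk_alert(usage.used, usage.total))
```

In practice, an agent would run a check like this on a schedule and forward the result to the central monitoring system rather than printing it.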

Storage Monitoring

Storage monitoring is the continuous oversight and analysis of a data center’s storage resources to deliver optimal performance, capacity management, and data integrity. It observes devices like hard disk drives (HDDs), solid-state drives (SSDs), and tape drives, as well as different storage configurations, such as storage area networks (SANs) and network-attached storage (NAS).

Network Monitoring

Network monitoring is the process of continuously observing a data center network’s performance, health, and availability to help identify bottlenecks, performance degradation, and potential security breaches within the network. It specifically oversees devices such as switches, routers, firewalls, and load balancers, as well as overall network traffic and throughput. These monitoring solutions also help track all network paths leading to and from the data center.

4. Security Monitoring

Security monitoring is dedicated to protecting data center assets from unauthorized access and cyber threats. It includes monitoring for intrusions, vulnerabilities, and malware, as well as ensuring compliance with security policies and standards.

  • Physical Security Monitoring: Involves the surveillance and control of the physical environment of the data center to prevent unauthorized physical access, theft, or damage to hardware assets. Techniques include the use of surveillance cameras, security guards, access control systems (such as keycards or biometric scanners), and contact closure sensors on cabinet doors
  • Cybersecurity Monitoring: Focuses on protecting the data center’s digital assets from cyber threats such as hacking, malware, and phishing by continuously scanning for vulnerabilities and breaches
  • Network Security Monitoring: Entails the continuous analysis of network traffic and logs to detect and respond to threats or unusual activity that could indicate a security breach or an attempted attack on the network infrastructure

Tools and Systems in Data Center Monitoring

Data center monitoring tools and systems aggregate diverse datasets from various subsystems into a unified management repository and graphical user interface, enhancing visibility into unique performance metrics, facility waste, and operational issues. These tools and systems transform what would otherwise be a labor-intensive process – requiring physical inspections of different equipment and infrastructure – into a more efficient and streamlined operation.


These data center monitoring tools and systems comprise both hardware and software components:

Sensors

Sensors serve as fundamental monitoring components, essential for collecting data on the environmental conditions within a data center, which is vital for understanding equipment performance and taking preventive actions to avoid damage. They track various physical conditions, including temperature, humidity, airflow, water leakage, smoke, and vibration. Typically, data collection involves polling the equipment within a data center several times per hour to gather this information.

Both wireless and wired connections are used for sensors, and the choice between them depends on several factors such as the type of sensor, its location within the data center, the criticality of the data being monitored, and the specific requirements for speed, reliability, security, and cost. TCP/IP facilitates the communication between sensors and monitoring systems in data centers, over both wired and wireless connections.

Communication Protocols

The most common communication protocols used for data center monitoring are:

  • Simple Network Management Protocol (SNMP): SNMP is used to collect and organize information about managed devices on IP networks and to modify that information to change device behavior
  • Modbus: A communication protocol, originally serial, that is primarily used for connecting industrial electronic devices. A TCP/IP variant (Modbus TCP) is also widely used
  • Hypertext Transfer Protocol Secure (HTTPS): Allows for access to web-based management interfaces of devices. It is used to retrieve data that might not be easily accessible through SNMP or Modbus

Data Center Infrastructure Management (DCIM)

Data Center Infrastructure Management (DCIM) software acts as a central nervous system for data center monitoring, providing a real-time and holistic view of key factors like temperature, humidity, power usage, cooling system performance, network traffic, and server performance.

DCIM stores, analyzes, and visually displays the data collected from monitoring a data center in business intelligence and analytics dashboards and reports, facilitating data-driven data center management decisions. It is also capable of sending alerts when certain operational thresholds are exceeded or conditions are met.

By utilizing a DCIM tool to monitor operational data collected from sensors, data center operators can manage and resolve events promptly, while also significantly reducing the likelihood of unexpected issues.

Typical issues that can be mitigated include improper environmental conditions, such as the development of hotspots, which are areas where the temperature is considerably higher than in surrounding locations. Hotspots create a risk of IT equipment overheating, which can result in outages and service disruptions.
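
One simple way to flag hotspots, sketched below, is to compare each inlet temperature with the median across nearby racks and report any location that exceeds it by a fixed margin. The 5°C margin, the rack labels, and the `find_hotspots` helper are assumptions for illustration and do not describe how any specific DCIM product works.

```python
from statistics import median


def find_hotspots(inlet_temps_c, margin_c=5.0):
    """Return locations whose temperature exceeds the median by more than margin_c."""
    baseline = median(inlet_temps_c.values())
    return {loc: t for loc, t in inlet_temps_c.items() if t - baseline > margin_c}


# Hypothetical inlet temperatures (°C) for five racks in one row.
temps = {"A1": 22.0, "A2": 23.0, "A3": 22.5, "A4": 30.5, "A5": 21.5}
print(find_hotspots(temps))  # {'A4': 30.5}
```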


Power and Environmental Monitoring and Control System (PEMCS)

Power and Environmental Monitoring and Control System (PEMCS) software plays a crucial role in data center monitoring by centrally analyzing and managing the power supply and environmental conditions. It is also well-suited to track the raw data required for calculating Power Usage Effectiveness (PUE), which is a ratio that measures how efficiently a data center uses energy for its computing equipment.
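
The PUE calculation itself is simple division: total facility energy over the energy delivered to IT equipment, where lower is better and 1.0 is the theoretical ideal. The sketch below shows the arithmetic with made-up meter readings. The function name and the example figures are illustrative.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical readings over the same period: 1,500 kWh at the facility meter,
# of which 1,000 kWh reached the IT equipment.
print(pue(1500.0, 1000.0))  # 1.5
```

Both readings need to cover the same time window, which is one reason a system that already collects power data centrally is convenient for this calculation.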

PEMCS acts as an umbrella system that encompasses the following software solutions (among others) used in data center monitoring:

  • Electrical Power Monitoring System (EPMS): Focuses specifically on monitoring various aspects of electrical power, including: voltage levels, current flow, power consumption, and power quality
  • Environmental Monitoring System (EMS): Concentrates on monitoring various environmental factors that can impact equipment performance and energy efficiency, such as: temperature, humidity, and airflow

On-Site and Remote Monitoring

Data center monitoring is typically conducted 24/7, both on-site and remotely, to meet the specific requirements and configurations of a data center. This monitoring usually takes place in what are commonly referred to as “command centers.”

  • On-Site Monitoring: Many data centers have a dedicated space known as a Network Operations Center (NOC). The NOC is commonly located near the main entrance to the computer room and is equipped with screens displaying real-time data about server health, network performance, environmental conditions (like temperature and humidity), and security alerts
  • Remote Monitoring: Some organizations or third-party service providers operate remote monitoring centers. These sites can monitor multiple data centers from a single location, leveraging software tools to keep an eye on system health, security breaches, and performance metrics
Mary Zhang covers Data Centers for Dgtl Infra, including Equinix (NASDAQ: EQIX), Digital Realty (NYSE: DLR), CyrusOne, CoreSite Realty, QTS Realty, Switch Inc, Iron Mountain (NYSE: IRM), Cyxtera (NASDAQ: CYXT), and many more. Within Data Centers, Mary focuses on the sub-sectors of hyperscale, enterprise / colocation, cloud service providers, and edge computing. Mary has over 5 years of experience in research and writing for Data Centers.
