In a digitized economy, IT service slowdowns can make the difference between making a little money or a lot, between losing customers or retaining them, and between DevOps and business processes that are rusty and slow or well-oiled and fast.
AIOps (artificial intelligence for IT operations), a new class of IT monitoring and analytics software, is well positioned to address this challenge. We recently discussed how AIOps software shows overall health scores for all your core and edge IT systems so you can make the best decisions fast. We’ll now explore how Dell’s CloudIQ AIOps software flags unusual activity so you can act proactively to manage and protect the storage devices where your data lives.
Acting on the Unusual with Performance Anomaly Detection
Storage systems face constantly changing workloads, so their performance varies with the type of workload: write-heavy (such as data snapshots), read-heavy (such as retrieving cold data), and high- or low-bandwidth operations. Each workload type can be affected differently by conditions such as overload, hardware malfunction, or a virus or cyberattack, to name a few. Ultimately, storage performance anomalies can significantly degrade application performance.
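To make the workload categories above concrete, here is a minimal Python sketch that labels a workload sample by its read/write mix and average I/O size. The class `WorkloadSample`, the function `classify`, and the thresholds (70% reads, 30% reads, 64 KiB) are hypothetical illustrations, not CloudIQ's categories or cut-offs, and treating a large average I/O size as a proxy for bandwidth-oriented work is a simplifying assumption.

```python
from dataclasses import dataclass

# Hypothetical thresholds chosen for illustration only.
READ_HEAVY_PCT = 70.0   # at or above this share of reads -> read-heavy
WRITE_HEAVY_PCT = 30.0  # at or below this share of reads -> write-heavy
LARGE_IO_KIB = 64.0     # average I/O size at or above this -> bandwidth-oriented


@dataclass
class WorkloadSample:
    read_pct: float    # percentage of I/Os that are reads (0-100)
    avg_io_kib: float  # average I/O size in KiB


def classify(sample: WorkloadSample) -> str:
    """Label a sample by read/write mix and by average I/O size."""
    if sample.read_pct >= READ_HEAVY_PCT:
        mix = "read-heavy"
    elif sample.read_pct <= WRITE_HEAVY_PCT:
        mix = "write-heavy"
    else:
        mix = "mixed"
    # Large average I/Os used as a rough proxy for bandwidth-oriented work.
    size = "high-bandwidth" if sample.avg_io_kib >= LARGE_IO_KIB else "low-bandwidth"
    return f"{mix}/{size}"


if __name__ == "__main__":
    print(classify(WorkloadSample(read_pct=90, avg_io_kib=256)))  # read-heavy/high-bandwidth
    print(classify(WorkloadSample(read_pct=10, avg_io_kib=8)))    # write-heavy/low-bandwidth
```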
Technical Solution: Performance Anomaly Detection
Three main factors distinguish CloudIQ Performance Anomaly Detection:
- Accurately detecting performance impacts in storage systems is challenging because workload patterns are always changing. A patented performance impact algorithm gives CloudIQ an edge by detecting performance impacts only on workloads whose characteristics stay within a particular range for at least an hour. It therefore ignores transient performance impacts on workloads that change over a brief period and accurately detects persistent performance impacts: the ones that matter to the business.
- The solution finds performance impacts for different workload types by characterizing each workload with metrics such as the read/write percentage and bandwidth. Using an approach that combines Little's Law (a result from queueing theory) with bucketing, the algorithm rebuilds its model every day to learn how performance drifts and to keep its predictions trustworthy.
- Using a unique model based on IOPS (input/output operations per second) and latency for each workload type, CloudIQ displays each performance impact on a simple graph showing its time, duration and size. For a performance-impacted region, you can see the top three possible causes and resource contention analyses.
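The bullets above describe the approach only at a high level. For reference, Little's Law states that the average number of items in a system equals the arrival rate multiplied by the average time each item spends in the system; for storage, average outstanding I/Os = IOPS × average latency, so 5,000 IOPS at 2 ms implies about 10 I/Os in flight. CloudIQ's patented algorithm is not published here, so the Python sketch below is a hypothetical illustration of two ideas from the text, not its implementation: bucketing samples by workload characteristics against a baseline rebuilt from the previous day's data, and reporting only impacts that persist for at least an hour while the workload stays in the same bucket. All names (`Sample`, `bucket_key`, `build_baseline`, `find_persistent_impacts`), the bucket scheme, the 2× latency threshold, and the one-sample-per-minute cadence are assumptions, and the sketch uses latency alone as the impact signal.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import median

Bucket = tuple[int, int]


@dataclass
class Sample:
    minute: int        # minute index; the sketch assumes one sample per minute
    read_pct: float    # percentage of I/Os that are reads (0-100)
    avg_io_kib: float  # average I/O size in KiB
    latency_ms: float  # average response time in milliseconds


@dataclass
class Impact:
    start_minute: int
    duration_min: int
    severity: float    # mean latency during the run divided by the bucket baseline


def bucket_key(s: Sample) -> Bucket:
    """Hypothetical bucketing: read-ratio decile and power-of-two I/O size."""
    read_bucket = min(int(s.read_pct // 10), 9)
    size_bucket = int(math.log2(max(s.avg_io_kib, 1.0)))
    return read_bucket, size_bucket


def build_baseline(history: list[Sample]) -> dict[Bucket, float]:
    """Median latency per bucket; re-run daily so the baseline follows drift."""
    by_bucket: dict[Bucket, list[float]] = defaultdict(list)
    for s in history:
        by_bucket[bucket_key(s)].append(s.latency_ms)
    return {key: median(lats) for key, lats in by_bucket.items()}


def find_persistent_impacts(samples: list[Sample], baseline: dict[Bucket, float],
                            factor: float = 2.0, min_minutes: int = 60) -> list[Impact]:
    """Report runs where latency exceeds factor x baseline for at least
    min_minutes consecutive minutes while the workload stays in one bucket."""
    impacts: list[Impact] = []
    run: list[Sample] = []

    def close_run() -> None:
        # Keep the run only if it lasted long enough; shorter runs are transient.
        if len(run) >= min_minutes:
            base = baseline[bucket_key(run[0])]
            ratio = sum(s.latency_ms for s in run) / len(run) / base
            impacts.append(Impact(run[0].minute, len(run), round(ratio, 2)))
        run.clear()

    for s in samples:
        key = bucket_key(s)
        base = baseline.get(key, 0.0)
        impacted = base > 0 and s.latency_ms > factor * base
        continues_run = (bool(run) and bucket_key(run[-1]) == key
                         and s.minute == run[-1].minute + 1)
        if impacted and (not run or continues_run):
            run.append(s)
        else:
            close_run()
            if impacted:
                run.append(s)  # start a new run in a different bucket or after a gap
    close_run()
    return impacts


if __name__ == "__main__":
    history = [Sample(m, 80, 8, 1.0) for m in range(1440)]  # yesterday: steady 1 ms
    baseline = build_baseline(history)
    today = ([Sample(m, 80, 8, 1.0) for m in range(100)]
             + [Sample(m, 80, 8, 3.0) for m in range(100, 190)]  # 90-minute slowdown
             + [Sample(m, 80, 8, 1.0) for m in range(190, 200)]
             + [Sample(m, 80, 8, 5.0) for m in range(200, 210)])  # 10-minute blip
    print(find_persistent_impacts(today, baseline))
    # [Impact(start_minute=100, duration_min=90, severity=3.0)]
```

Requiring the workload to stay in one bucket for the whole run is what lets the sketch ignore both short blips and shifts in workload character; a real model would also need to handle missing samples and buckets with too little history to trust their baseline.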
Performance Analytics for Other Types of Systems
CloudIQ also provides Performance Anomaly Detection for key server performance indicators, such as CPU and memory utilization and power, as well as for data protection appliances, IP network switches and SAN switches. On those devices, it monitors activity that could lead to anomalies, such as incoming and outgoing read/write throughput, errors, link resets, congestion…
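As a hedged illustration of watching device counters such as errors or link resets, the following Python sketch flags a sample when it exceeds the mean plus k standard deviations of a trailing window. This rolling z-score rule is a generic technique swapped in for illustration; CloudIQ's actual detection method for switches and appliances is not described here, and `flag_anomalies`, the 30-sample window, and k = 3 are hypothetical choices. It flags only upward spikes, which suits counters where a rise signals trouble.

```python
from statistics import mean, pstdev


def flag_anomalies(values: list[float], window: int = 30, k: float = 3.0,
                   min_std: float = 1e-9) -> list[int]:
    """Return indices whose value exceeds mean + k * std of the preceding window.

    Hypothetical rolling z-score rule. min_std avoids a zero-width band when
    the history is perfectly flat, so any rise above a constant history is flagged.
    """
    flagged = []
    for i in range(window, len(values)):
        past = values[i - window:i]
        mu = mean(past)
        sigma = max(pstdev(past), min_std)
        if values[i] > mu + k * sigma:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Per-minute link-reset counts from a hypothetical switch port.
    link_resets = [0] * 60 + [4] + [0] * 10
    print(flag_anomalies(link_resets))  # [60]
```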