准确认识系统负载

发表于 2024/04/17

versions

作者 Rong Changyu 3 分钟阅读

linux系统有个uptime命令可以显示系统当前时间、系统已运行时间、登陆用户数以及系统平均负载，输出如下所示：

  
rongchangyu:~$ uptime
16:08:28 up 385 days,  2:47,  2 users,  load average: 0.59, 0.58, 0.51

load average后的数值就是系统负载，在公司实际生产环境中，load average也是基础监控中非常重要的一个指标，可以反应服务器的负载情况。我刚接触这个概念的时候，认为这三个指标的含义为：在过去1min、5min、15min内所有可运行的(Runable)进程平均数值，当数值大于CPU核心数时就表示系统负载偏高，CPU资源不足了！。

然而随着深入的学习以及使用后发现并非如此，load average的计算方法、表示的含义并非如此，或者说不准确，只有深入理解了load average的计算才能深入理解这个指标的意义，才能更直接有效的分析系统问题、优化系统性能。

定义

通过查看uptime的手册，在控制台执行man uptime可以看到对系统平均负载的定义：

System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk. The averages are taken over the three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.

可以从中看出系统平均负载的含义：

系统在一段时间内处于可运行或不可打断状态进程的平均值。
可运行状态进程表示正在使用CPU或者等待CPU的进程；不可打断状态的进程为正在等待某些IO完成的进程，比如磁盘。

从这个定义中我们可以得出uptime的三个数值分别为系统在过去1min、5min、15min内处于可运行或不可打断状态进程的平均值。可运行状态的进程好理解，当系统需要运行的进程数大于CPU核心数量，表示CPU忙不过来了，但是进程什么时候会处于uninterruptable呢？系统负载的计算为什么需要包含uninterruptable的进程以及这三个数值真的反应的是平均值吗？

带着这些疑问分别来探索一下uninterruptable进程、系统负载计算的部分源码。

Linux Uninterruptible Tasks

待完成…

系统性能/调优

Linux System Performance

本文由作者按照 CC BY 4.0 进行授权

定义

Linux Uninterruptible Tasks

热门标签