Linux Performance Metrics: CPU | 菜鸟的成长之路

System Load

The top and uptime commands both show the system load. For example:
```
$uptime
20:12:52 up 4 days, 18:49,  1 user,  load average: 0.00, 0.01, 0.05
```
The fields in the output are:
```
20:12:52                        // current system time
up 4 days, 18:49,               // system uptime
1 user,                         // number of logged-in users
load average: 0.00, 0.01, 0.05  // load averages over the past 1, 5 and 15 minutes
```
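If a script needs these three numbers, one option is to extract them from the uptime line. Below is a minimal Python sketch. It assumes the output format shown above (C locale, with a dot as the decimal separator), and the function name parse_load_averages is hypothetical. On Unix-like systems, the standard library's os.getloadavg() returns the same three values directly, which is usually the simpler route.

```python
import re


def parse_load_averages(uptime_line: str) -> tuple[float, float, float]:
    """Extract the 1-, 5- and 15-minute load averages from one line of uptime output."""
    match = re.search(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line)
    if match is None:
        raise ValueError(f"no load average found in: {uptime_line!r}")
    one, five, fifteen = (float(g) for g in match.groups())
    return one, five, fifteen


line = "20:12:52 up 4 days, 18:49,  1 user,  load average: 0.00, 0.01, 0.05"
print(parse_load_averages(line))  # (0.0, 0.01, 0.05)
```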
So what is the load average? The uptime man page (man uptime) explains it in detail:

System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk. The averages are taken over the three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.

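In other words, the load average is the average number of processes in the runnable or uninterruptible state over a given time window. It is not directly tied to CPU usage. The ideal situation is exactly one process running on each CPU, so that every CPU is fully used. For example, a load average of 1 means:
- on a single-CPU system, the CPU is exactly fully used
- on a 4-CPU system, the CPUs are 75% idle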
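The arithmetic behind the 1-CPU and 4-CPU example can be written as a small Python sketch. The helper names are hypothetical. The idle estimate is only the rough reading the man page gives: the load average also counts processes blocked in uninterruptible I/O waits, so it does not measure CPU idle time.

```python
def load_per_cpu(load_average: float, num_cpus: int) -> float:
    """Normalize a load average by the CPU count; 1.0 means every CPU is fully occupied."""
    if num_cpus <= 0:
        raise ValueError("num_cpus must be positive")
    return load_average / num_cpus


def estimated_idle_fraction(load_average: float, num_cpus: int) -> float:
    """Rough idle share implied by the load average, clamped to 0 once the system is overloaded."""
    return max(0.0, 1.0 - load_per_cpu(load_average, num_cpus))


print(estimated_idle_fraction(1.0, 1))  # 0.0: one CPU, fully used
print(estimated_idle_fraction(1.0, 4))  # 0.75: four CPUs, 75% idle
```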
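So the ideal load average equals the number of CPUs. Before you can judge a load value, you need to know how many CPUs the system has. You can get this from /proc/cpuinfo: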
```
$grep 'processor' /proc/cpuinfo | wc -l
2
```
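The same count can be done inside a Python script. In this sketch, the function names are hypothetical. Matching only lines that start with processor is slightly stricter than the grep above and is equivalent to grep -c '^processor' /proc/cpuinfo. The standard library's os.cpu_count() also reports the number of logical CPUs without parsing any file.

```python
import os


def count_processors(cpuinfo_text: str) -> int:
    """Count lines that start with 'processor'; there is one per logical CPU."""
    return sum(1 for line in cpuinfo_text.splitlines() if line.startswith("processor"))


def count_processors_from_file(path: str = "/proc/cpuinfo") -> int:
    """Read /proc/cpuinfo (Linux only) and count its processor entries."""
    with open(path) as f:
        return count_processors(f.read())


# os.cpu_count() may return None if the count cannot be determined
print(os.cpu_count())
```

os.cpu_count() counts all logical CPUs in the system, which can be more than the CPUs a given process is allowed to run on.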
The system reports 1-, 5- and 15-minute load averages. Which one should you look at? All of them!

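Taken together, the three load averages show how the system load is trending.
- If the three values are about the same, the load is stable.
- If the 1-minute value is much lower than the 15-minute value, the load has dropped over the last minute, but it was heavy over the past 15 minutes.
- If the 1-minute value is much higher than the 15-minute value, the load has risen over the last minute. The rise may be temporary or may continue, so keep watching it. Once the 1-minute load average approaches or exceeds the number of CPUs, the system is about to become overloaded: investigate the cause and look for ways to optimize.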
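The rules of thumb above can be turned into a simple check. This Python sketch is only illustrative. The text defines neither "much larger" nor "close to the CPU count", so the ratio of 2.0 and the 0.9 margin below are assumed values, and the function name is hypothetical.

```python
def describe_load_trend(load1: float, load15: float, num_cpus: int,
                        ratio: float = 2.0) -> tuple[str, bool]:
    """Return (trend, near_overload) from the 1- and 15-minute load averages."""
    # `ratio` is an assumed cut-off for "much larger" / "much smaller"
    if load1 > load15 * ratio:
        trend = "rising"
    elif load1 * ratio < load15:
        trend = "falling"
    else:
        trend = "steady"
    # 0.9 is an assumed margin for "close to the number of CPUs"
    near_overload = load1 >= 0.9 * num_cpus
    return trend, near_overload


print(describe_load_trend(3.2, 1.1, 2))  # ('rising', True)
```

With very small loads, such as the 0.00 and 0.05 in the uptime example, the ratio test reacts to noise. In practice you might ignore the trend until the values are large enough to matter.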

CPU Usage

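CPU usage is the percentage of total CPU time that was not idle over a given period. It does not always track the load average:
- For CPU-bound processes, heavy CPU use raises the load average, so the two agree.
- For I/O-bound processes, waiting for I/O also raises the load average, but CPU usage does not necessarily rise.
- Many processes waiting to be scheduled on the CPU also raise the load average, and in this case CPU usage is high as well.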

CPU usage can be computed from /proc/stat:

```
$cat /proc/stat
cpu  69184 164 66663 84909542 3705 0 777 0 0 0
cpu0 39599 75 42226 42443240 2909 0 267 0 0 0
cpu1 29585 89 24437 42466301 795 0 509 0 0 0
intr 23997881 47 10 0 0 0 0 0 0 1 0 0 0 16 0 0 ........
.................
softirq 16878127 0 7208724 85064 1158150 260408 0 50964 3994226 0 4120591
```

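The cpu line is the total across all CPUs (here cpu0 + cpu1, up to small rounding differences). Each value is the time spent in that state, accumulated since boot. The unit is jiffies; strictly, the unit is USER_HZ, which is typically 1/100 of a second.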

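| Column | Meaning |
|---|---|
| 1: user (69184) | User-mode CPU time, excluding processes with a positive nice value |
| 2: nice (164) | User-mode CPU time of processes with a positive nice value (low priority) |
| 3: system (66663) | Kernel-mode CPU time |
| 4: idle (84909542) | Idle time, excluding time spent waiting for I/O |
| 5: iowait (3705) | Idle time spent waiting for I/O to complete |
| 6: irq (0) | Time servicing hardware interrupts |
| 7: softirq (777) | Time servicing software interrupts (softirqs) |
| 8: steal (0) | Time stolen by the hypervisor for other virtual machines |
| 9: guest (0) | Time running virtual CPUs for guest operating systems (also counted in user) |
| 10: guest_nice (0) | Time running niced guests (also counted in nice) |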
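To work with these columns by name, the cpu lines can be parsed into dictionaries. Below is a minimal Python sketch. Columns 1 to 7 take their names from the table above. Columns 8 to 10 (steal, guest, guest_nice) are named as in the proc(5) man page. parse_proc_stat is a hypothetical helper. Older kernels print fewer columns, which zip() tolerates.

```python
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq",
              "steal", "guest", "guest_nice")


def parse_proc_stat(text: str) -> dict[str, dict[str, int]]:
    """Map 'cpu', 'cpu0', 'cpu1', ... to their named counters (in USER_HZ ticks)."""
    stats: dict[str, dict[str, int]] = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only the aggregate 'cpu' line and the per-CPU 'cpuN' lines
        if not parts or not parts[0].startswith("cpu"):
            continue
        values = [int(v) for v in parts[1:]]
        # zip() stops at the shorter sequence, so kernels with fewer columns still work
        stats[parts[0]] = dict(zip(CPU_FIELDS, values))
    return stats


sample = """cpu  69184 164 66663 84909542 3705 0 777 0 0 0
cpu0 39599 75 42226 42443240 2909 0 267 0 0 0
cpu1 29585 89 24437 42466301 795 0 509 0 0 0
softirq 16878127 0 7208724 85064 1158150 260408 0 50964 3994226 0 4120591"""
stats = parse_proc_stat(sample)
print(stats["cpu"]["user"] == stats["cpu0"]["user"] + stats["cpu1"]["user"])  # True
```

In this sample, user adds up exactly. idle does not: cpu0 + cpu1 gives 84909541 against 84909542 on the cpu line. This is why the aggregate is best read as a sum "up to rounding".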

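Because /proc/stat holds values accumulated since boot, you can sample it at two points, t1 and t2, and compare the samples. The difference gives the CPU usage over the interval t2 - t1. When t2 - t1 is small, the result approximates the instantaneous CPU usage.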

Total CPU time between t1 and t2 = (user2 + nice2 + system2 + idle2 + iowait2 + irq2 + softirq2) - (user1 + nice1 + system1 + idle1 + iowait1 + irq1 + softirq1)

Idle CPU time between t1 and t2 = idle2 - idle1

CPU utilization between t1 and t2 = 1 - idle CPU time / total CPU time
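In code, the calculation amounts to sampling the aggregate cpu line twice and applying the formula above. The Python sketch below assumes a Linux /proc/stat. Like the formula, it uses only the first seven columns and counts iowait as busy time. The function names are hypothetical.

```python
import time


def read_cpu_counters(path: str = "/proc/stat") -> list[int]:
    """Return the counters on the aggregate 'cpu' line, the first line of /proc/stat."""
    with open(path) as f:
        fields = f.readline().split()
    return [int(v) for v in fields[1:]]


def cpu_utilization(t1: list[int], t2: list[int]) -> float:
    """1 - idle time / total time between two samples, using user..softirq (columns 1-7)."""
    total = sum(t2[:7]) - sum(t1[:7])
    idle = t2[3] - t1[3]  # index 3 is the idle column
    if total <= 0:
        raise ValueError("no CPU time elapsed between the two samples")
    return 1 - idle / total


def sample_cpu_utilization(interval: float = 1.0) -> float:
    """Take two samples `interval` seconds apart and return the utilization in between."""
    first = read_cpu_counters()
    time.sleep(interval)
    second = read_cpu_counters()
    return cpu_utilization(first, second)
```

On a Linux machine, print(f"{sample_cpu_utilization(1.0):.1%}") prints the usage over the last second. A shorter interval gets closer to instantaneous usage, but it also makes the result coarser. At USER_HZ = 100, each CPU contributes only about 100 ticks per second to the cpu line.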