Imagine your server is in trauma and you could figure out what's causing it in under one minute. Obviously, the preference is an observability platform – but for my little WordPress site I don't really have the budget. So I just use a few tools to isolate common issues. The idea behind this post is to quickly isolate the fault by looking for errors and saturation metrics, as they are both easy to interpret, and then check overall resource utilisation.
Note: Some of these commands require the sysstat package installed.
1. uptime
Might seem like an odd choice, but uptime actually provides more than just uptime. It is a quick way to view the load averages over the last 1, 5 and 15 minutes, which indicate the number of tasks wanting to run (on Linux this includes tasks blocked in uninterruptible disk I/O, not just those waiting for CPU).
~$ uptime
08:38:49 up 87 days, 18:31, 1 user, load average: 70.34, 25.02, 0.00
The last three numbers are the load averages over 1, 5 and 15 minute windows, and they show a marked increase: essentially nothing 15 minutes ago, around 25 averaged over the last 5 minutes, and over 70 in the last minute. So something is definitely going on… maybe my web site went viral!!!
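Load averages only mean something relative to how many CPUs you have, so it is worth comparing them against the core count. nproc is part of coreutils so it should already be on the box; the output below is just what my two-core host reports:
## How many CPUs does this host have?
$ nproc
2
A 1-minute load of 70 on a 2-CPU box means a long queue of tasks waiting for their turn, which is worth digging into with the commands that follow.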
2. dmesg | tail
dmesg | tail views the last 10 kernel messages (if there are any). Look for errors that can cause performance issues; classic examples are the oom-killer firing and TCP dropping requests, though nothing that dramatic shows up in the example below. If you don't know what a message means, Google is your friend. Note: you can change the number of lines shown by passing a count to tail (e.g. tail -n 20), and you don't need sudo if your host is properly set up (unlike mine).
$ sudo dmesg | tail
[ 3.453000] audit: type=1400 audit(1661436454.032:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/haveged" pid=293 comm="apparmor_parser"
[ 3.466526] audit: type=1400 audit(1661436454.044:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=294 comm="apparmor_parser"
[ 3.482004] audit: type=1400 audit(1661436454.044:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=294 comm="apparmor_parser"
[ 3.496937] audit: type=1400 audit(1661436454.044:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=294 comm="apparmor_parser"
[ 3.510178] audit: type=1400 audit(1661436454.084:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" pid=292 comm="apparmor_parser"
[ 4.310300] IPv6: ADDRCONF(NETDEV_UP): ens5: link is not ready
[ 5.223697] IPv6: ADDRCONF(NETDEV_CHANGE): ens5: link becomes ready
[ 24.859623] Adding 649996k swap on /mnt/.bitnami.swap. Priority:-2 extents:15 across:1321740k SSFS
[1440586.071042] device-mapper: uevent: version 1.0.3
[1440586.075493] device-mapper: ioctl: 4.39.0-ioctl (2018-04-03) initialised: dm-devel@redhat.com
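If the raw ring-buffer timestamps are hard to relate to wall-clock time, the util-linux version of dmesg (which current distros ship) can print human-readable dates and filter by severity; a quick sketch:
## Human-readable timestamps, warnings and errors only
$ sudo dmesg -T --level=err,warn | tail -n 20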
3. vmstat 1
vmstat is short for virtual memory statistics. vmstat was run with an argument of 1 which means it will print rolling one second summaries (until you hit Ctrl + C). The first line of output (in this version of vmstat) has some columns that show the average since boot, instead of the previous second. For now, skip the first line, unless you want to learn and remember which column is which.
Columns to check:
- r: Number of processes running on CPU and waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an “r” value greater than the CPU count is overloaded/saturated.
- free: Free memory in kilobytes. If there are too many digits to count, you have enough free memory. The “free -m” command, included as command 7, better explains the state of free memory.
- si, so: Swap-ins and swap-outs. If these are non-zero, you’re out of memory.
- us, sy, id, wa, st: These are breakdowns of CPU time, on average across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or with Xen, the guest’s own isolated driver domain).
The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle.
System time is necessary for I/O processing. A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.
In the example below, the CPUs are almost entirely idle (id at 99 or 100), nothing is queued (r is 0) and there is no swapping activity (si/so are 0), so my little server isn't CPU-bound at all. If user time were high instead, that would point to application-level usage; either way, check the degree of saturation using the “r” column.
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 322104 122480 130724 310440 0 0 1 8 2 1 0 0 100 0 0
0 0 322104 122472 130724 310440 0 0 0 4 205 397 0 0 100 0 0
0 0 322104 122472 130724 310440 0 0 0 0 187 379 0 1 100 0 0
0 0 322104 122472 130724 310440 0 0 0 0 179 373 0 0 100 0 0
0 0 322104 122472 130724 310440 0 0 0 4 187 381 0 0 99 0 0
0 0 322104 134736 130724 310440 0 0 0 0 209 391 0 1 99 0 0
0 0 322104 143156 130724 310440 0 0 0 0 176 374 0 0 100 0 0
0 0 322104 143156 130728 310440 0 0 0 28 178 366 0 0 100 0 0
0 0 322104 143156 130728 310440 0 0 0 0 171 372 0 0 100 0 0
## Now view free memory
$ free -m
total used free shared buff/cache available
Mem: 961 393 132 61 436 351
Swap: 634 314 320
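Side note: the memory columns in vmstat above are in kilobytes. If those numbers get too long to eyeball, procps vmstat can be told to report in other units via -S (check your man page, but the flag has been around for a long time):
## Same rolling output, but memory columns in mebibytes
$ vmstat -S M 1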
4. mpstat -P ALL 1
This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a saturated single-threaded application. Nothing of note in the example below.
## mpstat is part of sysstat - so might not be installed
$ sudo apt-get install sysstat
## Now run mpstat every second
$ mpstat -P ALL 1
Linux 4.19.0-21-cloud-amd64 (ip-172-31-20-121) 11/21/2022 _x86_64_ (2 CPU)
10:08:22 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:08:23 AM all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:08:23 AM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:08:23 AM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
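If one CPU is pegged while the others idle, the next question is which process or thread is pinned to it. pidstat (the next command) can break its report down per thread with -t, and its CPU column shows which processor each thread last ran on; a quick sketch:
## Per-thread CPU usage; the CPU column shows the processor each thread is on
$ pidstat -t 1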
5. pidstat 1
Pidstat is a little like top’s per-process summary, but prints a rolling summary instead of clearing the screen. This can be useful for watching patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation.
The below example identifies two java processes as responsible for consuming CPU. The %CPU column is the total across all CPUs; 1591% shows that that java process is consuming almost 16 CPUs.
$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave
07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java
07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java
07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java
07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat
07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave
07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java
07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java
07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
^C
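pidstat can also attribute disk I/O to processes, which is handy context before moving on to iostat in the next step. The -d flag swaps the CPU columns for per-process read/write rates (it relies on the kernel's per-task I/O accounting, which mainstream distro kernels enable):
## Per-process disk reads and writes in kB/s
$ pidstat -d 1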
6. iostat -xz 1
$ iostat -xz 5
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
73.96 0.00 3.73 0.03 0.06 22.21
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.23 0.21 0.18 4.52 2.08 34.37 0.00 9.98 13.80 5.42 2.44 0.09
xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 0.00 0.43 1.78 0.28 0.25 0.25
xvdc 0.01 0.00 1.02 8.86 127.79 595.94 146.50 0.00 0.45 1.82 0.30 0.27 0.26
dm-0 0.00 0.00 0.69 2.32 10.47 31.69 28.01 0.01 3.23 0.71 3.98 0.13 0.04
dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00
dm-2 0.00 0.00 0.09 0.07 1.35 0.36 22.50 0.00 2.55 0.23 5.62 1.78 0.03
[...]
^C
This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance; -x asks for extended per-device statistics and -z hides devices with no activity in the interval. Look for:
- r/s, w/s, rkB/s, wkB/s: These are the delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these for workload characterization. A performance problem may simply be due to an excessive load applied.
- await: The average time for the I/O in milliseconds. This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.
- avgqu-sz: The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks.)
- %util: Device utilization. This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device. Values close to 100% usually indicate saturation.
If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated, and may be able to handle much more work.
Bear in mind that poor performing disk I/O isn’t necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes).
7. free -m
$ free -m
total used free shared buffers cached
Mem: 245998 24545 221453 83 59 541
-/+ buffers/cache: 23944 222053
Swap: 0 0 0
The right two columns show:
- buffers: For the buffer cache, used for block device I/O.
- cached: For the page cache, used by file systems.
We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. The above example looks fine, with many Mbytes in each.
The “-/+ buffers/cache” line provides less confusing values for used and free memory. Linux uses free memory for the caches, but can reclaim it quickly if applications need it, so in a way the cached memory should be included in the free memory column, which is what this line does. Newer versions of free drop this line and report an “available” column instead, as in the free -m output back under command 3. There's even a website, linuxatemyram, about this confusion.
It can be additionally confusing if ZFS on Linux is used, as we do for some services, as ZFS has its own file system cache that isn’t reflected properly by the free -m columns. It can appear that the system is low on free memory, when that memory is in fact available for use from the ZFS cache as needed.
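If you suspect that is what is happening, ZFS on Linux exposes the ARC size under /proc; a rough sketch (the exact kstat layout can vary between ZFS versions):
## Current ZFS ARC size (the value is in bytes; convert to MiB)
$ awk '$1 == "size" {print $3/1048576 " MiB"}' /proc/spl/kstat/zfs/arcstats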
8. sar -n DEV 1
Use this tool to check network interface throughput: rxkB/s and txkB/s, as a measure of workload, and also to check if any limit has been reached. In the example below, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under, say, a 1 Gbit/sec limit).
This version also has %ifutil for device utilization (max of both directions for full duplex), which can also be measured with Brendan Gregg's nicstat tool. And, like nicstat, it is hard to get right and doesn't seem to be working in this example (0.00).
$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00
12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00
12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00
12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00
12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
^C
9. sar -n TCP,ETCP 1
This is a summarized view of some key TCP metrics. These include:
- active/s: Number of locally-initiated TCP connections per second (e.g., via connect()).
- passive/s: Number of remotely-initiated TCP connections per second (e.g., via accept()).
- retrans/s: Number of TCP retransmits per second.
The active and passive counts are often useful as a rough measure of server load: number of new accepted connections (passive), and number of downstream connections (active). It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true (e.g., consider a localhost to localhost connection).
Retransmits are a sign of a network or server issue; it may be an unreliable network (e.g., the public Internet), or it may be due to a server being overloaded and dropping packets. The example below shows just one new TCP connection per second.
$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:17:19 AM active/s passive/s iseg/s oseg/s
12:17:20 AM 1.00 0.00 10233.00 18846.00
12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:20 AM 0.00 0.00 0.00 0.00 0.00
12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:21 AM 1.00 0.00 8359.00 6039.00
12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:21 AM 0.00 0.00 0.00 0.00 0.00
^C
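If retrans/s is climbing and you want to know which connections are suffering, ss (from iproute2) can dump per-socket TCP internals; sockets that have retransmitted show a retrans:X/Y field. A rough way to eyeball it:
## Established sockets with their TCP internals, filtered to ones showing retransmits
$ ss -ti state established | grep -B1 retrans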
10. netstat
For a proper rummage into the network, you can't really beat netstat. Below are a few useful calls, including a quick summary, followed by picking out MTU issues. First get a summary:
$ netstat -s
Ip:
Forwarding: 2
5143907 total packets received
4 with invalid addresses
0 forwarded
0 incoming packets discarded
5143854 incoming packets delivered
5546420 requests sent out
Icmp:
456 ICMP messages received
25 input ICMP message failed
ICMP input histogram:
destination unreachable: 446
timeout in transit: 10
0 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
IcmpMsg:
InType3: 446
InType11: 10
Tcp:
127397 active connection openings
334839 passive connection openings
16631 failed connection attempts
68477 connection resets received
1 connections established
4973994 segments received
6069615 segments sent out
875032 segments retransmitted
3229 bad segments received
92637 resets sent
InCsumErrors: 3224
Udp:
169404 packets received
0 packets to unknown port received
0 packet receive errors
169404 packets sent
0 receive buffer errors
0 send buffer errors
UdpLite:
TcpExt:
16631 resets received for embryonic SYN_RECV sockets
124 packets pruned from receive queue because of socket buffer overrun
26 ICMP packets dropped because they were out-of-window
175789 TCP sockets finished time wait in fast timer
309 packetes rejected in established connections because of timestamp
123801 delayed acks sent
132 delayed acks further delayed because of locked socket
Quick ack mode was activated 6253 times
22 SYNs to LISTEN sockets dropped
704920 packet headers predicted
1298057 acknowledgments not containing data payload received
443211 predicted acknowledgments
6 times recovered from packet loss due to fast retransmit
TCPSackRecovery: 2370
TCPSACKReneging: 5
Detected reordering 7817 times using SACK
Detected reordering 341 times using reno fast retransmit
Detected reordering 257 times using time stamp
98 congestion windows fully recovered without slow start
250 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 106
643 congestion windows recovered without slow start after partial ack
TCPLostRetransmit: 9844
23 timeouts after reno fast retransmit
TCPSackFailures: 336
357 timeouts in loss state
4267 fast retransmits
1711 retransmits in slow start
TCPTimeouts: 765878
TCPLossProbes: 11885
TCPLossProbeRecovery: 2984
TCPSackRecoveryFail: 408
TCPDSACKOldSent: 6375
TCPDSACKOfoSent: 32
TCPDSACKRecv: 3057
TCPDSACKOfoRecv: 31
4001 connections reset due to unexpected data
62649 connections reset due to early user close
1185 connections aborted due to timeout
TCPDSACKIgnoredOld: 16
TCPDSACKIgnoredNoUndo: 1905
TCPSpuriousRTOs: 30
TCPSackShifted: 704
TCPSackMerged: 2102
TCPSackShiftFallback: 17281
TCPBacklogDrop: 3
TCPDeferAcceptDrop: 276406
TCPRcvCoalesce: 230111
TCPOFOQueue: 3829
TCPOFOMerge: 32
TCPChallengeACK: 1117
TCPSYNChallenge: 5
TCPFastOpenCookieReqd: 7
TCPSpuriousRtxHostQueues: 2
TCPAutoCorking: 91106
TCPFromZeroWindowAdv: 29
TCPToZeroWindowAdv: 29
TCPWantZeroWindowAdv: 301
TCPSynRetrans: 845110
TCPOrigDataSent: 3320995
TCPHystartTrainDetect: 113
TCPHystartTrainCwnd: 3689
TCPHystartDelayDetect: 53
TCPHystartDelayCwnd: 2057
TCPACKSkippedSynRecv: 387
TCPACKSkippedPAWS: 145
TCPACKSkippedSeq: 418
TCPACKSkippedTimeWait: 179
TCPACKSkippedChallenge: 116
TCPWinProbe: 25
TCPDelivered: 3265480
TCPDeliveredCE: 4
TCPAckCompressed: 13
IpExt:
InOctets: 1448714037
OutOctets: 3058374840
InNoECTPkts: 5355501
InECT1Pkts: 298
InECT0Pkts: 63984
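As an aside, netstat comes from the legacy net-tools package, which some newer distros no longer install by default. If it's missing, nstat from iproute2 reads the same kernel counters (by default it prints non-zero counters and shows the change since it was last run):
## Same kernel SNMP/TCP counters, via iproute2
$ nstat | head -n 20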
Now take a look at the MTU and the received/transmitted packet counters in the kernel interface table:
$ netstat -i
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
ens5 9001 5188212 0 0 0 6103306 0 0 0 BMRU
lo 65536 434754 0 0 0 434754 0 0 0 LRU
If you want to quickly test the route to a server with respect to MTU, send a large ping with the don't-fragment bit set and see if it gets through. Below I am sending a 1490-byte payload to example.com (and it's successful); the sketch after the output shows how to force the don't-fragment bit explicitly.
$ ping -s 1490 example.com
PING example.com (93.184.216.119) 1490(1518) bytes of data.
1498 bytes from 93.184.216.119: icmp_seq=1 ttl=51 time=1119 ms
1498 bytes from 93.184.216.119: icmp_seq=2 ttl=51 time=1130 ms
1498 bytes from 93.184.216.119: icmp_seq=3 ttl=51 time=1260 ms
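On Linux, iputils ping takes -M do to forbid fragmentation along the path; pairing that with a 1472-byte payload (1472 data + 8 ICMP + 20 IP = 1500 bytes) probes the common 1500-byte MTU. If a hop can't carry the packet you will see errors instead of replies:
## Don't-fragment probe of a standard 1500-byte MTU
$ ping -M do -s 1472 -c 3 example.com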
11. top
The top command includes many of the metrics we checked earlier. It can be handy to run it to see if anything looks wildly different from the earlier commands, which would indicate that load is variable.
A downside to top is that it is harder to see patterns over time, which may be clearer in tools like vmstat and pidstat that provide rolling output. Evidence of intermittent issues can also be lost if you don't pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue) and the screen clears. Batch mode, sketched at the end of this section, is a workaround when you want to keep a copy of what you saw.
$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched
- PID: The task’s unique process id.
- PR: The process’s priority. The lower the number, the higher the priority.
- VIRT: Total virtual memory used by the task.
- USER: User name of the task’s owner.
- %CPU: The task’s CPU usage as a percentage of a single CPU, which is why the busy java process above can show 3090%.
- TIME+: CPU time, the same as ‘TIME’ but with more granularity (hundredths of a second).
- SHR: Shared memory size (kb) used by the task.
- NI: The task’s nice value. A negative nice value implies higher priority, a positive one lower priority.
- %MEM: The task’s share of physical memory.
- RES: How much physical RAM the process is using, measured in kilobytes.
- COMMAND: The name of the command that started the process.
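As mentioned above, capturing interactive top output for your notes is awkward; batch mode writes plain text to stdout instead, which copies and pipes cleanly (both flags are standard procps top):
## One non-interactive snapshot, handy for copy-and-paste or redirecting to a file
$ top -b -n 1 | head -n 15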
Follow-on Analysis
There are many more commands and methodologies you can apply to drill deeper. See Brendan’s Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.
Tackling system reliability and performance problems at web scale is one of our passions. If you would like to join us in tackling these kinds of challenges, we are hiring!