There are many reasons why a program under Unix/Linux might run unexpectedly slowly when moved to new hardware (and it's nothing to do with Halloween!), particularly for C or C++. That is, the program normally runs well, but for some reason in production it behaves badly. Here is what I personally do to try to work out such problems, particularly on a system I am not familiar with.
Can I Repeat the Problem Reliably?
Do you know how to repeat the problem? It is a huge step forwards if you can! It is worth investing a bit of time here to try and get a repeatable test case.
Does the program work in one environment and not another? For example, does it work in development but not in production? If so, what is different between the two environments? The network is likely to differ (e.g. different firewalls). Also compare the hardware: memory, CPU power, and disk subsystems. This might give a clue.
If you are using a library as part of a larger application and you have evidence (e.g. print statements in the code) indicating that library is the problem, can you build a small test case to repeat the problem? If so, you have just eliminated the rest of the application.
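Once you have a repeatable case, it helps to put numbers on it so runs in different environments can be compared directly. Here is a minimal sketch of a timing harness; the command being timed is a placeholder for your own reproduction case, and it uses GNU date's `%N` (nanoseconds), so it is Linux-specific as written.

```shell
# Run a candidate test case several times and print wall-clock
# milliseconds per run, so figures from two environments can be
# compared side by side.
time_runs() {
    cmd="$1"
    runs="${2:-3}"
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s%N)              # nanoseconds since epoch (GNU date)
        sh -c "$cmd" > /dev/null 2>&1
        end=$(date +%s%N)
        echo "run $i: $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
}

# Example: three timed runs of a trivial stand-in command
time_runs "sleep 0.1" 3
```

Large run-to-run variance is itself a clue: it suggests the program is competing for some shared resource rather than being uniformly slow.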
What is the Bottleneck?
I like starting with understanding the bottleneck. This often gives a great clue as to where the root problem is.
- Is it CPU? Is there something about what the program is doing that sometimes causes it to run slowly (e.g. an N-cubed algorithm whose running time blows up for particular forms of input)?
- Is it memory? Maybe the machine has run out of memory and is thrashing.
- Is it disk? Maybe the disk is full (writes blocking/failing) or nearly full (some file systems slow down when nearly full). Or maybe there is too much disk I/O going on.
- Is it network? Network throughput is rarely an issue these days, but if a program is running poorly in a production environment, could DNS or firewalls be causing network timeouts?
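A quick first pass over these suspects can be done in a few commands. This sketch assumes Linux (it reads `/proc/meminfo`); on other Unixes substitute your platform's equivalents.

```shell
# Quick triage: load, disk space, and memory pressure in one shot.
uptime        # load average: compare against the number of cores
df -h /       # is the disk full or nearly full?
# What fraction of RAM is still available? (Linux-specific.)
awk '/^MemTotal:/     { total = $2 }
     /^MemAvailable:/ { avail = $2 }
     END { printf "memory available: %.0f%%\n", 100 * avail / total }' /proc/meminfo
```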
The next step is to try and work out which of the above situations is going on.
Running a command such as ‘top’ is an easy starting point. Is the program using up all of a CPU? Is something else? If a program runs OK on one machine but not on another, I have normally found the problem not to be CPU related. But top is easy to run – worth watching for a little while.
I like giving good old ‘vmstat 5’ a go next. This prints out a line every 5 seconds. Ignore the first line (it reports averages since boot). After that I start reading from the last 3 or 4 CPU columns. These are percentages: ‘us’ is user time, ‘sy’ is system time, ‘id’ is idle time, and ‘wa’ is I/O wait time. If user time is high, it might be a CPU problem. If system time is high, the program may be constrained by operating system performance for some reason (faulty hardware, or badly configured operating system parameters). If idle time is high, then I start looking at network delays or similar (the program is not using resources – it must be blocked waiting for something). If wait time is high, I move on to investigate disk I/O (anything over about 5% I consider worthy of further investigation).
kthr    memory            page                      faults        cpu
----- ----------- ------------------------ ------------- -----------
 r  b    avm  fre  re  pi  po  fr  sr cy    in    sy  cs us sy id wa
 1  3 113726  124   0  14   6 151 600  0   521  5533 816 73 13  7  7
 0  3 113643  346   0   2  14 208 690  0   585  2201 866 16  9  2 73
 0  3 113659  135   0   2   2 108 323  0   516  1563 797 25  7 62  6
 0  2 113661  122   0   3   2 120 375  0   527  1622 871 13  7 72  9
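Scanning those columns by eye gets tedious over a long run, so here is a sketch of doing it with awk: flag any sample with high I/O wait or system time. In live use you would pipe vmstat straight in (`vmstat 5 | flag_vmstat`); below a captured fragment stands in. The thresholds are arbitrary starting points, and the column positions (us=14, sy=15, id=16, wa=17) match the sample output above – they differ between platforms, so check your own vmstat's header.

```shell
# Flag vmstat samples where I/O wait or system time looks high.
flag_vmstat() {
    awk '$1 ~ /^[0-9]+$/ {               # data lines start with a number
        if (++n == 1) next               # skip first sample (averages since boot)
        us = $14; sy = $15; id = $16; wa = $17
        note = ""
        if (wa > 5)  note = note " high-iowait"
        if (sy > 20) note = note " high-system-time"
        if (note != "")
            print "sample " n ":" note " (us=" us " sy=" sy " id=" id " wa=" wa ")"
    }'
}

flag_vmstat <<'EOF'
 1 3 113726  124   0  14   6 151  600  0 521  5533 816 73 13  7  7
 0 3 113643  346   0   2  14 208  690  0 585  2201 866 16  9  2 73
 0 3 113659  135   0   2   2 108  323  0 516  1563 797 25  7 62  6
EOF
```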
If system time is high, the ‘in’ (interrupts) and ‘cs’ (context switches) columns are worth a look. You need to work out what normal levels on your machine are, then run the program. If context switches jump too high (sorry, I don’t have an absolute number! It depends on your machine – e.g. number of cores) then the machine may have too many processes running. This might be a sign of a process forking out of control.
The two swap columns ‘pi’/‘si’ (page or swap in) and ‘po’/‘so’ (page or swap out) are my next port of call. If these figures spike, it is a sign the program may be running out of memory (particularly the page out rates). I have a look at the wait time as well, as high page rates with high wait time is an indication of thrashing.
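The same trick works for the paging columns: sustained non-zero page-outs, especially together with high wait time, is the thrashing signature described above. Column positions again follow the sample output (pi=6, po=7, wa=17) and will vary by platform; the fragment fed in below is from the earlier sample.

```shell
# Flag vmstat samples showing page-out activity.
flag_paging() {
    awk '$1 ~ /^[0-9]+$/ {
        if (++n == 1) next               # skip first sample (averages since boot)
        pi = $6; po = $7; wa = $17
        if (po > 0)
            print "sample " n ": paging out (pi=" pi " po=" po " wa=" wa ")"
    }'
}

flag_paging <<'EOF'
 1 3 113726  124   0  14   6 151  600  0 521  5533 816 73 13  7  7
 0 3 113643  346   0   2  14 208  690  0 585  2201 866 16  9  2 73
EOF
```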
I have not used the memory and io columns that much; the other columns have delivered the most value for me.
If disk performance is a problem, consider giving ‘iostat 5’ a run.
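If iostat is not installed, a rough Linux-only substitute is to sample `/proc/diskstats` twice and report the deltas per device. This is a sketch, not a replacement for iostat's rates and utilisation figures; the field numbers follow the kernel's diskstats format (field 3 is the device name, field 6 sectors read, field 10 sectors written).

```shell
# Report per-device sector deltas over an interval (Linux only).
disk_delta() {
    interval="${1:-5}"
    snap=$(mktemp)
    cat /proc/diskstats > "$snap"
    sleep "$interval"
    awk 'NR == FNR { r[$3] = $6; w[$3] = $10; next }   # first pass: snapshot
         { dr = $6 - r[$3]; dw = $10 - w[$3]
           if (dr > 0 || dw > 0)
               print $3 ": " dr " sectors read, " dw " sectors written" }' \
        "$snap" /proc/diskstats
    rm -f "$snap"
}

disk_delta 1
```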
strace / truss
strace (Linux) or truss (on some other versions of Unix, such as Solaris) lets you attach to a running process and print out the system calls the process is making. This can help identify high system call situations (what calls are being made?). You can put ‘strace’ in front of the command you want to run to display trace output for the whole run. If it is a long-running process, work out the process id using a ‘ps’ or ‘top’ command, then run ‘strace -p 1234’ (where 1234 is the process id). You can kill the strace command when you are finished. Note that strace adds significant overhead, so the traced process will run more slowly.
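One way to mine the trace for delays is to run strace with -T, which appends the time spent in each call in angle brackets (e.g. `strace -T -o trace.out ./prog`), then sort by that figure. The trace fragment below is illustrative only, not output from any real run.

```shell
# Print the five slowest system calls from an strace -T output file.
slowest_calls() {
    awk 'match($0, /<[0-9]+\.[0-9]+>$/) {
             t = substr($0, RSTART + 1, RLENGTH - 2)   # strip the < >
             print t, $0
         }' "$1" | sort -rn | head -5
}

# Illustrative fragment standing in for real strace -T output:
cat > /tmp/trace.example <<'EOF'
connect(3, {sa_family=AF_INET, ...}, 16) = 0 <5.002134>
read(4, "data", 4096) = 4 <0.000087>
open("/etc/hosts", O_RDONLY) = 4 <0.000150>
EOF
slowest_calls /tmp/trace.example
```

For a quicker first look, `strace -c` prints a summary table of call counts and total time per system call.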
The output of the strace command will show you the system calls the process is making. If you notice a delay after a particular call, that is a good one to go investigate. This can be a good way to spot network problems, for example. Is it slowing down trying to open a socket connection – could it be a firewall issue? What about a DNS lookup? (It has been surprising to me in new hardware installations how often the problem has been DNS timeouts. It's not CPU bound, it's not thrashing, it's not I/O bound – bad DNS or firewall configuration can insert unexpected delays that don't stop the program: the lookup times out, tries a secondary server, and works. It just runs slow.)
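A quick way to test the DNS-timeout theory directly is to time a lookup via getent, which goes through the same resolver path (nsswitch.conf) as most programs, unlike dig or nslookup. A first lookup taking whole seconds while a repeat is near-instant is the classic dead-primary-server signature. This sketch uses GNU date's `%N`, so it is Linux-specific as written.

```shell
# Time name resolution through the system resolver.
time_lookup() {
    start=$(date +%s%N)
    getent hosts "$1" > /dev/null
    end=$(date +%s%N)
    echo "$1: $(( (end - start) / 1000000 )) ms"
}

time_lookup localhost
time_lookup localhost    # repeat: should be near-instant
```

Substitute a hostname your program actually connects to; localhost here is just a safe stand-in.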
The above steps are not perfect or guaranteed to find the fault, but I have found them a good starting point. Overall the strategy is to quickly eliminate options and reduce the total search space. If you can prove it's not CPU bound, you immediately know to look at other aspects.
There are naturally lots of other techniques. What are some of your favorites? Feel free to leave a comment!