Diagnosing failures in your results¶
This document describes how to triage your Autotest results and find out what went wrong.
Basics¶
Often when tests fail there are a number of factors that could have come into play. Below are a few things to consider:
- Baseline
- What changed between tests
- Look at the raw results
Having a baseline is an absolute must:
- Have you run these tests on this particular system before?
- Did it pass without any issues?
These are questions you should be asking yourself. If you do not have a baseline, that is the first thing to establish. It really is as simple as running a job and making note of the results, as sketched below.
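If you have never run the tests on the machine, a baseline run can be as simple as the following sketch (paths assume the default client installation under /usr/local/autotest used later in this document; sleeptest is a lightweight test shipped with the Autotest client):
# run a lightweight, known-good test on this machine
cd /usr/local/autotest
client/autotest-local --verbose run sleeptest
# keep a copy of the results directory for later comparison
cp -r client/results/default client/results/baseline
Keeping a copy of the results directory gives you something concrete to compare against when a later run fails.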
Often when tests fail, people do not consider what changed between runs. Any change whatsoever is worth noting: from something big, like changing the kernel, to something small, like moving the system to a different area that may have affected its cooling.
Lastly, if nothing has changed and you have established a baseline for your machines, it is time to delve into the results.
Looking at raw results¶
There are a few key areas worth looking at when evaluating what could have gone wrong with your job. From the View Job tab, click on the raw results log. You will be presented with a directory structure representing your job's flat files. If you created a job with multiple machines, there will be an individual directory for each machine. Navigate to the machine you want to investigate.
The debug directory¶
All tests run, including the main Autotest job, have a debug directory. Here you will find the majority of the information you need to diagnose issues with tests.
The following files in the debug directory will give you insight into what Autotest was doing at the time:
debug/
├── build_log.gz
├── client.DEBUG
├── client.ERROR
├── client.INFO
└── client.WARNING
If you have console support (via conmux) you should also take a look at conmux.log.
If at any point Autotest produced a stack trace, *.ERROR will most likely contain it. That is a good place to start if the test run failed and you want to see whether Autotest itself was at fault. If both of these files are clean, next go to the <hostname>/test/ directory.
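A quick way to scan these files from the shell (a sketch; the file names match the debug/ listing above):
# search the client logs for Python tracebacks
grep -n Traceback debug/client.ERROR debug/client.DEBUG
# build_log is gzipped; read it without unpacking it first
zcat debug/build_log.gz | less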
Example investigation¶
This example was created on a host without the time utility; I tried to launch kernbench (output reduced):
# client/autotest-local --verbose run kernbench
10:01:59 INFO | Writing results to /usr/local/autotest/client/results/default
...
10:03:19 DEBUG| Running 'gzip -9 '/usr/local/autotest/client/results/default/kernbench/debug/build_log''
10:03:19 ERROR| Exception escaping from test:
Traceback (most recent call last):
File "/usr/local/autotest/client/shared/test.py", line 398, in _exec
*args, **dargs)
File "/usr/local/autotest/client/shared/test.py", line 823, in _call_test_function
return func(*args, **dargs)
File "/usr/local/autotest/client/shared/test.py", line 738, in _cherry_pick_call
return func(*p_args, **p_dargs)
File "/usr/local/autotest/client/tests/kernbench/kernbench.py", line 53, in warmup
self.kernel.build_timed(self.threads, output=logfile) # warmup run
File "/usr/local/autotest/client/kernel.py", line 377, in build_timed
utils.system(build_string)
File "/usr/local/autotest/client/shared/utils.py", line 1232, in system
verbose=verbose).exit_status
File "/usr/local/autotest/client/shared/utils.py", line 918, in run
"Command returned non-zero exit status")
CmdError: Command </usr/bin/time -o /dev/null make -j 4 vmlinux > /usr/local/autotest/client/results/default/kernbench/debug/build_log 2>&1> failed, rc=127, Command returned non-zero exit status
* Command:
/usr/bin/time -o /dev/null make -j 4 vmlinux >
/usr/local/autotest/client/results/default/kernbench/debug/build_log 2>&1
Exit status: 127
Duration: 0.00197100639343
Here we are investigating why kernbench failed. The first place we want to look at is the debug directory. There we see the following files:
# tree -s debug/
debug/
├── [ 79] build_log.gz
├── [ 1345] client.DEBUG
├── [ 0] client.ERROR
├── [ 511] client.INFO
└── [ 0] client.WARNING
As it failed during the build phase, I am going to look at build_log:
$ cat build_log
/bin/bash: /usr/bin/time: No such file or directory
Well, that is true as:
[user@a5 debug]# which time
/usr/bin/which: no time in (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin)
[user@a5 debug]# ls /usr/bin/time
ls: cannot access /usr/bin/time: No such file or directory
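The fix here is simply to install the missing utility. The package name and package manager vary by distribution; on a yum-based system, for example:
# install the GNU time utility that the kernbench warmup invokes
yum install time
# confirm the exact path the failing command referenced now exists
ls /usr/bin/time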
In general, test diagnosis should be that straightforward. Obviously, this cannot cover all cases.
The sysinfo directory¶
The sysinfo directory is exactly what it sounds like: a directory that contains as much information as can be gathered from the machine:
# tree sysinfo/
sysinfo/
├── df
├── dmesg.gz
├── messages.gz
└── reboot_current -> ../../sysinfo
In general, this directory is your second-best bet for finding issues. Most files are self-explanatory; you should always examine dmesg to make sure your boot was clean. Then, depending on which test failed, examine the files that give you insight into that particular piece of hardware.
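For example, to give the captured boot log a quick once-over (a sketch; the patterns worth searching for depend on your hardware and the failing test):
# scan the compressed dmesg for anything suspicious
zcat sysinfo/dmesg.gz | grep -iE 'error|warn|fail'
# messages.gz holds the system log captured during the run
zcat sysinfo/messages.gz | tail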
Manually running a job on a machine that is causing problems¶
Often you will run into the case where all but two or three of your machines pass. While you may be able to figure out why those machines failed by looking at the result files, it is sometimes advantageous to run the Autotest process individually on the problem machines.
Log in to the machine and change to /home/autotest, where you will find the installation that the server put on this particular system.
The last control file of the job that was run is also available to you: control.autoserv.
To start the job over again run the following:
[root@udc autotest]# bin/autotest control.autoserv
This is exactly how the autotest server starts jobs on client machines.
If you have a large control file that runs multiple tests and you are only interested in one or two of them, you can safely edit the file and remove any tests that you know work. Much of the time, failures can be diagnosed by babysitting a machine and watching what else is going on with general diagnostics.
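For example, you could write a trimmed control file and run it the same way (a sketch: control files are Python and job.run_test is the standard entry point, but control.single is a hypothetical name, and your real control.autoserv may contain job-specific setup you need to keep):
# trimmed control file that runs only the test under investigation
cat > control.single <<'EOF'
job.run_test('kernbench')
EOF
# launch it exactly as the server would
bin/autotest control.single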