Performance Regressions

With Iai-Callgrind you can define limits for each event kind over which a performance regression is assumed. By default, Iai-Callgrind performs no regression checks; you have to opt in with Callgrind::limits at benchmark level in a LibraryBenchmarkConfig or BinaryBenchmarkConfig, or at a global level with command-line arguments or environment variables.

Note that comparing baselines also detects performance regressions. This can be useful, for example, when running Iai-Callgrind in CI to make a PR fail if it regresses compared to the main branch.
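A minimal CI sketch of this setup: on the branch you compare against (e.g. main), save the run under a named baseline; on the PR branch, compare against it. The --save-baseline/--baseline flags are Iai-Callgrind's baseline options, and the benchmark name lib_bench_regression is taken from the example in this chapter:

```shell
# On the main branch: store the current results under the baseline "main"
cargo bench --bench lib_bench_regression -- --save-baseline=main

# On the PR branch: compare against the saved "main" baseline instead of
# the previous run; detected regressions fail the benchmark run
cargo bench --bench lib_bench_regression -- --baseline=main
```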

Define a performance regression

A performance regression check consists of an EventKind and a percentage. If the percentage is negative, a regression is assumed whenever the counts do not fall below this limit, i.e. whenever the expected improvement is not achieved.

The default EventKind is EventKind::Ir with a value of +10%.
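The check itself is plain percentage arithmetic. Here is a minimal sketch of the rule (an illustration for this guide, not Iai-Callgrind's actual implementation), using the instruction counts from the example runs in this chapter:

```rust
/// Percentage change from the `old` to the `new` event count.
fn percent_diff(old: u64, new: u64) -> f64 {
    (new as f64 - old as f64) / old as f64 * 100.0
}

/// A positive limit flags any increase above it; a negative limit flags
/// every run that does not improve by at least that percentage.
fn is_regression(old: u64, new: u64, limit_percent: f64) -> bool {
    percent_diff(old, new) > limit_percent
}

fn main() {
    assert!(is_regression(152, 264, 5.0)); // +73.68% exceeds the +5% limit
    assert!(!is_regression(152, 150, 5.0)); // a small improvement passes
    assert!(is_regression(100, 99, -5.0)); // -1% misses the required -5%
    assert!(!is_regression(100, 94, -5.0)); // -6% achieves the required -5%
}
```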

For example, in a Library Benchmark, define a limit of +5% for the total instructions executed (the Ir event kind) in all benchmarks of this file:

// Stub of the library under test
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }

use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, Callgrind, EventKind,
    LibraryBenchmarkConfig,
};
use std::hint::black_box;

#[library_benchmark]
#[bench::worst_case(vec![3, 2, 1])]
fn bench_library(data: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(data))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

main!(
    config = LibraryBenchmarkConfig::default()
        .tool(Callgrind::default()
            // Fail if the instruction count increases by more than +5%
            .limits([(EventKind::Ir, 5.0)])
        );
    library_benchmark_groups = my_group
);

Now, if the comparison of the Ir events of the current bench_library run with the previous run shows an increase of more than 5%, the benchmark fails. Please also have a look at the API docs for further configuration options.

Running the above benchmark for the first time results in the following output:

lib_bench_regression::my_group::bench_library worst_case:vec! [3, 2, 1]
  Instructions:                         152|N/A                  (*********)
  L1 Hits:                              201|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               5|N/A                  (*********)
  Total read+write:                     206|N/A                  (*********)
  Estimated Cycles:                     376|N/A                  (*********)

Iai-Callgrind result: Ok. 1 without regressions; 0 regressed; 1 benchmarks finished in 0.14477s

Let's assume a change in my_lib::bubble_sort has a negative impact on performance; running the benchmark again then results in output similar to this:

lib_bench_regression::my_group::bench_library worst_case:vec! [3, 2, 1]
  Instructions:                         264|152                  (+73.6842%) [+1.73684x]
  L1 Hits:                              341|201                  (+69.6517%) [+1.69652x]
  L2 Hits:                                0|0                    (No change)
  RAM Hits:                               6|5                    (+20.0000%) [+1.20000x]
  Total read+write:                     347|206                  (+68.4466%) [+1.68447x]
  Estimated Cycles:                     551|376                  (+46.5426%) [+1.46543x]
Performance has regressed: Instructions (152 -> 264) regressed by +73.6842% (>+5.00000%)

Regressions:

  lib_bench_regression::my_group::bench_library:
    Instructions (152 -> 264): +73.6842% exceeds limit of +5.00000%

Iai-Callgrind result: Regressed. 0 without regressions; 1 regressed; 1 benchmarks finished in 0.14849s
error: bench failed, to rerun pass `-p benchmark-tests --bench lib_bench_regression`

Caused by:
  process didn't exit successfully: `/home/lenny/workspace/programming/iai-callgrind/target/release/deps/lib_bench_regression-98382b533bca8f56 --bench` (exit status: 3)

Which event to choose to measure performance regressions?

If in doubt, the definite answer is Ir (instructions executed). If the Ir event counts decrease noticeably, the function (binary) runs faster. The inverse is also true: if the Ir counts increase noticeably, the function (binary) has slowed down.

These statements are not as easy to transfer to Estimated Cycles and the other event counts. But depending on the scenario and the function (binary) under test, it can be reasonable to define additional regression checks.
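For example, to additionally guard the cycle estimate, the configuration from the example above can carry more than one limit (a sketch, assuming EventKind::EstimatedCycles is the variant for Estimated Cycles in the version you use):

```rust
// Drop-in replacement for the `config` in the `main!` invocation above:
// regress on instructions (+5%) and on estimated cycles (+10%)
LibraryBenchmarkConfig::default()
    .tool(Callgrind::default()
        .limits([
            (EventKind::Ir, 5.0),
            (EventKind::EstimatedCycles, 10.0),
        ])
    )
```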

Who actually uses instructions to measure performance?

The ones known to the author of this humble guide are

  • SQLite: They mainly use CPU instructions to measure performance improvements (and regressions).
  • Instruction counts also play a major role in the benchmarks of the rustc compiler, but they use cache metrics and cycles, too.

If you know of others, please feel free to add them to this list.