Detecting Performance Regressions
With Iai-Callgrind you can define limits for each callgrind/cachegrind event
kind or dhat metric over which a performance regression can be assumed. By
default, Iai-Callgrind does not perform regression checks, and you have to
opt in with Callgrind::soft_limits, Callgrind::hard_limits,
Cachegrind::soft_limits, ... at benchmark level in LibraryBenchmarkConfig::tool
or BinaryBenchmarkConfig::tool, or at a more global level with command-line
arguments or environment variables, see below.
For a soft limit, a performance regression check consists of an EventKind,
CachegrindMetric or DhatMetric and a percentage. If the percentage is
negative, a regression is assumed when the change falls below this limit.
Hard limits restrict the EventKind, ... by an absolute number.
Note that comparing baselines also detects performance regressions. This can be useful, for example, when setting up Iai-Callgrind in the CI to cause a PR to fail when comparing to the main branch.
Regressions are considered errors and will cause the benchmark to fail if they
occur, and Iai-Callgrind will exit with error code 3.
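To make the difference between soft and hard limits concrete, here is a minimal Rust sketch of the check described above. It is an illustrative model and not Iai-Callgrind's implementation: the names Limit, percent_change and is_regression are hypothetical, and whether the real comparison is strict at the boundary is an assumption.

```rust
/// A limit as described above (hypothetical type for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Limit {
    /// Percentage relative to the previous run, may be negative.
    Soft(f64),
    /// Absolute upper bound for the metric of the current run.
    Hard(f64),
}

/// Relative change from `old` to `new` in percent.
/// Note: `old == 0.0` yields an infinite or NaN result; this sketch ignores that case.
pub fn percent_change(old: f64, new: f64) -> f64 {
    (new - old) / old * 100.0
}

/// Returns true if the change from `old` to `new` violates `limit`.
/// Assumption: the boundary value itself does not count as a regression.
pub fn is_regression(old: f64, new: f64, limit: Limit) -> bool {
    match limit {
        // Positive soft limit: regression if the metric grew by more than `pct`.
        Limit::Soft(pct) if pct >= 0.0 => percent_change(old, new) > pct,
        // Negative soft limit: regression if the metric dropped below `pct`.
        Limit::Soft(pct) => percent_change(old, new) < pct,
        // Hard limit: only the new absolute value matters.
        Limit::Hard(max) => new > max,
    }
}
```

With this model, a change from 152 to 264 instructions violates a soft limit of 5% as well as a hard limit of 200, while a change from 152 to 155 violates neither.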
Defining limits on the command-line
Limits can be defined on the command line, per tool, with --callgrind-limits
(IAI_CALLGRIND_CALLGRIND_LIMITS), --cachegrind-limits
(IAI_CALLGRIND_CACHEGRIND_LIMITS) and --dhat-limits
(IAI_CALLGRIND_DHAT_LIMITS). Command-line limits overwrite the limits
specified in the benchmark file (see below).
To disambiguate between soft and hard limits, soft limits have to be suffixed
with a %. Hard limits are bare numbers. For example, to limit the increase of
the total instructions executed ir (printed as Instructions in the callgrind
terminal output) to 5%:
cargo bench --bench iai_callgrind_benchmark -- --callgrind-limits='ir=5%'
These command-line arguments and environment variables can be used to define
soft limits and hard limits in one go with the | operator (e.g.
--callgrind-limits='ir=5%|10000') or multiple limits at once, separated by a
comma (e.g. --callgrind-limits='ir=5%|10000,totalrw=2%').
For a list of all allowed callgrind metrics (like ir) see the docs of
EventKind, for cachegrind metrics CachegrindMetric and for dhat metrics
DhatMetric. It is sometimes more convenient to define limits for whole
groups with the @ operator: --callgrind-limits='@all=5%'. All allowed
groups and their members for callgrind metrics can be found in
CallgrindMetrics, for cachegrind metrics in CachegrindMetrics and for dhat
metrics in DhatMetrics.
If the same EventKind, ... is specified multiple times, each specification
overwrites the previous one, so the last one wins. This is useful, for example,
to specify a limit for all event kinds and then overwrite the limit for a
specific event kind:
--callgrind-limits='@all=10%,ir=5%'
The format, short names and groups in full detail
For --callgrind-limits:
arg        ::= pair ("," pair)*
pair       ::= key "=" value ("|" value)*
key        ::= group | event              ; matched case-insensitive
group      ::= "@" ( "default"
                   | "all"
                   | ("cachemisses" | "misses" | "ms")
                   | ("cachemissrates" | "missrates" | "mr")
                   | ("cachehits" | "hits" | "hs")
                   | ("cachehitrates" | "hitrates" | "hr")
                   | ("cachesim" | "cs")
                   | ("cacheuse" | "cu")
                   | ("systemcalls" | "syscalls" | "sc")
                   | ("branchsim" | "bs")
                   | ("writebackbehaviour" | "writeback" | "wb")
                   )
event      ::= EventKind
value      ::= soft_limit | hard_limit
soft_limit ::= (integer | float) "%"      ; can be negative
hard_limit ::= (integer | float)          ; float is only allowed for EventKinds which are
                                          ; float like `L1HitRate` but not `L1Hits`
with:
- Groups with a long name have their allowed abbreviations placed in the same parentheses.
- EventKind is the exact name of the enum variant (case insensitive)
- integer is a u64 and float is an f64
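The grammar above can be read as a small parser. The following Rust sketch follows the arg, pair and value rules under simplifying assumptions: it is not Iai-Callgrind's parser, the names Limit, parse_value and parse_limits are hypothetical, keys are neither validated against EventKind nor expanded from groups, hard limits are stored as f64 instead of u64, and a repeated key replaces all earlier values for that key.

```rust
use std::collections::HashMap;

/// A parsed limit value (hypothetical type for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Limit {
    Soft(f64),
    Hard(f64),
}

/// Parse one `value`: a trailing `%` marks a soft limit, a bare number a hard limit.
fn parse_value(s: &str) -> Result<Limit, String> {
    let s = s.trim();
    let (number, is_soft) = match s.strip_suffix('%') {
        Some(rest) => (rest, true),
        None => (s, false),
    };
    let n: f64 = number
        .trim()
        .parse()
        .map_err(|_| format!("invalid number in '{s}'"))?;
    Ok(if is_soft { Limit::Soft(n) } else { Limit::Hard(n) })
}

/// Parse `arg ::= pair ("," pair)*` into a map from lowercased key to its limits.
/// Keys are stored as written (for example "ir" or "@all"), only lowercased.
pub fn parse_limits(arg: &str) -> Result<HashMap<String, Vec<Limit>>, String> {
    let mut limits = HashMap::new();
    for pair in arg.split(',') {
        let (key, values) = pair
            .split_once('=')
            .ok_or_else(|| format!("missing '=' in '{pair}'"))?;
        let parsed = values
            .split('|')
            .map(parse_value)
            .collect::<Result<Vec<_>, _>>()?;
        // Inserting under an existing key replaces the old entry: the last one wins.
        limits.insert(key.trim().to_lowercase(), parsed);
    }
    Ok(limits)
}
```

Parsing 'ir=5%|10000,totalrw=2%' with this sketch yields a soft limit of 5 and a hard limit of 10000 for ir, and a soft limit of 2 for totalrw. Parsing 'Ir=10%,ir=5%' leaves only the soft limit of 5 for ir.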
For --cachegrind-limits replace the group and event from above with:
group ::= "@" ( "default"
              | "all"
              | ("cachemisses" | "misses" | "ms")
              | ("cachemissrates" | "missrates" | "mr")
              | ("cachehits" | "hits" | "hs")
              | ("cachehitrates" | "hitrates" | "hr")
              | ("cachesim" | "cs")
              | ("branchsim" | "bs")
              )
event ::= CachegrindMetric
For --dhat-limits replace the group and event from above with:
group ::= "@" ( "default" | "all" )
event ::= ( "totalunits" | "tun" )
        | ( "totalevents" | "tev" )
        | ( "totalbytes" | "tb" )
        | ( "totalblocks" | "tbk" )
        | ( "attgmaxbytes" | "gb" )
        | ( "attgmaxblocks" | "gbk" )
        | ( "attendbytes" | "eb" )
        | ( "attendblocks" | "ebk" )
        | ( "readsbytes" | "rb" )
        | ( "writesbytes" | "wb" )
        | ( "totallifetimes" | "tl" )
        | ( "maximumbytes" | "mb" )
        | ( "maximumblocks" | "mbk" )
Define a performance regression check in a benchmark
For example, in a Library Benchmark, define a soft limit of +5% for the Ir
event kind for all benchmarks of this file:
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, Callgrind, EventKind,
    LibraryBenchmarkConfig,
};
use std::hint::black_box;

// Placeholder for the library under test
mod my_lib {
    pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> {
        vec![]
    }
}

#[library_benchmark]
#[bench::worst_case(vec![3, 2, 1])]
fn bench_library(data: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(data))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

// Soft limit: fail if Ir increases by more than 5% compared to the previous run
main!(
    config = LibraryBenchmarkConfig::default()
        .tool(Callgrind::default()
            .soft_limits([(EventKind::Ir, 5.0)])
        );
    library_benchmark_groups = my_group
);
Now, if the comparison of the Ir events of the current bench_library
benchmark run with the previous run results in an increase of over 5%, the
benchmark fails. Running the benchmark from above the first time results in the
following output:
lib_bench_regression::my_group::bench_library worst_case:vec! [3, 2, 1]
Instructions: 152|N/A (*********)
L1 Hits: 201|N/A (*********)
LL Hits: 0|N/A (*********)
RAM Hits: 5|N/A (*********)
Total read+write: 206|N/A (*********)
Estimated Cycles: 376|N/A (*********)
Iai-Callgrind result: Ok. 1 without regressions; 0 regressed; 1 benchmarks finished in 0.14477s
Let's assume there's a change in my_lib::bubble_sort with a negative impact on
the performance. Running the benchmark again then produces output similar to
this:
lib_bench_regression::my_group::bench_library worst_case:vec! [3, 2, 1]
Instructions: 264|152 (+73.6842%) [+1.73684x]
L1 Hits: 341|201 (+69.6517%) [+1.69652x]
LL Hits: 0|0 (No change)
RAM Hits: 6|5 (+20.0000%) [+1.20000x]
Total read+write: 347|206 (+68.4466%) [+1.68447x]
Estimated Cycles: 551|376 (+46.5426%) [+1.46543x]
Performance has regressed: Instructions (152 -> 264) regressed by +73.6842% (>+5.00000%)
Regressions:
lib_bench_regression::my_group::bench_library:
Instructions (152 -> 264): +73.6842% exceeds limit of +5.00000%
Iai-Callgrind result: Regressed. 0 without regressions; 1 regressed; 1 benchmarks finished in 0.14849s
error: bench failed, to rerun pass `-p benchmark-tests --bench lib_bench_regression`
Caused by:
process didn't exit successfully: `/home/lenny/workspace/programming/iai-callgrind/target/release/deps/lib_bench_regression-98382b533bca8f56 --bench` (exit status: 3)
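The percentages and factors in the output above can be reproduced by hand. The following Rust sketch recomputes them from the new and old counts; the function names are hypothetical, and the formatting is only matched against the increases shown in this output (how decreases are displayed is not modelled here).

```rust
/// Relative change from `old` to `new` in percent (the value in parentheses).
pub fn diff_percent(new: u64, old: u64) -> f64 {
    (new as f64 - old as f64) / old as f64 * 100.0
}

/// Ratio of `new` to `old` (the value in square brackets, for increases).
pub fn factor(new: u64, old: u64) -> f64 {
    new as f64 / old as f64
}

/// Format a metric the way the comparison lines above display an increase.
pub fn format_change(new: u64, old: u64) -> String {
    format!(
        "{new}|{old} ({:+.4}%) [{:+.5}x]",
        diff_percent(new, old),
        factor(new, old)
    )
}
```

For the Instructions line, format_change(264, 152) gives "264|152 (+73.6842%) [+1.73684x]": (264 - 152) / 152 is roughly 0.7368, which is far above the soft limit of 5% and therefore reported as a regression.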
Which event to choose to measure performance regressions?
For callgrind/cachegrind, and if in doubt, the answer is Ir (instructions
executed). If Ir event counts decrease noticeably, the function (binary) runs
faster. The inverse is also true: if the Ir counts increase noticeably, the
function (binary) slows down.
These statements do not transfer as easily to Estimated Cycles, cache
metrics and most of the other event counts. But, depending on the scenario and
the function (binary) under test, it can be reasonable to define more regression
checks.
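One reason Estimated Cycles is harder to interpret is that it is derived from the cache hit counts instead of being counted directly. The Rust sketch below assumes the weighting L1 Hits + 5 × LL Hits + 35 × RAM Hits and Total read+write as the plain sum of the three hit counts. Both assumptions reproduce the values in the example output above; the function names are hypothetical.

```rust
/// Total read+write, assumed to be the sum of all cache and RAM hits.
pub fn total_read_write(l1_hits: u64, ll_hits: u64, ram_hits: u64) -> u64 {
    l1_hits + ll_hits + ram_hits
}

/// Estimated cycles, assuming weights of 1 (L1), 5 (LL) and 35 (RAM).
pub fn estimated_cycles(l1_hits: u64, ll_hits: u64, ram_hits: u64) -> u64 {
    l1_hits + 5 * ll_hits + 35 * ram_hits
}
```

With the first run from above (201 L1 hits, 0 LL hits, 5 RAM hits) this gives 206 for Total read+write and 376 for Estimated Cycles. Under this weighting a single additional RAM hit adds 35 cycles, so Estimated Cycles can change visibly through cache effects alone, even when the instruction count stays the same.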
Who actually uses instructions to measure performance?
The ones known to the author of this humble guide are:
- SQLite: They mainly use CPU instructions to measure performance improvements (and regressions).
- The rustc compiler and compiler-builtins: Instruction counts play a major role in their benchmarks, but they also use cache metrics and cycles.
- SpacetimeDB
If you know of others, please feel free to add them to this list.