Introduction

This is the guide for Iai-Callgrind, a benchmarking framework/harness which uses Valgrind's Callgrind and other Valgrind tools like DHAT, Massif, ... to provide extremely accurate and consistent measurements of Rust code, making it perfectly suited to run in environments like a CI.

Iai-Callgrind is fully documented in this guide and in the API documentation at docs.rs.

Iai-Callgrind is

  • Precise: High-precision measurements of Instruction counts and many other metrics allow you to reliably detect very small optimizations and regressions of your code.
  • Consistent: Iai-Callgrind can take accurate measurements even in virtualized CI environments and make them comparable between different systems, completely negating the noise of the environment.
  • Fast: Each benchmark is only run once, which is usually much faster than benchmarks which measure execution and wall-clock time. Benchmarks measuring the wall-clock time have to be run many times to increase their accuracy, detect outliers, filter out noise, etc.
  • Visualizable: Iai-Callgrind generates a Callgrind (DHAT, ...) profile of the benchmarked code and can be configured to create flamegraph-like charts from Callgrind metrics. In general, all tools of the Valgrind ecosystem, like callgrind_annotate, kcachegrind or dh_view.html, are fully supported for analyzing the results in detail.
  • Easy: The API for setting up benchmarks is easy to use and allows you to quickly create concise and clear benchmarks. Focus more on profiling and your code than on the framework.

Design philosophy and goals

Iai-Callgrind benchmarks are designed to be runnable with cargo bench. The benchmark files are expanded to a benchmarking harness which replaces the native benchmark harness of Rust. Iai-Callgrind is a profiling framework that can quickly and reliably detect performance regressions and optimizations even in noisy environments with a precision that is impossible to achieve with wall-clock time based benchmarks. At the same time, we want to abstract the complicated parts and repetitive tasks away and provide an easy-to-use and intuitive API. Iai-Callgrind tries to stay out of your way so you can focus more on profiling and your code!

When not to use Iai-Callgrind

Although Iai-Callgrind is useful in many projects, there are cases where Iai-Callgrind is not a good fit.

  • If you need wall-clock times, Iai-Callgrind cannot help you much. The estimation of CPU cycles merely correlates with wall-clock time but is not a replacement for it. The cycles estimation is primarily designed as a relative metric to be used for comparison.
  • Iai-Callgrind cannot be run on Windows and platforms not supported by Valgrind.

Improving Iai-Callgrind

No one's perfect!

You want to share your experience with Iai-Callgrind and have a recipe that might be useful for others and fits into this guide? You have an idea for a new feature, are missing a functionality or have found a bug? We would love to hear about it. You want to contribute and hack on Iai-Callgrind?

Please don't hesitate to open an issue.

You want to hack on this guide? The source code of this book lives in the docs subdirectory.

Getting Help

Reach out to us on GitHub Discussions or open an issue in the Iai-Callgrind repository. Check the open and closed issues in the issue board; maybe you can already find a solution to your problem there.

The API documentation can be found on docs.rs, but you might also want to check out the Troubleshooting section in the sidebar of this guide.

Prerequisites

In order to use Iai-Callgrind, you must have Valgrind installed. This means that Iai-Callgrind cannot be used on platforms that are not supported by Valgrind.

Debug Symbols

Iai-Callgrind benchmarks need to be run with debug symbols switched on. For example, in your ~/.cargo/config or your project's Cargo.toml:

[profile.bench]
debug = true

Now, all benchmarks which are run with cargo bench include the debug symbols. (See also Cargo Profiles and Cargo Config).

Settings like strip = true or other configuration options that strip debug symbols need to be disabled explicitly for the bench profile if you have changed them for the release profile. For example:

[profile.release]
strip = true

[profile.bench]
debug = true
strip = false

Valgrind Client Requests

If you want to make use of the mighty Valgrind Client Request Mechanism shipped with Iai-Callgrind, you also need libclang (clang >= 5.0) installed. See also the requirements of bindgen and of cc.

More details on the usage and requirements of Valgrind Client Requests can be found in this chapter of the guide.

Installation of Valgrind

Iai-Callgrind is intentionally independent of a specific version of Valgrind. However, Iai-Callgrind has only been tested with Valgrind versions >= 3.20.0, so it is highly recommended to use a recent version: bugs get fixed, the supported platforms are expanded, and so on. Also, if you want or need to, building Valgrind from source is usually a straightforward process. Just make sure the valgrind binary is in your $PATH so that Iai-Callgrind can find it.
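
As a rough sketch (the exact steps can differ between Valgrind releases, so check the README shipped with the sources), building a release tarball usually boils down to:

# assuming a Valgrind release tarball has been downloaded and unpacked
cd valgrind-3.23.0
./configure --prefix="$HOME/.local"   # pick a prefix whose bin directory is in your $PATH
make
make install
valgrind --version                    # verify the binary is found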

Installation of Valgrind with your package manager

Alpine Linux

apk add valgrind

Arch Linux

pacman -Sy valgrind

Debian/Ubuntu

apt-get install valgrind

Fedora Linux

dnf install valgrind

FreeBSD

pkg install valgrind

Valgrind is packaged for many other distributions as well; see the Packaging status overview.

Iai-Callgrind

Iai-Callgrind is divided into the library iai-callgrind and the benchmark runner iai-callgrind-runner.

Installation of the library

To start with Iai-Callgrind, add the following to your Cargo.toml file:

[dev-dependencies]
iai-callgrind = "0.14.0"

or run

cargo add --dev iai-callgrind@0.14.0

Installation of the benchmark runner

To be able to run the benchmarks you'll also need the iai-callgrind-runner binary installed somewhere in your $PATH. Otherwise, there is no need to interact with iai-callgrind-runner as it is just an implementation detail.

From Source

cargo install --version 0.14.0 iai-callgrind-runner

You can also install the binary somewhere else and point the IAI_CALLGRIND_RUNNER environment variable to the absolute path of the iai-callgrind-runner binary, like so:

cargo install --version 0.14.0 --root /tmp iai-callgrind-runner
IAI_CALLGRIND_RUNNER=/tmp/bin/iai-callgrind-runner cargo bench --bench my-bench

Binstall

The iai-callgrind-runner binary is pre-built for most platforms supported by Valgrind and easily installable with binstall:

cargo binstall iai-callgrind-runner@0.14.0

Updating

When updating the iai-callgrind library, you'll also need to update iai-callgrind-runner and vice-versa or else the benchmark runner will exit with an error.

In the Github CI

Since the iai-callgrind-runner version must match the iai-callgrind library version, it's best to automate this step in the CI. A job step in the GitHub Actions CI could look like this:

- name: Install iai-callgrind-runner
  run: |
    version=$(cargo metadata --format-version=1 |\
      jq '.packages[] | select(.name == "iai-callgrind").version' |\
      tr -d '"'
    )
    cargo install iai-callgrind-runner --version $version

Or, speed up the overall installation time with binstall using the taiki-e/install-action:

- uses: taiki-e/install-action@cargo-binstall
- name: Install iai-callgrind-runner
  run: |
    version=$(cargo metadata --format-version=1 |\
      jq '.packages[] | select(.name == "iai-callgrind").version' |\
      tr -d '"'
    )
    cargo binstall --no-confirm iai-callgrind-runner --version $version

Overview

Iai-Callgrind can be used to benchmark the libraries and binaries of your project's crates. Library and binary benchmarks are treated differently by Iai-Callgrind and cannot be intermixed in the same benchmark file. This is intentional and helps keep things organized. Having multiple, separate benchmark files for library and binary benchmarks is no problem for Iai-Callgrind and is usually a good idea anyway. Benchmarking different binaries in the same benchmark file, however, is fully supported.
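
For example, a project with one library benchmark file and one binary benchmark file (the file names here are just placeholders) would register both in Cargo.toml like this:

[[bench]]
name = "library_benchmark"
harness = false

[[bench]]
name = "binary_benchmark"
harness = false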

Head over to the Quickstart section of library benchmarks if you want to start benchmarking your library functions or to the Quickstart section of binary benchmarks if you want to start benchmarking your crate's binary (binaries).

Binary Benchmarks vs Library Benchmarks

Almost all binary benchmarks can be written as library benchmarks. For example, if you have a main.rs file of your binary, which basically looks like this

mod my_lib { pub fn run() {} }
use my_lib::run;

fn main() {
    run();
}

you could also choose to benchmark the library function my_lib::run in a library benchmark instead of the binary in a binary benchmark. There's no real downside to either of the benchmark schemes and which scheme you want to use heavily depends on the structure of your binary. As a maybe obvious rule of thumb, micro-benchmarks of specific functions should go into library benchmarks and macro-benchmarks into binary benchmarks. Generally, choose the closest access point to the program point you actually want to benchmark.

You should always choose binary benchmarks over library benchmarks if you want to benchmark the behaviour of the executable when the input comes from a pipe, since this feature is exclusive to binary benchmarks. See The Command's stdin and simulating piped input for more.

Library Benchmarks

You want to dive into benchmarking your library? Best start with the Quickstart section and then go through the examples in the other sections of this guide. If you need more examples, see here.

Important default behaviour

The environment variables are cleared before running a library benchmark. Have a look at the Configuration section if you need to change that behavior. Iai-Callgrind sometimes deviates from the Valgrind defaults, which are:

Iai-Callgrind             Valgrind (v3.23)
--trace-children=yes      --trace-children=no
--fair-sched=try          --fair-sched=no
--separate-threads=yes    --separate-threads=no
--cache-sim=yes           --cache-sim=no

The thread- and subprocess-specific Valgrind options basically enable tracing threads and subprocesses, but there is usually some additional configuration necessary to actually collect the metrics of threads and subprocesses.

As shown in the table above, the benchmarks run with cache simulation switched on. This adds run time. If you don't need the cache metrics and the estimation of cycles, you can easily switch cache simulation off, for example with:

#![allow(unused)]
fn main() {
extern crate iai_callgrind;
use iai_callgrind::LibraryBenchmarkConfig;

LibraryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
}

To switch off cache simulation for all benchmarks in the same file:

extern crate iai_callgrind;
mod my_lib { pub fn fibonacci(a: u64) -> u64 { a } }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig
};
use std::hint::black_box;

#[library_benchmark]
fn bench_fibonacci() -> u64 {
    black_box(my_lib::fibonacci(10))
}

library_benchmark_group!(name = fibonacci_group; benchmarks = bench_fibonacci);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
    library_benchmark_groups = fibonacci_group
);
}

Quickstart

Create a file $WORKSPACE_ROOT/benches/library_benchmark.rs and add

[[bench]]
name = "library_benchmark"
harness = false

to your Cargo.toml. harness = false tells cargo not to use the default Rust benchmarking harness, which is important because Iai-Callgrind brings its own benchmarking harness.

Then copy the following content into this file:

extern crate iai_callgrind;
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

#[library_benchmark]
#[bench::short(10)]
#[bench::long(30)]
fn bench_fibonacci(value: u64) -> u64 {
    black_box(fibonacci(value))
}

library_benchmark_group!(
    name = bench_fibonacci_group;
    benchmarks = bench_fibonacci
);

fn main() {
main!(library_benchmark_groups = bench_fibonacci_group);
}

Now that your first library benchmark is set up, you can run it with

cargo bench

and should see something like the below

library_benchmark::bench_fibonacci_group::bench_fibonacci short:10
  Instructions:                1734|N/A             (*********)
  L1 Hits:                     2359|N/A             (*********)
  L2 Hits:                        0|N/A             (*********)
  RAM Hits:                       3|N/A             (*********)
  Total read+write:            2362|N/A             (*********)
  Estimated Cycles:            2464|N/A             (*********)
library_benchmark::bench_fibonacci_group::bench_fibonacci long:30
  Instructions:            26214734|N/A             (*********)
  L1 Hits:                 35638616|N/A             (*********)
  L2 Hits:                        2|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:        35638622|N/A             (*********)
  Estimated Cycles:        35638766|N/A             (*********)
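
The Estimated Cycles metric is derived from the cache metrics. As the numbers above suggest, it follows the common formula Estimated Cycles = L1 Hits + 5 * L2 Hits + 35 * RAM Hits; for the short benchmark that is 2359 + 5 * 0 + 35 * 3 = 2464.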

In addition, you'll find the callgrind output files and the output of other Valgrind tools in target/iai, if you want to investigate further with a tool like callgrind_annotate.
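
For example, to annotate the short benchmark from above (the path below is illustrative; the actual layout follows the target/iai/<package>/<benchmark file>/<group>/<function.id> scheme and the package name depends on your project):

callgrind_annotate \
    target/iai/my_package/library_benchmark/bench_fibonacci_group/bench_fibonacci.short/callgrind.bench_fibonacci.short.out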

When running the same benchmark again, the output will report the differences between the current and the previous run. Say you've made a change to the fibonacci function; then you may see something like this:

library_benchmark::bench_fibonacci_group::bench_fibonacci short:10
  Instructions:                2805|1734            (+61.7647%) [+1.61765x]
  L1 Hits:                     3815|2359            (+61.7211%) [+1.61721x]
  L2 Hits:                        0|0               (No change)
  RAM Hits:                       3|3               (No change)
  Total read+write:            3818|2362            (+61.6427%) [+1.61643x]
  Estimated Cycles:            3920|2464            (+59.0909%) [+1.59091x]
library_benchmark::bench_fibonacci_group::bench_fibonacci long:30
  Instructions:            16201597|26214734        (-38.1966%) [-1.61803x]
  L1 Hits:                 22025876|35638616        (-38.1966%) [-1.61803x]
  L2 Hits:                        2|2               (No change)
  RAM Hits:                       4|4               (No change)
  Total read+write:        22025882|35638622        (-38.1966%) [-1.61803x]
  Estimated Cycles:        22026026|35638766        (-38.1964%) [-1.61803x]

Anatomy of a library benchmark

We're reusing our example from the Quickstart section.

extern crate iai_callgrind;
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

#[library_benchmark]
#[bench::short(10)]
#[bench::long(30)]
fn bench_fibonacci(value: u64) -> u64 {
    black_box(fibonacci(value))
}

library_benchmark_group!(
    name = bench_fibonacci_group;
    benchmarks = bench_fibonacci
);

fn main() {
main!(library_benchmark_groups = bench_fibonacci_group);
}

First of all, you need a public function in your library which you want to benchmark. In this example this is the fibonacci function which, for the sake of simplicity, lives in the benchmark file itself but doesn't have to. If it were located in my_lib::fibonacci, you would simply import that function with use my_lib::fibonacci and go on as shown above. Next, you need a library_benchmark_group! in which you specify the names of the benchmark functions. Finally, the benchmark harness is created by the main! macro.

The benchmark function

The benchmark function has to be annotated with the #[library_benchmark] attribute. The #[bench] attribute is an inner attribute of the #[library_benchmark] attribute. It consists of a mandatory id (the ID part in #[bench::ID(/* ... */)]) and, in its most basic form, an optional list of arguments which are passed to the benchmark function as parameters. Naturally, the parameters of the benchmark function must match the argument list of the #[bench] attribute. It is always a good idea to return something from the benchmark function; here it is the computed u64 value from the fibonacci function wrapped in a black_box. See the docs of std::hint::black_box for more information about its usage. Simply put, all values and variables in the benchmark function (but not in your library function) need to be wrapped in a black_box, except for the input parameters (here value), because Iai-Callgrind already does that for them. But it is not an error to black_box the value again.

The #[bench] attribute accepts any expression, including function calls. The following would have worked too and is one way to avoid the cost of the setup code being attributed to the benchmarked function:

extern crate iai_callgrind;
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

fn some_setup_func(value: u64) -> u64 {
    value + 10
}

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

#[library_benchmark]
#[bench::short(10)]
// Note the usage of the `some_setup_func` in the argument list of this #[bench]
#[bench::long(some_setup_func(20))]
fn bench_fibonacci(value: u64) -> u64 {
    black_box(fibonacci(value))
}

library_benchmark_group!(
   name = bench_fibonacci_group;
   benchmarks = bench_fibonacci
);

fn main() {
main!(library_benchmark_groups = bench_fibonacci_group);
}

Perhaps the most crucial part of setting up library benchmarks is keeping the body of the benchmark functions free of any setup or teardown code. There are other ways to avoid setup and teardown code in the benchmark function, which are discussed in full detail in the setup and teardown section.

The group

The names of the benchmark functions which should be benchmarked, here the single benchmark function bench_fibonacci, need to be specified in the benchmarks parameter of a library_benchmark_group!. You can create as many groups as you like and use them to organize related benchmarks. Each group needs a unique name.

The main macro

Each group you want to be benchmarked needs to be specified in the library_benchmark_groups parameter of the main! macro and you're all set.

The macros in more detail

This section is a brief reference to all the macros available in library benchmarks. Feel free to come back here from other sections if you need a reference. For the complete documentation of each macro see the API documentation.

For the following examples it is assumed that there is a file lib.rs in a crate named my_lib with the following content:

#![allow(unused)]
fn main() {
pub fn bubble_sort(mut array: Vec<i32>) -> Vec<i32> {
    for i in 0..array.len() {
        for j in 0..array.len() - i - 1 {
            if array[j + 1] < array[j] {
                array.swap(j, j + 1);
            }
        }
    }
    array
}
}

The #[library_benchmark] attribute

This attribute needs to be present on all benchmark functions specified in the library_benchmark_group. The benchmark function can then be further annotated with the inner #[bench] or #[benches] attributes.

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

#[library_benchmark]
#[bench::one(vec![1])]
#[benches::multiple(vec![1, 2], vec![1, 2, 3], vec![1, 2, 3, 4])]
fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(values))
}

library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort);
fn main() {
main!(library_benchmark_groups = bubble_sort_group);
}

The following parameters are accepted:

  • config: Takes a LibraryBenchmarkConfig
  • setup: A global setup function which is applied to all following #[bench] and #[benches] attributes if not overwritten by a setup parameter of these attributes.
  • teardown: Similar to setup but takes a global teardown function.
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    OutputFormat
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
           .truncate_description(None)
        )
)]
#[bench::one(vec![1])]
fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(values))
}

library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort);
fn main() {
main!(library_benchmark_groups = bubble_sort_group);
}

The #[bench] attribute

The basic structure is #[bench::some_id(/* parameters */)]. The part after the :: must be an id unique within the same #[library_benchmark]. This attribute accepts the following parameters:

  • args: A tuple with a list of arguments which are passed to the benchmark function. The parentheses also need to be present if there is only a single argument (#[bench::my_id(args = (10))]).
  • config: Accepts a LibraryBenchmarkConfig
  • setup: A function which takes the arguments specified in the args parameter and passes its return value to the benchmark function.
  • teardown: A function which takes the return value of the benchmark function.

If no other parameters besides args are present you can simply pass the arguments as a list of values. So, instead of #[bench::my_id(args = (10, 20))], you could also use the shorter #[bench::my_id(10, 20)].

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig};
use std::hint::black_box;

// This function is used to create a worst case array we want to sort with our implementation of
// bubble sort
pub fn worst_case(start: i32) -> Vec<i32> {
    if start.is_negative() {
        (start..0).rev().collect()
    } else {
        (0..start).rev().collect()
    }
}

#[library_benchmark]
#[bench::one(vec![1])]
#[bench::worst_two(args = (vec![2, 1]))]
#[bench::worst_four(args = (4), setup = worst_case)]
fn bench_bubble_sort(value: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(value))
}

library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort);
fn main() {
main!(library_benchmark_groups = bubble_sort_group);
}

The #[benches] attribute

This attribute is used to specify multiple benchmarks at once. It accepts the same parameters as the #[bench] attribute: args, config, setup and teardown, and additionally the file parameter which is explained in detail here. In contrast to the args parameter of #[bench], args takes an array of arguments.

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig};
use std::hint::black_box;

pub fn worst_case(start: i32) -> Vec<i32> {
    if start.is_negative() {
        (start..0).rev().collect()
    } else {
        (0..start).rev().collect()
    }
}

#[library_benchmark]
#[benches::worst_two_and_three(args = [vec![2, 1], vec![3, 2, 1]])]
#[benches::worst_four_to_nine(args = [4, 5, 6, 7, 8, 9], setup = worst_case)]
fn bench_bubble_sort(value: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(value))
}

library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort);
fn main() {
main!(library_benchmark_groups = bubble_sort_group);
}

The library_benchmark_group! macro

The library_benchmark_group macro accepts the following parameters (in this order and separated by a semicolon):

  • name (mandatory): A unique name used to identify the group for the main! macro
  • config (optional): A LibraryBenchmarkConfig which is applied to all benchmarks within the same group.
  • compare_by_id (optional): The default is false. If true, all benches in the benchmark functions specified in the benchmarks parameter are compared with each other as long as the ids (the part after the :: in #[bench::id(...)]) match. See also Comparing benchmark functions
  • setup (optional): A setup function or any valid expression which is run before all benchmarks of this group
  • teardown (optional): A teardown function or any valid expression which is run after all benchmarks of this group
  • benchmarks (mandatory): A list of comma separated paths of benchmark functions which are annotated with #[library_benchmark]

Note the setup and teardown parameters are different from the ones of #[library_benchmark], #[bench] and #[benches]. They accept an expression or function call as in setup = group_setup_function(). Also, these setup and teardown functions are not overridden by the ones from any of the aforementioned attributes. A group using these parameters is sketched below.
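
Here is a sketch of a group using all of these parameters in the order listed above, assuming my_bench is an existing benchmark function and using simple println! expressions as the group setup and teardown:

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig
};

#[library_benchmark]
fn my_bench() { /* ... */ }

library_benchmark_group!(
    name = my_group;
    config = LibraryBenchmarkConfig::default();
    compare_by_id = false;
    setup = println!("runs before all benchmarks of this group");
    teardown = println!("runs after all benchmarks of this group");
    benchmarks = my_bench
);

fn main() {
main!(library_benchmark_groups = my_group);
}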

The main! macro

This macro is the entry point for Iai-Callgrind and creates the benchmark harness. It accepts the following top-level arguments in this order (separated by a semicolon):

  • config (optional): Optionally specify a LibraryBenchmarkConfig
  • setup (optional): A setup function or any valid expression which is run before all benchmarks
  • teardown (optional): A teardown function or any valid expression which is run after all benchmarks
  • library_benchmark_groups (mandatory): The name of one or more library benchmark groups. Multiple names are separated by a comma.

Like the setup and teardown of the library_benchmark_group, these parameters accept an expression and are not overridden by the setup and teardown of the library_benchmark_group, #[library_benchmark], #[bench] or #[benches] attribute.
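
And a sketch of the main! macro using all top-level arguments in the order given above (my_group is assumed to be an existing library benchmark group):

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig
};

#[library_benchmark]
fn bench() {}

library_benchmark_group!(name = my_group; benchmarks = bench);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default();
    setup = println!("runs before all benchmarks");
    teardown = println!("runs after all benchmarks");
    library_benchmark_groups = my_group
);
}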

setup and teardown

setup and teardown are your bread and butter in library benchmarks. The benchmark functions need to be as clean as possible and almost always only contain the function call to the function of your library which you want to benchmark.

Setup

In an ideal world you don't need any setup code, and you can pass arguments to the function as they are.

But if, for example, a function expects a File and not a &str with the path to the file, you need setup code. Iai-Callgrind has an easy-to-use system in place to allow you to run any setup code before the function is executed, and this setup code is not attributed to the metrics of the benchmark.

If the setup parameter is specified, the setup function takes the arguments from the #[bench] (or #[benches]) attributes and the benchmark function receives the return value of the setup function as parameter. This is a small indirection with great effect. The effect is best shown with an example:

extern crate iai_callgrind;
mod my_lib { pub fn count_bytes_fast(_file: std::fs::File) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;
use std::fs::File;

fn open_file(path: &str) -> File {
    File::open(path).unwrap()
}

#[library_benchmark]
#[bench::first(args = ("path/to/file"), setup = open_file)]
fn count_bytes_fast(file: File) -> u64 {
    black_box(my_lib::count_bytes_fast(file))
}

library_benchmark_group!(name = my_group; benchmarks = count_bytes_fast);
fn main() {
main!(library_benchmark_groups = my_group);
}

You can actually see the effect of using a setup function in the output of the benchmark. Let's assume the above benchmark is in a file benches/my_benchmark.rs, then running

IAI_CALLGRIND_NOCAPTURE=true cargo bench

results in benchmark output like the below.

my_benchmark::my_group::count_bytes_fast first:open_file("path/to/file")
  Instructions:             1630162|N/A             (*********)
  L1 Hits:                  2507933|N/A             (*********)
  L2 Hits:                        2|N/A             (*********)
  RAM Hits:                      11|N/A             (*********)
  Total read+write:         2507946|N/A             (*********)
  Estimated Cycles:         2508328|N/A             (*********)

The description in the headline contains open_file("path/to/file"), your setup function open_file with the value of the parameter it is called with.

If you need to specify the same setup function for all (or almost all) #[bench] and #[benches] in a #[library_benchmark] you can use the setup parameter of the #[library_benchmark]:

extern crate iai_callgrind;
mod my_lib { pub fn count_bytes_fast(_file: std::fs::File) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;
use std::fs::File;
use std::io::{Seek, SeekFrom};

fn open_file(path: &str) -> File {
    File::open(path).unwrap()
}

fn open_file_with_offset(path: &str, offset: u64) -> File {
    let mut file = File::open(path).unwrap();
    file.seek(SeekFrom::Start(offset)).unwrap();
    file
}

#[library_benchmark(setup = open_file)]
#[bench::small("path/to/small")]
#[bench::big("path/to/big")]
#[bench::with_offset(args = ("path/to/big", 100), setup = open_file_with_offset)]
fn count_bytes_fast(file: File) -> u64 {
    black_box(my_lib::count_bytes_fast(file))
}

library_benchmark_group!(name = my_group; benchmarks = count_bytes_fast);
fn main() {
main!(library_benchmark_groups = my_group);
}

The above will use the open_file function in the small and big benchmarks and the open_file_with_offset function in the with_offset benchmark.

Teardown

What about teardown, and why should you use it? Usually the teardown isn't needed, but if, for example, you intend to make the result of the benchmark visible in the benchmark output, the teardown is the perfect place to do so.

The teardown function takes the return value of the benchmark function as its argument:

extern crate iai_callgrind;
mod my_lib { pub fn count_bytes_fast(_file: std::fs::File) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;
use std::fs::File;

fn open_file(path: &str) -> File {
    File::open(path).unwrap()
}

fn print_bytes_read(num_bytes: u64) {
    println!("bytes read: {num_bytes}");
}

#[library_benchmark]
#[bench::first(
    args = ("path/to/big"),
    setup = open_file,
    teardown = print_bytes_read
)]
fn count_bytes_fast(file: File) -> u64 {
    black_box(my_lib::count_bytes_fast(file))
}

library_benchmark_group!(name = my_group; benchmarks = count_bytes_fast);
fn main() {
main!(library_benchmark_groups = my_group);
}

Note that Iai-Callgrind captures all output by default. In order to actually see the output of the benchmark, setup and teardown functions, the benchmarks need to be run with the --nocapture flag or with the environment variable IAI_CALLGRIND_NOCAPTURE=true set. Let's assume the above benchmark is in a file benches/my_benchmark.rs; then running

IAI_CALLGRIND_NOCAPTURE=true cargo bench

results in output like the below

my_benchmark::my_group::count_bytes_fast first:open_file("path/to/big")
bytes read: 25078
- end of stdout/stderr
  Instructions:             1630162|N/A             (*********)
  L1 Hits:                  2507931|N/A             (*********)
  L2 Hits:                        2|N/A             (*********)
  RAM Hits:                      13|N/A             (*********)
  Total read+write:         2507946|N/A             (*********)
  Estimated Cycles:         2508396|N/A             (*********)

The output of the teardown function is now visible in the benchmark output above the - end of stdout/stderr line.

Specifying multiple benches at once

Multiple benches can be specified at once with the #[benches] attribute.

The #[benches] attribute in more detail

Let's start with an example:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;
use my_lib::bubble_sort;

fn setup_worst_case_array(start: i32) -> Vec<i32> {
    if start.is_negative() {
        (start..0).rev().collect()
    } else {
        (0..start).rev().collect()
    }
}

#[library_benchmark]
#[benches::multiple(vec![1], vec![5])]
#[benches::with_setup(args = [1, 5], setup = setup_worst_case_array)]
fn bench_bubble_sort_with_benches_attribute(input: Vec<i32>) -> Vec<i32> {
    black_box(bubble_sort(input))
}

library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort_with_benches_attribute);
fn main () {
main!(library_benchmark_groups = my_group);
}

Usually, the arguments are passed directly to the benchmark function, as can be seen in the #[benches::multiple(/* arguments */)] case. In #[benches::with_setup(/* ... */)], the arguments are passed to the setup function instead. The above #[library_benchmark] is pretty much the same as

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;
use my_lib::bubble_sort;

fn setup_worst_case_array(start: i32) -> Vec<i32> {
    if start.is_negative() {
        (start..0).rev().collect()
    } else {
        (0..start).rev().collect()
    }
}

#[library_benchmark]
#[bench::multiple_0(vec![1])]
#[bench::multiple_1(vec![5])]
#[bench::with_setup_0(setup_worst_case_array(1))]
#[bench::with_setup_1(setup_worst_case_array(5))]
fn bench_bubble_sort_with_benches_attribute(input: Vec<i32>) -> Vec<i32> {
    black_box(bubble_sort(input))
}

library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort_with_benches_attribute);
fn main () {
main!(library_benchmark_groups = my_group);
}

but a lot more concise, especially if a lot of values are passed to the same setup function.

The file parameter

Reading inputs from a file allows, for example, sharing the same inputs between different benchmarking frameworks like criterion. Or, if you simply have a long list of inputs, you might find it more convenient to read them from a file.

The file parameter, exclusive to the #[benches] attribute, does exactly that: it reads the specified file line by line, creating a benchmark from each line. Each line is passed to the benchmark function as a String or, if the setup parameter is also present, to the setup function. A small example, assuming you have a file benches/inputs (relative paths are interpreted relative to the workspace root) with the following content

1
11
111

then

extern crate iai_callgrind;
mod my_lib { pub fn string_to_u64(value: String) -> Result<u64, String> { Ok(1) } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

#[library_benchmark]
#[benches::from_file(file = "benches/inputs")]
fn some_bench(line: String) -> Result<u64, String> {
    black_box(my_lib::string_to_u64(line))
}

library_benchmark_group!(name = my_group; benchmarks = some_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

The above is roughly equivalent to the following, but with the args parameter:

extern crate iai_callgrind;
mod my_lib { pub fn string_to_u64(value: String) -> Result<u64, String> { Ok(1) } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

#[library_benchmark]
#[benches::from_args(args = [1.to_string(), 11.to_string(), 111.to_string()])]
fn some_bench(line: String) -> Result<u64, String> {
    black_box(my_lib::string_to_u64(line))
}

library_benchmark_group!(name = my_group; benchmarks = some_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

The true power of the file parameter comes in combination with the setup function, because you can format the lines in the file as you like and convert each line in the setup function to the format you need in the benchmark. For example, suppose you decided to go with a CSV-like format in the file benches/inputs

255;255;255
0;0;0

and your library has a function which converts from RGB to HSV color space:

extern crate iai_callgrind;
mod my_lib { pub fn rgb_to_hsv(a: u8, b: u8, c:u8) -> (u16, u8, u8) { (a.into(), b, c) } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

fn decode_line(line: String) -> (u8, u8, u8) {
    if let &[a, b, c] = line.split(";")
        .map(|s| s.parse::<u8>().unwrap())
        .collect::<Vec<u8>>()
        .as_slice() 
    {
        (a, b, c)
    } else {
        panic!("Wrong input format in line '{line}'");
    }
}

#[library_benchmark]
#[benches::from_file(file = "benches/inputs", setup = decode_line)]
fn some_bench((a, b, c): (u8, u8, u8)) -> (u16, u8, u8) {
    black_box(my_lib::rgb_to_hsv(black_box(a), black_box(b), black_box(c)))
}

library_benchmark_group!(name = my_group; benchmarks = some_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

Generic benchmark functions

Benchmark functions can be generic, and so can setup and teardown functions. There's actually not much more to say about it, since generic benchmark (setup and teardown) functions behave exactly the same way as you would expect from any other generic function.

However, there is a common pitfall. Suppose you have a function count_lines_in_file_fast which expects a PathBuf as parameter. Although the following is convenient, especially when you have to specify many paths, don't do this:

extern crate iai_callgrind;
mod my_lib { pub fn count_lines_in_file_fast(_path: std::path::PathBuf) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;

#[library_benchmark]
#[bench::first("path/to/file")]
fn generic_bench<T>(path: T) -> u64 where T: Into<PathBuf> {
    black_box(my_lib::count_lines_in_file_fast(black_box(path.into())))
}

library_benchmark_group!(name = my_group; benchmarks = generic_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

Since path.into() is called in the benchmark function itself, the conversion from a &str to a PathBuf is attributed to the benchmark metrics. This is almost never what you intend. You should instead convert the argument to a PathBuf in a generic setup function, like this:

extern crate iai_callgrind;
mod my_lib { pub fn count_lines_in_file_fast(_path: std::path::PathBuf) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;

fn convert_to_pathbuf<T>(path: T) -> PathBuf where T: Into<PathBuf> {
    path.into()
}

#[library_benchmark]
#[bench::first(args = ("path/to/file"), setup = convert_to_pathbuf)]
fn not_generic_anymore(path: PathBuf) -> u64 {
    black_box(my_lib::count_lines_in_file_fast(path))
}

library_benchmark_group!(name = my_group; benchmarks = not_generic_anymore);
fn main() {
main!(library_benchmark_groups = my_group);
}

That way you can still enjoy the convenience of using string literals instead of PathBuf in your #[bench] (or #[benches]) arguments and still get clean benchmark metrics.

Comparing benchmark functions

Comparing benchmark functions is supported via the optional library_benchmark_group! argument compare_by_id (the default value for compare_by_id is false). Only benches with the same id are compared, which allows you to single out cases which don't need to be compared. In the following example, the case_3 and multiple benches are compared with each other in addition to the usual comparison with the previous run:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

#[library_benchmark]
#[bench::case_3(vec![1, 2, 3])]
#[benches::multiple(args = [vec![1, 2], vec![1, 2, 3, 4]])]
fn bench_bubble_sort_best_case(input: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(input))
}

#[library_benchmark]
#[bench::case_3(vec![3, 2, 1])]
#[benches::multiple(args = [vec![2, 1], vec![4, 3, 2, 1]])]
fn bench_bubble_sort_worst_case(input: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(input))
}

library_benchmark_group!(
    name = bench_bubble_sort;
    compare_by_id = true;
    benchmarks = bench_bubble_sort_best_case, bench_bubble_sort_worst_case
);

fn main() {
main!(library_benchmark_groups = bench_bubble_sort);
}

Note that if compare_by_id is true, all benchmark functions are compared with each other, so you are not limited to two benchmark functions per comparison group.

Here's the benchmark output of the above example to see what is happening:

my_benchmark::bubble_sort_group::bubble_sort_best_case case_2:vec! [1, 2]
  Instructions:                  63|N/A             (*********)
  L1 Hits:                       86|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:              91|N/A             (*********)
  Estimated Cycles:             231|N/A             (*********)
my_benchmark::bubble_sort_group::bubble_sort_best_case multiple_0:vec! [1, 2, 3]
  Instructions:                  94|N/A             (*********)
  L1 Hits:                      123|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:             128|N/A             (*********)
  Estimated Cycles:             268|N/A             (*********)
my_benchmark::bubble_sort_group::bubble_sort_best_case multiple_1:vec! [1, 2, 3, 4]
  Instructions:                 136|N/A             (*********)
  L1 Hits:                      174|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:             179|N/A             (*********)
  Estimated Cycles:             319|N/A             (*********)
my_benchmark::bubble_sort_group::bubble_sort_worst_case case_2:vec! [2, 1]
  Instructions:                  66|N/A             (*********)
  L1 Hits:                       91|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:              96|N/A             (*********)
  Estimated Cycles:             236|N/A             (*********)
  Comparison with bubble_sort_best_case case_2:vec! [1, 2]
  Instructions:                  63|66              (-4.54545%) [-1.04762x]
  L1 Hits:                       86|91              (-5.49451%) [-1.05814x]
  L2 Hits:                        1|1               (No change)
  RAM Hits:                       4|4               (No change)
  Total read+write:              91|96              (-5.20833%) [-1.05495x]
  Estimated Cycles:             231|236             (-2.11864%) [-1.02165x]
my_benchmark::bubble_sort_group::bubble_sort_worst_case multiple_0:vec! [3, 2, 1]
  Instructions:                 103|N/A             (*********)
  L1 Hits:                      138|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:             143|N/A             (*********)
  Estimated Cycles:             283|N/A             (*********)
  Comparison with bubble_sort_best_case multiple_0:vec! [1, 2, 3]
  Instructions:                  94|103             (-8.73786%) [-1.09574x]
  L1 Hits:                      123|138             (-10.8696%) [-1.12195x]
  L2 Hits:                        1|1               (No change)
  RAM Hits:                       4|4               (No change)
  Total read+write:             128|143             (-10.4895%) [-1.11719x]
  Estimated Cycles:             268|283             (-5.30035%) [-1.05597x]
my_benchmark::bubble_sort_group::bubble_sort_worst_case multiple_1:vec! [4, 3, 2, 1]
  Instructions:                 154|N/A             (*********)
  L1 Hits:                      204|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:             209|N/A             (*********)
  Estimated Cycles:             349|N/A             (*********)
  Comparison with bubble_sort_best_case multiple_1:vec! [1, 2, 3, 4]
  Instructions:                 136|154             (-11.6883%) [-1.13235x]
  L1 Hits:                      174|204             (-14.7059%) [-1.17241x]
  L2 Hits:                        1|1               (No change)
  RAM Hits:                       4|4               (No change)
  Total read+write:             179|209             (-14.3541%) [-1.16760x]
  Estimated Cycles:             319|349             (-8.59599%) [-1.09404x]

The procedure of the comparison algorithm:

  1. Run all benches in the first benchmark function
  2. Run the first bench in the second benchmark function and if there is a bench in the first benchmark function with the same id compare them
  3. Run the second bench in the second benchmark function ...
  4. ...
  5. Run the first bench in the third benchmark function and if there is a bench in the first benchmark function with the same id compare them. If there is a bench with the same id in the second benchmark function compare them.
  6. Run the second bench in the third benchmark function ...
  7. and so on ... until all benches are compared with each other

Neither the order nor the number of benches within the benchmark functions matters, so it is not strictly necessary to mirror the bench ids of the first benchmark function in the second, third, etc. benchmark function.

Configuration

Library benchmarks can be configured with the LibraryBenchmarkConfig and with Command-line arguments and Environment variables.

The LibraryBenchmarkConfig can be specified at different levels and sets the configuration values for the same and lower levels. The values of the LibraryBenchmarkConfig at higher levels can be overridden at a lower level. Note that some values are additive rather than substitutive. Please see the docs of the respective functions in LibraryBenchmarkConfig for more details.

The different levels where a LibraryBenchmarkConfig can be specified are:

  • At top-level with the main! macro
extern crate iai_callgrind;
use iai_callgrind::{library_benchmark, library_benchmark_group};
use iai_callgrind::{main, LibraryBenchmarkConfig};

#[library_benchmark] fn bench() {}
library_benchmark_group!(name = my_group; benchmarks = bench);
fn main() {
main!(
    config = LibraryBenchmarkConfig::default();
    library_benchmark_groups = my_group
);
}
  • At group-level in the library_benchmark_group! macro
extern crate iai_callgrind;
use iai_callgrind::library_benchmark;
use iai_callgrind::{main, LibraryBenchmarkConfig, library_benchmark_group};

#[library_benchmark] fn bench() {}
library_benchmark_group!(
    name = my_group;
    config = LibraryBenchmarkConfig::default();
    benchmarks = bench
);

fn main() {
main!(library_benchmark_groups = my_group);
}
  • At #[library_benchmark] level
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    main, LibraryBenchmarkConfig, library_benchmark_group, library_benchmark
};
use std::hint::black_box;

#[library_benchmark(config = LibraryBenchmarkConfig::default())] 
fn bench() {
    /* ... */
}

library_benchmark_group!(
    name = my_group;
    config = LibraryBenchmarkConfig::default();
    benchmarks = bench
);

fn main() {
main!(library_benchmark_groups = my_group);
}
  • and at #[bench], #[benches] level
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    main, LibraryBenchmarkConfig, library_benchmark_group, library_benchmark
};
use std::hint::black_box;

#[library_benchmark] 
#[bench::some_id(args = (1, 2), config = LibraryBenchmarkConfig::default())]
#[benches::multiple(
    args = [(3, 4), (5, 6)], 
    config = LibraryBenchmarkConfig::default()
)]
fn bench(a: u8, b: u8) {
    /* ... */
    _ = (a, b);
}

library_benchmark_group!(
    name = my_group;
    config = LibraryBenchmarkConfig::default();
    benchmarks = bench
);

fn main() {
main!(library_benchmark_groups = my_group);
}

Custom entry points

The EntryPoint can be set to EntryPoint::None, which disables the entry point, EntryPoint::Default, which uses the benchmark function as the entry point, or EntryPoint::Custom, which will be discussed in more detail in this chapter.

To understand custom entry points let's take a small detour into how Callgrind and Iai-Callgrind work under the hood.

Iai-Callgrind under the hood

Callgrind collects metrics and associates them with a function. This happens based on the compiled code, not the source code, so it is possible to hook into any function, not only public functions. Callgrind can be configured to switch collection on and off based on a function name with --toggle-collect. By default, Iai-Callgrind sets this toggle (which we call the EntryPoint) to the benchmark function. Setting the toggle implies --collect-atstart=no, so all events before the benchmark function (in the setup) and after it (in the teardown) are not collected. Somewhat simplified, but conveying the basic idea, here is a commented example:

// <-- collect-at-start=no

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{main,library_benchmark_group, library_benchmark};
use std::hint::black_box;

#[library_benchmark]
fn bench() -> Vec<i32> { // <-- DEFAULT ENTRY POINT starts collecting events
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
} // <-- stop collecting events

library_benchmark_group!( name = my_group; benchmarks = bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

Pitfall: Inlined functions

The fact that Callgrind acts on the compiled code harbors a pitfall. With compile-time optimizations switched on (which is usually the case when compiling benchmarks), the compiler inlines functions if it sees an advantage in doing so. Iai-Callgrind takes care that this doesn't happen with the benchmark function, so Callgrind can find and hook into it. But in your production code you usually don't want to stop the compiler from doing its job just to be able to benchmark a function. So, be cautious with benchmarking private functions and only choose functions which are known not to be inlined.

Hook into private functions

The basic idea is to choose a public function in your library acting as access point to the actual function you want to benchmark. As outlined before, this only works reliably for functions which are not inlined by the compiler.

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

mod my_lib {
     #[inline(never)]
     fn bubble_sort(input: Vec<i32>) -> Vec<i32> {
         // The algorithm
       input
     }

     pub fn access_point(input: Vec<i32>) -> Vec<i32> {
         println!("Doing something before the function call");
         bubble_sort(input)
     }
}

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::Custom("*::my_lib::bubble_sort".to_owned()))
)]
#[bench::small(vec![3, 2, 1])]
#[bench::bigger(vec![5, 4, 3, 2, 1])]
fn bench_private(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::access_point(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_private);
fn main() {
main!(library_benchmark_groups = my_group);
}

Note the #[inline(never)] we use in this example to make sure the bubble_sort function is not getting inlined.

We use the wildcard pattern *::my_lib::bubble_sort for EntryPoint::Custom for demonstration purposes. You might want to tighten this pattern. If you don't know what the pattern looks like, use EntryPoint::None first and then run the benchmark. Now, investigate the callgrind output file. This output file is pretty low-level, but all you need to do is search for the entries which start with fn=.... In the example above this entry might look like fn=algorithms::my_lib::bubble_sort if my_lib were part of a top-level algorithms module. Or, using grep:

grep '^fn=.*::bubble_sort$' target/iai/the_package/benchmark_file_name/my_group/bench_private.bigger/callgrind.bench_private.bigger.out

Having found the pattern, you can eventually use EntryPoint::Custom.
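
For the investigation step, temporarily disabling the entry point is a small change to the example above. A sketch:

extern crate iai_callgrind;
mod my_lib { pub fn access_point(input: Vec<i32>) -> Vec<i32> { input } }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

// Run once with `EntryPoint::None`, inspect the `fn=...` entries in the
// callgrind output file and then switch back to `EntryPoint::Custom`.
#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
)]
#[bench::small(vec![3, 2, 1])]
fn bench_private(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::access_point(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_private);
fn main() {
main!(library_benchmark_groups = my_group);
}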

Multi-threaded and multi-process applications

The default is to run Iai-Callgrind benchmarks with --separate-threads=yes and --trace-children=yes switched on. This enables Iai-Callgrind to trace threads and subprocesses, respectively. Note that --separate-threads=yes is not strictly necessary to be able to trace threads, but if the threads are separated, Iai-Callgrind can collect and display the metrics for each thread. Due to the way Callgrind applies data collection options like --toggle-collect, --collect-atstart, ..., further configuration is needed in library benchmarks.

To actually see the collected metrics in the terminal output for all threads and/or subprocesses you can switch on OutputFormat::show_intermediate:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: u64) -> Vec<u64> { vec![]} }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    OutputFormat
};
use std::hint::black_box;

#[library_benchmark]
fn bench_threads() -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(2))
}

library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
            .show_intermediate(true)
        );
    library_benchmark_groups = my_group
);
}

The best method for benchmarking threads and subprocesses depends heavily on your code. So, rather than suggesting a single "best" method, this chapter will run through various possible approaches and try to highlight the pros and cons of each.

Multi-threaded applications

Callgrind treats each thread and process as a separate unit and applies the data collection options to each unit. In library benchmarks, the entry point (or default toggle) for Callgrind is by default set to the benchmark function with the help of the --toggle-collect option. Setting --toggle-collect also automatically sets --collect-atstart=no. If not further customized for a benchmarked multi-threaded function, these options cause the metrics for the spawned threads to be zero. This happens since each thread is a separate unit with --collect-atstart=no and the default toggle applied to it; the default toggle is set to the benchmark function and does not hook into any function in the thread, so the metrics are zero.

There are multiple ways to customize the default behaviour and actually measure the threads. For the following examples, we're using the benchmark and library code below to show the different customization options, assuming this code lives in a benchmark file benches/lib_bench_threads.rs:

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    OutputFormat
};
use std::hint::black_box;

/// Suppose this is your library
pub mod my_lib {
    /// Return true if `num` is a prime number
    pub fn is_prime(num: u64) -> bool {
        if num <= 1 {
            return false;
        }

        for i in 2..=(num as f64).sqrt() as u64 {
            if num % i == 0 {
                return false;
            }
        }

        true
    }

    /// Find and return all prime numbers in the inclusive range `low` to `high`
    pub fn find_primes(low: u64, high: u64) -> Vec<u64> {
        (low..=high).filter(|n| is_prime(*n)).collect()
    }

    /// Return the prime numbers in the range `0..(num_threads * 10000)`
    pub fn find_primes_multi_thread(num_threads: usize) -> Vec<u64> {
        let mut handles = vec![];
        let mut low = 0;
        for _ in 0..num_threads {
            let handle = std::thread::spawn(move || find_primes(low, low + 10000));
            handles.push(handle);

            low += 10000;
        }

        let mut primes = vec![];
        for handle in handles {
            let result = handle.join();
            primes.extend(result.unwrap())
        }

        primes
    }
}

#[library_benchmark]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}

library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
            .show_intermediate(true)
        );
    library_benchmark_groups = my_group
);
}

Running this benchmark with cargo bench will present you with the following terminal output:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2097219 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27305|N/A                  (*********)
  L1 Hits:                            66353|N/A                  (*********)
  L2 Hits:                              341|N/A                  (*********)
  RAM Hits:                             539|N/A                  (*********)
  Total read+write:                   67233|N/A                  (*********)
  Estimated Cycles:                   86923|N/A                  (*********)
  ## pid: 2097219 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## pid: 2097219 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## Total
  Instructions:                       27305|N/A                  (*********)
  L1 Hits:                            66353|N/A                  (*********)
  L2 Hits:                              341|N/A                  (*********)
  RAM Hits:                             539|N/A                  (*********)
  Total read+write:                   67233|N/A                  (*********)
  Estimated Cycles:                   86923|N/A                  (*********)

As you can see, the counts for threads 2 and 3 (our spawned threads) are all zero.

Measuring threads using toggles

At first glance, setting a toggle for the function running in the thread seems to be the easiest way and can be done like so:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }}
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--toggle-collect=lib_bench_threads::my_lib::find_primes"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}
library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(library_benchmark_groups = my_group);
}

This approach may or may not work, depending on whether the compiler inlines the target function of the --toggle-collect argument or not. This is the same problem as with custom entry points. As can be seen below, the compiler has chosen to inline find_primes and the metrics for the threads are still zero:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2620776 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27372|N/A                  (*********)
  L1 Hits:                            66431|N/A                  (*********)
  L2 Hits:                              343|N/A                  (*********)
  RAM Hits:                             538|N/A                  (*********)
  Total read+write:                   67312|N/A                  (*********)
  Estimated Cycles:                   86976|N/A                  (*********)
  ## pid: 2620776 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## pid: 2620776 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## Total
  Instructions:                       27372|N/A                  (*********)
  L1 Hits:                            66431|N/A                  (*********)
  L2 Hits:                              343|N/A                  (*********)
  RAM Hits:                             538|N/A                  (*********)
  Total read+write:                   67312|N/A                  (*********)
  Estimated Cycles:                   86976|N/A                  (*********)

Just to show what would happen if the compiler did not inline the find_primes function, we temporarily annotate it with #[inline(never)]:

#![allow(unused)]
fn main() {
fn is_prime(_: u64) -> bool { true }
/// Find and return all prime numbers in the inclusive range `low` to `high`
#[inline(never)]
pub fn find_primes(low: u64, high: u64) -> Vec<u64> {
    (low..=high).filter(|n| is_prime(*n)).collect()
}
}

Now, running the benchmark does show the desired metrics:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2661917 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27372|N/A                  (*********)
  L1 Hits:                            66431|N/A                  (*********)
  L2 Hits:                              343|N/A                  (*********)
  RAM Hits:                             538|N/A                  (*********)
  Total read+write:                   67312|N/A                  (*********)
  Estimated Cycles:                   86976|N/A                  (*********)
  ## pid: 2661917 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2460503|N/A                  (*********)
  L1 Hits:                          2534938|N/A                  (*********)
  L2 Hits:                               12|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                 2535136|N/A                  (*********)
  Estimated Cycles:                 2541508|N/A                  (*********)
  ## pid: 2661917 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3650410|N/A                  (*********)
  L1 Hits:                          3724286|N/A                  (*********)
  L2 Hits:                                4|N/A                  (*********)
  RAM Hits:                             130|N/A                  (*********)
  Total read+write:                 3724420|N/A                  (*********)
  Estimated Cycles:                 3728856|N/A                  (*********)
  ## Total
  Instructions:                     6138285|N/A                  (*********)
  L1 Hits:                          6325655|N/A                  (*********)
  L2 Hits:                              359|N/A                  (*********)
  RAM Hits:                             854|N/A                  (*********)
  Total read+write:                 6326868|N/A                  (*********)
  Estimated Cycles:                 6357340|N/A                  (*********)

But annotating functions with #[inline(never)] in production code is usually not an option, and preventing the compiler from doing its job is not the preferred way to make a benchmark work. The truth is, there is no way to make the --toggle-collect argument work in all cases; whether it works depends heavily on the inlining choices the compiler makes for your code.

Another way to get the thread metrics is to set --collect-atstart=yes and turn off the EntryPoint:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }}
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
        .callgrind_args(["--collect-atstart=yes"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}
library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(library_benchmark_groups = my_group);
}

But, the metrics of the main thread will include all the setup (and teardown) code from the benchmark executable (so the instructions of the main thread go up from 27372 to 404425):

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2697019 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                      404425|N/A                  (*********)
  L1 Hits:                           570186|N/A                  (*********)
  L2 Hits:                             1307|N/A                  (*********)
  RAM Hits:                            4856|N/A                  (*********)
  Total read+write:                  576349|N/A                  (*********)
  Estimated Cycles:                  746681|N/A                  (*********)
  ## pid: 2697019 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2466864|N/A                  (*********)
  L1 Hits:                          2543314|N/A                  (*********)
  L2 Hits:                               81|N/A                  (*********)
  RAM Hits:                             409|N/A                  (*********)
  Total read+write:                 2543804|N/A                  (*********)
  Estimated Cycles:                 2558034|N/A                  (*********)
  ## pid: 2697019 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3656729|N/A                  (*********)
  L1 Hits:                          3732802|N/A                  (*********)
  L2 Hits:                               31|N/A                  (*********)
  RAM Hits:                             201|N/A                  (*********)
  Total read+write:                 3733034|N/A                  (*********)
  Estimated Cycles:                 3739992|N/A                  (*********)
  ## Total
  Instructions:                     6528018|N/A                  (*********)
  L1 Hits:                          6846302|N/A                  (*********)
  L2 Hits:                             1419|N/A                  (*********)
  RAM Hits:                            5466|N/A                  (*********)
  Total read+write:                 6853187|N/A                  (*********)
  Estimated Cycles:                 7044707|N/A                  (*********)

Additionally, expect a lot of metric changes whenever the benchmarks themselves are changed. However, if the metrics of the main thread are insignificant compared to the total, this might still be an acceptable last resort.

A more reliable way is shown in the next section.

Measuring threads using client requests

Perhaps the most reliable and flexible way to measure threads is to use client requests. The downside is that you have to put some benchmark code into your production code. But if you followed the installation instructions for client requests, this additional code is only compiled in benchmarks, not in your final production-ready library.

Using the callgrind client request, we adjust the threads in the find_primes_multi_thread function like so:

#![allow(unused)]
fn main() {
fn find_primes(_a: u64, _b: u64) -> Vec<u64> { vec![] }
extern crate iai_callgrind;
use iai_callgrind::client_requests::callgrind;

/// Return the prime numbers in the range `0..(num_threads * 10000)`
pub fn find_primes_multi_thread(num_threads: usize) -> Vec<u64> {
    let mut handles = vec![];
    let mut low = 0;
    for _ in 0..num_threads {
        let handle = std::thread::spawn(move || {
            callgrind::toggle_collect();
            let result = find_primes(low, low + 10000);
            callgrind::toggle_collect();
            result
        });
        handles.push(handle);

        low += 10000;
    }

    let mut primes = vec![];
    for handle in handles {
        let result = handle.join();
        primes.extend(result.unwrap())
    }

    primes
}
}

and running the same benchmark now will show the collected metrics of the threads:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2149242 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27305|N/A                  (*********)
  L1 Hits:                            66352|N/A                  (*********)
  L2 Hits:                              344|N/A                  (*********)
  RAM Hits:                             537|N/A                  (*********)
  Total read+write:                   67233|N/A                  (*********)
  Estimated Cycles:                   86867|N/A                  (*********)
  ## pid: 2149242 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2460501|N/A                  (*********)
  L1 Hits:                          2534935|N/A                  (*********)
  L2 Hits:                               13|N/A                  (*********)
  RAM Hits:                             185|N/A                  (*********)
  Total read+write:                 2535133|N/A                  (*********)
  Estimated Cycles:                 2541475|N/A                  (*********)
  ## pid: 2149242 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3650408|N/A                  (*********)
  L1 Hits:                          3724285|N/A                  (*********)
  L2 Hits:                                1|N/A                  (*********)
  RAM Hits:                             131|N/A                  (*********)
  Total read+write:                 3724417|N/A                  (*********)
  Estimated Cycles:                 3728875|N/A                  (*********)
  ## Total
  Instructions:                     6138214|N/A                  (*********)
  L1 Hits:                          6325572|N/A                  (*********)
  L2 Hits:                              358|N/A                  (*********)
  RAM Hits:                             853|N/A                  (*********)
  Total read+write:                 6326783|N/A                  (*********)
  Estimated Cycles:                 6357217|N/A                  (*********)

Using the client request toggles is very flexible since you can put the iai_callgrind::client_requests::callgrind::toggle_collect instructions anywhere in the threads. In this example, we just have a single function in the thread, but if your threads consist of more than just a single function, you can easily exclude uninteresting parts from the final measurements.
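
For example, if a thread first builds up some input data that should not be measured, only the interesting part can be wrapped in the toggles. This is just a sketch with placeholder functions (prepare_input, compute), not taken from the example above:

#![allow(unused)]
fn main() {
extern crate iai_callgrind;
use iai_callgrind::client_requests::callgrind;

fn prepare_input() -> Vec<u64> { (0..10_000).collect() }
fn compute(input: &[u64]) -> u64 { input.iter().sum() }

fn spawn_worker() -> std::thread::JoinHandle<u64> {
    std::thread::spawn(|| {
        // Not measured: the setup part of the thread
        let input = prepare_input();

        // Only the interesting computation is wrapped in the toggles
        callgrind::toggle_collect();
        let result = compute(&input);
        callgrind::toggle_collect();

        result
    })
}
}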

If you want to prevent the code of the main thread from being measured, you can use the following:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }}
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
        .callgrind_args(["--collect-atstart=no"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}
library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(library_benchmark_groups = my_group);
}

Setting EntryPoint::None disables the default toggle, but it also disables the automatically applied --collect-atstart=no, which is why we have to set that option manually. Altogether, running the benchmark will show:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2251257 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## pid: 2251257 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2460501|N/A                  (*********)
  L1 Hits:                          2534935|N/A                  (*********)
  L2 Hits:                               11|N/A                  (*********)
  RAM Hits:                             187|N/A                  (*********)
  Total read+write:                 2535133|N/A                  (*********)
  Estimated Cycles:                 2541535|N/A                  (*********)
  ## pid: 2251257 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3650408|N/A                  (*********)
  L1 Hits:                          3724282|N/A                  (*********)
  L2 Hits:                                4|N/A                  (*********)
  RAM Hits:                             131|N/A                  (*********)
  Total read+write:                 3724417|N/A                  (*********)
  Estimated Cycles:                 3728887|N/A                  (*********)
  ## Total
  Instructions:                     6110909|N/A                  (*********)
  L1 Hits:                          6259217|N/A                  (*********)
  L2 Hits:                               15|N/A                  (*********)
  RAM Hits:                             318|N/A                  (*********)
  Total read+write:                 6259550|N/A                  (*********)
  Estimated Cycles:                 6270422|N/A                  (*********)

Multi-process applications

Measuring multi-process applications is in principle not that different from measuring multi-threaded applications, since subprocesses, just like threads, are separate units. As for threads, the data collection options are applied to subprocesses separately from the main process.

Note that there are multiple valgrind command-line arguments which can disable the collection of metrics for uninteresting subprocesses, for example subprocesses that are spawned by your library function but are not part of your library/binary crate.
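
For example, Valgrind's --trace-children-skip option takes a comma-separated list of patterns and skips tracing of executables matching one of them. The following is a minimal sketch (my_lib::run_external_tools is a placeholder for a function which shells out to programs we don't want to measure):

extern crate iai_callgrind;
mod my_lib { pub fn run_external_tools() {} }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        // Don't trace `/bin/sh` and anything installed under `/usr/bin`
        .callgrind_args(["--trace-children-skip=/bin/sh,/usr/bin/*"])
)]
fn bench_with_subprocesses() {
    black_box(my_lib::run_external_tools())
}

library_benchmark_group!(name = my_group; benchmarks = bench_with_subprocesses);
fn main() {
main!(library_benchmark_groups = my_group);
}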

For the following examples, suppose the code below is the cat binary and part of a crate (so we can use env!("CARGO_BIN_EXE_cat")):

use std::fs::File;
use std::io::{copy, stdout, BufReader, BufWriter, Write};

fn main() {
fn main() {
    let mut args_iter = std::env::args().skip(1);
    let file_arg = args_iter.next().expect("File argument should be present");

    let file = File::open(file_arg).expect("Opening file should succeed");
    let stdout = stdout().lock();

    let mut writer = BufWriter::new(stdout);
    copy(&mut BufReader::new(file), &mut writer)
        .expect("Printing file to stdout should succeed");

    writer.flush().expect("Flushing writer should succeed");
}
}

The above binary is a very simple version of cat, taking a single file argument. The file content is read and dumped to stdout. The following is the benchmark and library code used to show the different options, assuming this code is stored in the benchmark file benches/lib_bench_subprocess.rs:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;

use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    OutputFormat,
};

/// Suppose this is your library
pub mod my_lib {
    use std::io;
    use std::path::Path;
    use std::process::ExitStatus;

    /// A function executing the crate's binary `cat`
    pub fn cat(file: &Path) -> io::Result<ExitStatus> {
        std::process::Command::new(env!("CARGO_BIN_EXE_cat"))
            .arg(file)
            .status()
    }
}

/// Create a file `/tmp/foo.txt` with some content
fn create_file() -> PathBuf {
    let path = PathBuf::from("/tmp/foo.txt");
    std::fs::write(&path, "some content").unwrap();
    path
}

#[library_benchmark]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}

library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);
fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
            .show_intermediate(true)
        );
    library_benchmark_groups = my_group
);
}

Running the above benchmark with cargo bench results in the following terminal output:

lib_bench_subprocess::my_group::bench_subprocess some:create_file()
  ## pid: 3141785 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
  Instructions:                        4467|N/A                  (*********)
  L1 Hits:                             6102|N/A                  (*********)
  L2 Hits:                               17|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                    6305|N/A                  (*********)
  Estimated Cycles:                   12697|N/A                  (*********)
  ## pid: 3141786 thread: 1 part: 1        |N/A
  Command:             target/release/cat /tmp/foo.txt
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## Total
  Instructions:                        4467|N/A                  (*********)
  L1 Hits:                             6102|N/A                  (*********)
  L2 Hits:                               17|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                    6305|N/A                  (*********)
  Estimated Cycles:                   12697|N/A                  (*********)

As expected, the cat subprocess is not measured and its metrics are zero, for the same reasons as in the initial thread measurement.

Measuring subprocesses using toggles

The great advantage over measuring threads is that each process has a main function which is not inlined by the compiler and can serve as a reliable hook for the --toggle-collect argument, so the following adaptation of the above benchmark will just work:

extern crate iai_callgrind;
mod my_lib {
use std::{io, path::Path, process::ExitStatus};
pub fn cat(_: &Path) -> io::Result<ExitStatus> {
   std::process::Command::new("some").status()
}}
fn create_file() -> PathBuf { PathBuf::from("some") }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;
use iai_callgrind::{
   library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
   OutputFormat,
};
#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--toggle-collect=cat::main"])
)]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}
library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);
fn main() {
main!(library_benchmark_groups = my_group);
}

producing the desired output:

lib_bench_subprocess::my_group::bench_subprocess some:create_file()
  ## pid: 3324117 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
  Instructions:                        4475|N/A                  (*********)
  L1 Hits:                             6112|N/A                  (*********)
  L2 Hits:                               14|N/A                  (*********)
  RAM Hits:                             187|N/A                  (*********)
  Total read+write:                    6313|N/A                  (*********)
  Estimated Cycles:                   12727|N/A                  (*********)
  ## pid: 3324119 thread: 1 part: 1        |N/A
  Command:             target/release/cat /tmp/foo.txt
  Instructions:                        4019|N/A                  (*********)
  L1 Hits:                             5575|N/A                  (*********)
  L2 Hits:                               12|N/A                  (*********)
  RAM Hits:                             167|N/A                  (*********)
  Total read+write:                    5754|N/A                  (*********)
  Estimated Cycles:                   11480|N/A                  (*********)
  ## Total
  Instructions:                        8494|N/A                  (*********)
  L1 Hits:                            11687|N/A                  (*********)
  L2 Hits:                               26|N/A                  (*********)
  RAM Hits:                             354|N/A                  (*********)
  Total read+write:                   12067|N/A                  (*********)
  Estimated Cycles:                   24207|N/A                  (*********)

Measuring subprocesses using client requests

Naturally, client requests can also be used to measure subprocesses. The callgrind client requests are added to the code of the cat binary:

extern crate iai_callgrind;
use std::fs::File;
use std::io::{copy, stdout, BufReader, BufWriter, Write};
use iai_callgrind::client_requests::callgrind;

fn main() {
fn main() {
    let mut args_iter = std::env::args().skip(1);
    let file_arg = args_iter.next().expect("File argument should be present");

    callgrind::toggle_collect();
    let file = File::open(file_arg).expect("Opening file should succeed");
    let stdout = stdout().lock();

    let mut writer = BufWriter::new(stdout);
    copy(&mut BufReader::new(file), &mut writer)
        .expect("Printing file to stdout should succeed");

    writer.flush().expect("Flushing writer should succeed");
    callgrind::toggle_collect();
}
}

For the purposes of this example, we decided that measuring the parsing of the command-line arguments is not interesting and excluded it from the collected metrics. The benchmark itself is reverted to its original state without the toggle argument:

extern crate iai_callgrind;
mod my_lib {
use std::{io, path::Path, process::ExitStatus};
pub fn cat(_: &Path) -> io::Result<ExitStatus> {
   std::process::Command::new("some").status()
}}
fn create_file() -> PathBuf { PathBuf::from("some") }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;
use iai_callgrind::{
   library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
   OutputFormat,
};
#[library_benchmark]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}
library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);
fn main() {
main!(library_benchmark_groups = my_group);
}

Now, running the benchmark shows

lib_bench_subprocess::my_group::bench_subprocess some:create_file()
  ## pid: 3421822 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
  Instructions:                        4467|N/A                  (*********)
  L1 Hits:                             6102|N/A                  (*********)
  L2 Hits:                               17|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                    6305|N/A                  (*********)
  Estimated Cycles:                   12697|N/A                  (*********)
  ## pid: 3421823 thread: 1 part: 1        |N/A
  Command:             target/release/cat /tmp/foo.txt
  Instructions:                        2429|N/A                  (*********)
  L1 Hits:                             3406|N/A                  (*********)
  L2 Hits:                                8|N/A                  (*********)
  RAM Hits:                             138|N/A                  (*********)
  Total read+write:                    3552|N/A                  (*********)
  Estimated Cycles:                    8276|N/A                  (*********)
  ## Total
  Instructions:                        6896|N/A                  (*********)
  L1 Hits:                             9508|N/A                  (*********)
  L2 Hits:                               25|N/A                  (*********)
  RAM Hits:                             324|N/A                  (*********)
  Total read+write:                    9857|N/A                  (*********)
  Estimated Cycles:                   20973|N/A                  (*********)

As expected, the metrics for the cat binary are a little bit lower since we skipped measuring the parsing of the command-line arguments.

Even more Examples

Have a look at the github repository. We test the library benchmark functionality of Iai-Callgrind with system tests in the private benchmark-tests package.

Each system test there can serve as an example, but for a fully documented and commented one, see here.

Binary Benchmarks

You want to start benchmarking your crate's binary? Best start with the Quickstart section.

Setting up binary benchmarks is very similar to library benchmarks, and it's a good idea to have a look at the library benchmark section of this guide, too.

You may then come back to the binary benchmarks section and go through the differences to library benchmarks.

If you need more examples, see here.

Important default behaviour

As in library benchmarks, the environment variables are cleared before running a binary benchmark. Have a look at the Configuration section if you want to change this behavior. Iai-Callgrind sometimes deviates from the Valgrind defaults, as shown below:

Iai-Callgrind            Valgrind (v3.23)
--trace-children=yes     --trace-children=no
--fair-sched=try         --fair-sched=no
--separate-threads=yes   --separate-threads=no
--cache-sim=yes          --cache-sim=no

As shown in the table above, the benchmarks run with cache simulation switched on. This adds run time to each benchmark. If you don't need the cache metrics and the estimation of cycles, you can easily switch off cache simulation, for example with:

#![allow(unused)]
fn main() {
extern crate iai_callgrind;
use iai_callgrind::BinaryBenchmarkConfig;

BinaryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
}

To switch off cache simulation for all benchmarks in the same file:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig
};

#[binary_benchmark]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(
    config = BinaryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
    binary_benchmark_groups = my_group
);
}

Quickstart

Suppose the crate's binary is called my-foo and this binary takes a file path as a positional argument. This first example shows the basic usage of the high-level api with the #[binary_benchmark] attribute:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};

#[binary_benchmark]
#[bench::some_id("foo.txt")]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(
    name = my_group;
    benchmarks = bench_binary
);

fn main() {
main!(binary_benchmark_groups = my_group);
}

If you want to try out this example with your crate's binary, put the above code into a file in $WORKSPACE_ROOT/benches/binary_benchmark.rs. Next, replace my-foo in env!("CARGO_BIN_EXE_my-foo") with the name of a binary of your crate.

Note that the env! macro is a Rust builtin macro and CARGO_BIN_EXE_<name> is documented here.

You should always use env!("CARGO_BIN_EXE_<name>") to determine the path to the binary of your crate. Do not use relative paths like target/release/my-foo since this might break your benchmarks in many ways. The environment variable does exactly the right thing and the usage is short and simple.

Lastly, adjust the argument of the Command and add the following to your Cargo.toml:

[[bench]]
name = "binary_benchmark"
harness = false

Running

cargo bench

presents you with something like the following:

binary_benchmark::my_group::bench_binary some_id:("foo.txt") -> target/release/my-foo foo.txt
  Instructions:              342129|N/A             (*********)
  L1 Hits:                   457370|N/A             (*********)
  L2 Hits:                      734|N/A             (*********)
  RAM Hits:                    4096|N/A             (*********)
  Total read+write:          462200|N/A             (*********)
  Estimated Cycles:          604400|N/A             (*********)

As opposed to library benchmarks, binary benchmarks have access to a low-level api. Here is pretty much the same benchmark as the high-level usage above, but written in the low-level api:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{BinaryBenchmark, BinaryBenchmarkGroup, Bench, binary_benchmark_group, main};

binary_benchmark_group!(
    name = my_group;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group.binary_benchmark(BinaryBenchmark::new("bench_binary")
            .bench(Bench::new("some_id")
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("foo.txt")
                    .build()
                )
            )
        )
    }
);

fn main() {
main!(binary_benchmark_groups = my_group);
}

If in doubt, use the high-level api. You can still migrate to the low-level api very easily if you really need to. The other way around is more involved.

Differences to library benchmarks

In this section we're going through the differences to library benchmarks. This assumes that you already know how to set up library benchmarks, and it is recommended to learn the very basics about library benchmarks, starting with Quickstart, Anatomy of a library benchmark and The macros in more detail. Then come back to this section.

Name changes

Coming from library benchmarks, names containing library change to the same name with library replaced by binary: the #[library_benchmark] attribute becomes #[binary_benchmark], library_benchmark_group! becomes binary_benchmark_group!, the config arguments take a BinaryBenchmarkConfig instead of a LibraryBenchmarkConfig, and so on.

A quick reference of available macros in binary benchmarks:

  • #[binary_benchmark] and its inner attributes #[bench] and #[benches]: The exact counterpart to the #[library_benchmark] attribute macro.
  • binary_benchmark_group!: Just the name of the macro has changed.
  • binary_benchmark_attribute!: An additional macro if you intend to migrate from the high-level to the low-level api.
  • main!: The same macro as in library benchmarks but the name of the library_benchmark_groups parameter changed to binary_benchmark_groups.

To see all macros in action have a look at the example below.

The return value of the benchmark function

Perhaps the most important difference is that the #[binary_benchmark] annotated function always needs to return an iai_callgrind::Command. Note that this function only builds the command which is going to be benchmarked but doesn't execute it yet. So, the code in this function does not contribute to the event counts of the actual benchmark.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};
use std::path::PathBuf;

#[binary_benchmark]
#[bench::foo("foo.txt")]
#[bench::bar("bar.json")]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    // We can put any code in this function which is needed to configure and
    // build the `Command`.
    let path = PathBuf::from(path);

    // Here, if the `path` ends with `.txt` we want to see
    // the `Stdout` output of the `Command` in the benchmark output. In all other 
    // cases, the `Stdout` of the `Command` is redirected to a `File` with the
    // same name as the input `path` but with the extension `out`.
    let stdout = if path.extension().unwrap() == "txt" {
        iai_callgrind::Stdio::Inherit
    } else {
        iai_callgrind::Stdio::File(path.with_extension("out"))
    };
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .stdout(stdout)
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

setup and teardown

Since we can put any code needed to build the Command into the function itself, the setup and teardown parameters of #[binary_benchmark], #[bench] and #[benches] work differently than in library benchmarks.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};

fn create_file() {
    std::fs::write("foo.txt", "some content").unwrap();
}

#[binary_benchmark]
#[bench::foo(args = ("foo.txt"), setup = create_file())]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

setup, which here is the expression create_file(), is not evaluated right away, and the return value of setup is not used as input for the function! Instead, the expression in setup is evaluated and executed just before the benchmarked Command is executed. Similarly, the teardown expression is executed after the Command has finished.

In the example above, setup always creates the same file and is pretty static. As you're used to from library benchmarks, it's possible to pass the same arguments to setup (or teardown) and the benchmark function by specifying just the path to a function instead of a function call:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};

fn create_file(path: &str) {
    std::fs::write(path, "some content").unwrap();
}

fn delete_file(path: &str) {
    std::fs::remove_file(path).unwrap();
}

#[binary_benchmark]
// Note the missing parentheses for `setup` of the function `create_file` which
// tells Iai-Callgrind to pass the `args` to the `setup` function AND the
// function `bench_binary`
#[bench::foo(args = ("foo.txt"), setup = create_file)]
// Same for `teardown`
#[bench::bar(args = ("bar.txt"), setup = create_file, teardown = delete_file)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

The Command's stdin and simulating piped input

The behaviour of the Stdin of the Command can be changed in almost the same way as the stdin of a std::process::Command, with the only difference that we use the enums iai_callgrind::Stdin and iai_callgrind::Stdio. These enums provide the variants Inherit (the equivalent of std::process::Stdio::inherit), Pipe (the equivalent of std::process::Stdio::piped) and so on. There's also File, which takes a PathBuf to the file which is used as Stdin for the Command. This corresponds to a redirection in the shell as in my-foo < path/to/file.
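
For example, to simulate the shell redirection my-foo < foo.txt, the Command's Stdin can be set to a file. A minimal sketch, assuming a crate binary named my-foo and an existing file foo.txt:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main, Stdin};
use std::path::PathBuf;

#[binary_benchmark]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        // The equivalent of the shell redirection `my-foo < foo.txt`
        .stdin(Stdin::File(PathBuf::from("foo.txt")))
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}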

Moreover, iai_callgrind::Stdin provides the Stdin::Setup variant specific to Iai-Callgrind:

Applications may change their behaviour if the input or the Stdin of the Command is coming from a pipe as in echo "some content" | my-foo. To be able to benchmark such cases, it is possible to use the output of setup to Stdout or Stderr as Stdin for the Command.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main, Stdin, Pipe};

fn setup_pipe() {
    println!(
        "The output to `Stdout` here will be the input or `Stdin` of the `Command`"
    );
}

#[binary_benchmark]
#[bench::foo(setup = setup_pipe())]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .stdin(Stdin::Setup(Pipe::Stdout))
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

Usually, setup, then the Command, and then teardown are executed sequentially, each waiting for the previous process to exit successfully (see also Configure the exit code of the Command). If Command::stdin is changed to Stdin::Setup, setup and the Command are executed in parallel, and Iai-Callgrind waits first for the Command to exit, then for setup. After the successful exit of setup, teardown is executed.

Since setup and Command are run in parallel if Stdin::Setup is used, it is sometimes necessary to delay the execution of the Command. Please see the delay chapter for more details.

Configuration

The configuration of binary benchmarks works the same way as in library benchmarks, with the name changing from LibraryBenchmarkConfig to BinaryBenchmarkConfig. Please see there for the basics. However, binary benchmarks have some additional configuration possibilities:

Delay the Command

Delaying the execution of the Command with Command::delay might be necessary if the setup is executed in parallel either with Command::setup_parallel or Command::stdin set to Stdin::Setup.

For example, if you have a server which needs to be started in the setup to be able to benchmark a client (in our example a crate's binary simply named client):

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use std::net::{SocketAddr, TcpListener};
use std::time::Duration;
use std::thread;

use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, Delay, DelayKind
};

const ADDRESS: &str = "127.0.0.1:31000";

fn setup_tcp_server() {
    println!("Waiting to start server...");
    thread::sleep(Duration::from_millis(300));

    println!("Starting server...");
    let listener = TcpListener::bind(
            ADDRESS.parse::<SocketAddr>().unwrap()
        ).unwrap();

    thread::sleep(Duration::from_secs(1));

    drop(listener);
    println!("Stopped server...");
}

#[binary_benchmark(setup = setup_tcp_server())]
fn bench_client() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_client"))
        .setup_parallel(true)
        .delay(
            Delay::new(DelayKind::TcpConnect(
                ADDRESS.parse::<SocketAddr>().unwrap(),
            ))
            .timeout(Duration::from_millis(500)),
        )
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_client);
fn main() {
main!(binary_benchmark_groups = my_group);
}

The server is started in the parallel setup function setup_tcp_server since Command::setup_parallel is set to true. The delay of the Command is configured with a Delay in Command::delay to wait for the tcp connection to become available. We also applied a timeout of 500 milliseconds with Delay::timeout, so if something goes wrong in the server and the tcp connection cannot be established, the benchmark exits with an error after 500 milliseconds instead of hanging forever. After the successful delay, the actual client is executed and benchmarked. After the exit of the client, Iai-Callgrind waits for the setup to exit successfully. Then, if present, the teardown function is executed.

Please see the library documentation for all possible DelayKinds and more details on the Delay.

Sandbox

The Sandbox is a temporary directory which is created before the execution of the setup and deleted after the teardown. setup, the Command and teardown are executed inside this temporary directory. This simply describes the order of execution; neither setup nor teardown needs to be present.

Why use a Sandbox?

A Sandbox can help mitigate differences in benchmark results on different machines. As long as $TMP_DIR is unset or set to /tmp, the temporary directory has a constant length on unix machines (except android, which uses /data/local/tmp). The directory itself is created with a constant-length but random name like /tmp/.a23sr8fk.

It is not implausible that an executable has different event counts just because the directory it is executed in has a path of different length. For example, if a member of your project has set up the project in /home/bob/workspace/our-project and runs the benchmarks in this directory, while the CI runs the benchmarks in /runner/our-project, the event counts might differ. If possible, the benchmarks should be run in a constant environment; clearing the environment variables is another such measure.

Other good reasons for using a Sandbox are convenience, e.g. if you create files during the setup and the Command run and do not want to delete them all manually. Or, maybe more importantly, if the Command is destructive and deletes files, it is usually safer to run such a Command in a temporary directory where it cannot cause damage to your or other file systems.

The Sandbox is deleted after the benchmark, regardless of whether the benchmark run was successful or not. Such cleanup is not guaranteed if you only rely on teardown, since teardown is only executed if the Command returns without error.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, Sandbox
};

fn create_file(path: &str) {
    std::fs::write(path, "some content").unwrap();
}

#[binary_benchmark]
#[bench::foo(
    args = ("foo.txt"),
    config = BinaryBenchmarkConfig::default().sandbox(Sandbox::new(true)),
    setup = create_file
)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

In this example, as part of the setup, the create_file function with the argument foo.txt is executed in the Sandbox before the Command is executed. The Command is executed in the same Sandbox and therefore the file foo.txt with the content some content exists thanks to the setup. After the execution of the Command, the Sandbox is completely removed, deleting all files created during setup, the Command execution (and teardown if it had been present in this example).

Since setup is run in the sandbox, you can't copy fixtures from your project's workspace into the sandbox that easily anymore. The Sandbox can be configured to copy fixtures into the temporary directory with Sandbox::fixtures:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, Sandbox
};

#[binary_benchmark]
#[bench::foo(
    args = ("foo.txt"),
    config = BinaryBenchmarkConfig::default()
        .sandbox(Sandbox::new(true)
            .fixtures(["benches/foo.txt"])),
)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

The above will copy the fixture file foo.txt in the benches directory into the sandbox root as foo.txt. Relative paths in Sandbox::fixtures are interpreted relative to the workspace root. In a multi-crate workspace this is the directory with the top-level Cargo.toml file. Paths in Sandbox::fixtures are not limited to files, they can be directories, too.

If you have more complex demands, you can access the workspace root via the environment variable _WORKSPACE_ROOT in setup and teardown. Suppose there is a fixture located in /home/the_project/foo_crate/benches/fixtures/foo.txt, with the_project being the workspace root and foo_crate a workspace member with the my-foo executable. If the command is expected to create a file bar.json which needs further inspection after the benchmarks have run, let's copy it into a temporary directory tmp (which may or may not exist) in foo_crate:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, Sandbox
};
use std::path::PathBuf;

fn copy_fixture(path: &str) {
    let workspace_root = PathBuf::from(std::env::var_os("_WORKSPACE_ROOT").unwrap());
    std::fs::copy(
        workspace_root.join("foo_crate").join("benches").join("fixtures").join(path),
        path
    ).unwrap();
}

// This function will fail if `bar.json` does not exist, which is fine as this
// file is expected to be created by `my-foo`. So, if this file does not exist,
// an error will occur and the benchmark will fail. Although benchmarks are not
// expected to test the correctness of the application, the `teardown` can be
// used to check postconditions for a successful command run.
fn copy_back(path: &str) {
    let workspace_root = PathBuf::from(std::env::var_os("_WORKSPACE_ROOT").unwrap());
    let dest_dir = workspace_root.join("foo_crate").join("tmp");
    if !dest_dir.exists() {
        std::fs::create_dir(&dest_dir).unwrap();
    }
    std::fs::copy(path, dest_dir.join(path)).unwrap();
}

#[binary_benchmark]
#[bench::foo(
    args = ("foo.txt"),
    config = BinaryBenchmarkConfig::default().sandbox(Sandbox::new(true)),
    setup = copy_fixture,
    teardown = copy_back("bar.json")
)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

Configure the exit code of the Command

Usually, if a Command exits with a non-zero exit code, the whole benchmark run fails and stops. If the benchmarked Command is expected to exit with a code different from 0, the expected exit code can be set with BinaryBenchmarkConfig::exit_with or Command::exit_with:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
     binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, ExitWith
};

#[binary_benchmark]
// Here, we set the expected exit code of `my-foo` to 2
#[bench::exit_with_2(
    config = BinaryBenchmarkConfig::default().exit_with(ExitWith::Code(2))
)]
// Here, we don't know the exact exit code but know it is different from 0 (=success)
#[bench::exit_with_failure(
    config = BinaryBenchmarkConfig::default().exit_with(ExitWith::Failure)
)]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

Low-level api

I'm not going into full detail of the low-level api here since it is fully documented in the api Documentation.

The basic structure

The entry point of the low-level api is the binary_benchmark_group! macro:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
     binary_benchmark, binary_benchmark_attribute, binary_benchmark_group, main,
     BinaryBenchmark, BinaryBenchmarkGroup, Bench
};

binary_benchmark_group!(
    name = my_group;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group.binary_benchmark(BinaryBenchmark::new("bench_binary")
            .bench(Bench::new("some_id")
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("foo.txt")
                    .build()
                )
            )
        )
    }
);

fn main() {
main!(binary_benchmark_groups = my_group);
}

The low-level api mirrors the high-level api, "structifying" the macros.

The binary_benchmark_group! macro now has a struct counterpart, the BinaryBenchmarkGroup. It cannot be instantiated directly. Instead, it is passed as an argument to the expression of the benchmarks parameter in a binary_benchmark_group!. You can choose any name instead of group; we just used group throughout the examples.

There's the shorter benchmarks = |group| /* ... */ instead of benchmarks = |group: &mut BinaryBenchmarkGroup| /* ... */. We use the more verbose variant in the examples because it is more informative for benchmarking starters.

Furthermore, the #[binary_benchmark] macro correlates with iai_callgrind::BinaryBenchmark and #[bench] with iai_callgrind::Bench. The parameters of the macros are now methods of the respective structs. The return value of the benchmark function, the iai_callgrind::Command, now has its counterpart in the iai_callgrind::Bench::command method.

Note there is no iai_callgrind::Benches struct, since specifying multiple commands with iai_callgrind::Bench::command behaves exactly the same way as the #[benches] attribute. So, the file parameter of #[benches] is part of iai_callgrind::Bench and can be used with the iai_callgrind::Bench::file function.
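
For example, a Bench with two commands corresponds to a #[benches] attribute with two arguments in the high-level api. This is just a sketch with the placeholder binary my-foo:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark_group, main, BinaryBenchmark, BinaryBenchmarkGroup, Bench
};

binary_benchmark_group!(
    name = my_group;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group.binary_benchmark(BinaryBenchmark::new("bench_binary")
            .bench(Bench::new("multiple")
                // Multiple commands in a single `Bench` behave like the
                // `#[benches]` attribute in the high-level api
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("foo.txt")
                    .build())
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("bar.txt")
                    .build())
            )
        )
    }
);

fn main() {
main!(binary_benchmark_groups = my_group);
}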

Intermixing high-level and low-level api

It is recommended to start with the high-level api using the #[binary_benchmark] attribute, since you can fall back to the low-level api in a few steps with the binary_benchmark_attribute! macro as shown below. The other way around is much more involved.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
     binary_benchmark, binary_benchmark_attribute, binary_benchmark_group, main,
     BinaryBenchmark, BinaryBenchmarkGroup, Bench
};

#[binary_benchmark]
#[bench::some_id("foo")]
fn attribute_benchmark(arg: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-binary"))
        .arg(arg)
        .build()
}

binary_benchmark_group!(
    name = low_level;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group
            .binary_benchmark(binary_benchmark_attribute!(attribute_benchmark))
            .binary_benchmark(
                BinaryBenchmark::new("low_level_benchmark")
                    .bench(
                        Bench::new("some_id").command(
                            iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-binary"))
                                .arg("bar")
                                .build()
                        )
                    )
            )
    }
);

fn main() {
main!(binary_benchmark_groups = low_level);
}

As shown above, there's no need to transcribe the function attribute_benchmark with the #[binary_benchmark] attribute into the low-level api structures. Just keep it as it is and add it to the group with group.binary_benchmark(binary_benchmark_attribute!(attribute_benchmark)). That's it! You can continue hacking on your benchmarks in the low-level api.

More examples needed?

As with library benchmarks, I refer you here to the github repository. The binary benchmark functionality of Iai-Callgrind is tested with system tests in the private benchmark-tests package.

Each system test there can serve as an example, but for a fully documented and commented one see here.

Performance Regressions

With Iai-Callgrind you can define limits for each event kind over which a performance regression is assumed. Per default, Iai-Callgrind does not perform any regression checks; you have to opt in with a RegressionConfig at benchmark level with a LibraryBenchmarkConfig or BinaryBenchmarkConfig, or at a global level with Command-line arguments or Environment variables.

Define a performance regression

A performance regression check consists of an EventKind and a percentage limit. If the limit is positive, a regression is assumed if the metric increases by more than this percentage compared to the previous run. If the limit is negative, a regression is assumed if the metric decreases by more than this percentage.

The default EventKind is EventKind::Ir with a value of +10%.
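
As a small sketch of both cases (assuming the EstimatedCycles event kind listed in the command-line reference further below), a positive limit fails the check when the metric increases by more than the given percentage, a negative limit when it decreases by more than it:

extern crate iai_callgrind;
use iai_callgrind::{EventKind, RegressionConfig};

fn main() {
// Fail if instructions increase by more than 5% or if estimated cycles
// decrease by more than 10% compared to the previous run
let _regression_config = RegressionConfig::default()
    .limits([(EventKind::Ir, 5.0), (EventKind::EstimatedCycles, -10.0)]);
}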

For example, in a Library Benchmark, define a limit of +5% for the total instructions executed (the Ir event kind) in all benchmarks of this file:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    RegressionConfig, EventKind
};
use std::hint::black_box;

#[library_benchmark]
fn bench_library() -> Vec<i32> {
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .regression(
            RegressionConfig::default()
                .limits([(EventKind::Ir, 5.0)])
        );
    library_benchmark_groups = my_group
);
}

Now, if the comparison of the Ir events of the current bench_library benchmark run with the previous run results in an increase of over 5%, the benchmark fails. Please also have a look at the api docs for further configuration options.

Running the benchmark from above the first time results in the following output:

my_benchmark::my_group::bench_library
  Instructions:                 215|N/A             (*********)
  L1 Hits:                      288|N/A             (*********)
  L2 Hits:                        0|N/A             (*********)
  RAM Hits:                       7|N/A             (*********)
  Total read+write:             295|N/A             (*********)
  Estimated Cycles:             533|N/A             (*********)

Let's assume there's a change in my_lib::bubble_sort which has increased the instruction counts. Running the benchmark again then results in output similar to this:

my_benchmark::my_group::bench_library
  Instructions:                 281|215             (+30.6977%) [+1.30698x]
  L1 Hits:                      374|288             (+29.8611%) [+1.29861x]
  L2 Hits:                        0|0               (No change)
  RAM Hits:                       8|7               (+14.2857%) [+1.14286x]
  Total read+write:             382|295             (+29.4915%) [+1.29492x]
  Estimated Cycles:             654|533             (+22.7017%) [+1.22702x]
Performance has regressed: Instructions (281 > 215) regressed by +30.6977% (>+5.00000)
iai_callgrind_runner: Error: Performance has regressed.
error: bench failed, to rerun pass `-p the-crate --bench my_benchmark`

Caused by:
  process didn't exit successfully: `/path/to/your/project/target/release/deps/my_benchmark-a9b36fec444944bd --bench` (exit status: 1)
error: Recipe `bench-test` failed on line 175 with exit code 1

Which event to choose to measure performance regressions?

If in doubt, the definite answer is Ir (instructions executed). If the Ir event counts decrease noticeably, the function (binary) runs faster. The inverse statement is also true: if the Ir counts increase noticeably, there's a slowdown of the function (binary).

These statements are not so easy to transfer to Estimated Cycles and the other event counts. But, depending on the scenario and the function (binary) under test, it can be reasonable to define more regression checks.
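
For example, such additional limits can also be set on the command line (see the command-line reference further below) without touching the benchmark code:

cargo bench -- --regression='ir=5, EstimatedCycles=10'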

Who actually uses instructions to measure performance?

The ones known to the author of this humble guide are

  • SQLite: They mainly use cpu instructions to measure performance improvements (and regressions).
  • The benchmarks of the rustc compiler also rely heavily on instruction counts, but they use cache metrics and cycles as well.

If you know of others, please feel free to add them to this list.

Other Valgrind Tools

In addition to the default Callgrind benchmarks, you can use the Iai-Callgrind framework to run other Valgrind profiling tools like DHAT, Massif and the experimental BBV, but also Memcheck, Helgrind and DRD if you need to check the memory and thread safety of the benchmarked code. See also the Valgrind User Manual for more details and command-line arguments. The additional tools can be specified in a LibraryBenchmarkConfig or BinaryBenchmarkConfig. For example, to run DHAT for all library benchmarks in addition to Callgrind:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, 
    Tool, ValgrindTool
};
use std::hint::black_box;

#[library_benchmark]
fn bench_library() -> Vec<i32> {
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .tool(Tool::new(ValgrindTool::DHAT));
    library_benchmark_groups = my_group
);
}

All tools which produce an ERROR SUMMARY (Memcheck, DRD, Helgrind) have --error-exitcode=201 set, so if there are any errors, the benchmark run fails with 201. You can overwrite this default with

#![allow(unused)]
fn main() {
extern crate iai_callgrind;
use iai_callgrind::{Tool, ValgrindTool};

Tool::new(ValgrindTool::Memcheck).args(["--error-exitcode=0"]);
}

which would restore the default of 0 from valgrind.

Valgrind Client Requests

Iai-Callgrind ships with its own interface to Valgrind's Client Request Mechanism. Iai-Callgrind's client requests have zero overhead (relative to the "C" implementation of Valgrind) on many targets which are also natively supported by valgrind. In short, Iai-Callgrind provides a complete and performant implementation of Valgrind Client Requests.

Installation

Client requests are deactivated by default but can be activated with the client_requests feature.

[dev-dependencies]
iai-callgrind = { version = "0.14.0", features = ["client_requests"] }

If you need the client requests in your production code, you don't want them to do anything when the code is not running under valgrind in an Iai-Callgrind benchmark. You can achieve that by adding Iai-Callgrind with the client_requests_defs feature to your runtime dependencies and with the client_requests feature to your dev-dependencies like so:

[dependencies]
iai-callgrind = { version = "0.14.0", default-features = false, features = [
    "client_requests_defs"
] }

[dev-dependencies]
iai-callgrind = { version = "0.14.0", features = ["client_requests"] }

With just the client_requests_defs feature activated, the client requests compile down to nothing and don't add any overhead to your production code. This feature simply provides the "definitions", i.e. the method signatures and macros without a body. Only with the client_requests feature activated are they actually executed. Note that the client requests do not depend on any other part of Iai-Callgrind, so you could even use the client requests without the rest of Iai-Callgrind.

When building Iai-Callgrind with client requests, the valgrind header files must exist in your standard include path (most of the time /usr/include). This is usually the case if you've installed valgrind with your distribution's package manager. If not, you can point the IAI_CALLGRIND_VALGRIND_INCLUDE or IAI_CALLGRIND_<triple>_VALGRIND_INCLUDE environment variables to the include path. So, if the headers can be found in /home/foo/repo/valgrind/{valgrind.h, callgrind.h, ...}, the correct include path would be IAI_CALLGRIND_VALGRIND_INCLUDE=/home/foo/repo (not /home/foo/repo/valgrind).
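
For example, given the header location from above, building and running the benchmarks could then look like this:

IAI_CALLGRIND_VALGRIND_INCLUDE=/home/foo/repo cargo bench --bench my_benchmark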

Usage

Use them in your code for example like so:

extern crate iai_callgrind;
use iai_callgrind::client_requests;

fn main() {
    // Start callgrind event counting if not already started earlier
    client_requests::callgrind::start_instrumentation();

    // do something important

    // Switch event counting off
    client_requests::callgrind::stop_instrumentation();
}

Library Benchmarks

In library benchmarks you might need to use EntryPoint::None in order to make the client requests work as expected:

extern crate iai_callgrind;
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

pub mod my_lib {
     #[inline(never)]
     fn bubble_sort(input: Vec<i32>) -> Vec<i32> {
         // The algorithm
         input
     }

     pub fn pre_bubble_sort(input: Vec<i32>) -> Vec<i32> {
         println!("Doing something before the function call");
         iai_callgrind::client_requests::callgrind::start_instrumentation();

         let result = bubble_sort(input);

         iai_callgrind::client_requests::callgrind::stop_instrumentation();
         result
     }
}

#[library_benchmark]
#[bench::small(vec![3, 2, 1])]
#[bench::bigger(vec![5, 4, 3, 2, 1])]
fn bench_function(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::pre_bubble_sort(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_function);
fn main() {
main!(library_benchmark_groups = my_group);
}

The default EntryPoint sets --toggle-collect to the benchmark function (here bench_function) and --collect-at-start=no. So, Callgrind starts collecting events when entering the benchmark function, not at the moment start_instrumentation is called. This behaviour can be remedied with EntryPoint::None:

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    client_requests, EntryPoint
};
use std::hint::black_box;

pub mod my_lib {
     #[inline(never)]
     fn bubble_sort(input: Vec<i32>) -> Vec<i32> {
         // The algorithm
         input
     }

     pub fn pre_bubble_sort(input: Vec<i32>) -> Vec<i32> {
         println!("Doing something before the function call");
         iai_callgrind::client_requests::callgrind::start_instrumentation();

         let result = bubble_sort(input);

         iai_callgrind::client_requests::callgrind::stop_instrumentation();
         result
     }
}

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--collect-at-start=no"])
        .entry_point(EntryPoint::None)
)]
#[bench::small(vec![3, 2, 1])]
#[bench::bigger(vec![5, 4, 3, 2, 1])]
fn bench_function(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::pre_bubble_sort(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_function);
fn main() {
main!(library_benchmark_groups = my_group);
}

As the standard toggle is now switched off and the automatically added option --collect-at-start=no is omitted, too, you must specify --collect-at-start=no yourself, as done above with LibraryBenchmarkConfig::callgrind_args.

Please see the docs for more details!

Callgrind Flamegraphs

Flamegraphs are opt-in and can be created if you pass a FlamegraphConfig to the BinaryBenchmarkConfig or LibraryBenchmarkConfig. Callgrind flamegraphs are meant as a complement to valgrind's visualization tools callgrind_annotate and kcachegrind.

For example, to create all kinds of flamegraphs for all benchmarks in a library benchmark file:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    FlamegraphConfig
};
use std::hint::black_box;

#[library_benchmark]
fn bench_library() -> Vec<i32> {
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .flamegraph(FlamegraphConfig::default());
    library_benchmark_groups = my_group
);
}

The produced flamegraph *.svg files are located next to the respective callgrind output file in the target/iai directory.

Regular Flamegraphs

Regular callgrind flamegraphs show the inclusive costs for functions and a single EventKind (default is EventKind::Ir), similar to callgrind_annotate. Suppose the example from above is stored in a benchmark iai_callgrind_benchmark:

Regular Flamegraph

If you open this image in a new tab, you can play around with the svg.

Differential Flamegraphs

Differential flamegraphs facilitate a deeper understanding of the code sections which cause a bottleneck or a performance regression.

Differential Flamegraph

We simulated a small change in bubble_sort and in the differential flamegraph you can spot fairly easily where the increase of Instructions is happening.

(Experimental) Create flamegraphs for multi-threaded/multi-process benchmarks

Note the following only affects flamegraphs of multi-threaded/multi-process benchmarks and benchmarks which produce multiple parts with a total over all sub-metrics.

Currently, Iai-Callgrind creates the flamegraphs only for the total over all threads/parts and subprocesses. This leads to complications since the call graph cannot be fully recovered just by examining each thread/subprocess separately. So, the total metrics in the flamegraphs might not be the same as the total metrics shown in the terminal output. If in doubt, the terminal output shows the correct metrics.

Basic usage

It's possible to pass arguments to Iai-Callgrind separated by -- (cargo bench -- ARGS). If you're running into the error Unrecognized Option, see Troubleshooting. For a complete rundown of possible arguments, execute cargo bench --bench <benchmark> -- --help. Almost all command-line arguments have a corresponding environment variable. The environment variables which don't have a corresponding command-line argument are IAI_CALLGRIND_COLOR and IAI_CALLGRIND_LOG, both described in the section about controlling the output of Iai-Callgrind below.

The command-line arguments

High-precision and consistent benchmarking framework/harness for Rust

Boolish command line arguments take also one of `y`, `yes`, `t`, `true`, `on`,
`1`
instead of `true` and one of `n`, `no`, `f`, `false`, `off`, and `0` instead of
`false`

Usage: cargo bench ... [BENCHNAME] -- [OPTIONS]

Arguments:
  [BENCHNAME]
          If specified, only run benches containing this string in their names

          Note that a benchmark name might differ from the benchmark file name.

          [env: IAI_CALLGRIND_FILTER=]

Options:
      --callgrind-args <CALLGRIND_ARGS>
          The raw arguments to pass through to Callgrind

          This is a space separated list of command-line-arguments specified as
          if they were
          passed directly to valgrind.

          Examples:
            * --callgrind-args=--dump-instr=yes
            * --callgrind-args='--dump-instr=yes --collect-systime=yes'

          [env: IAI_CALLGRIND_CALLGRIND_ARGS=]

      --save-summary[=<SAVE_SUMMARY>]
          Save a machine-readable summary of each benchmark run in json format
          next to the usual benchmark output

          [env: IAI_CALLGRIND_SAVE_SUMMARY=]

          Possible values:
          - json:        The format in a space optimal json representation
          without newlines
          - pretty-json: The format in pretty printed json

      --allow-aslr[=<ALLOW_ASLR>]
          Allow ASLR (Address Space Layout Randomization)

          If possible, ASLR is disabled on platforms that support it (linux,
          freebsd) because ASLR could noise up the callgrind cache simulation results a
          bit. Setting this option to true runs all benchmarks with ASLR enabled.

          See also
          <https://docs.kernel.org/admin-guide/sysctl/kernel.html?
          highlight=randomize_va_space#randomize-va-space>

          [env: IAI_CALLGRIND_ALLOW_ASLR=]
          [possible values: true, false]

      --regression <REGRESSION>
          Set performance regression limits for specific `EventKinds`

          This is a `,` separate list of EventKind=limit (key=value) pairs with
          the limit being a positive or negative percentage. If positive, a performance
          regression check for this `EventKind` fails if the limit is exceeded. If
          negative, the regression check fails if the value comes below the limit. The
          `EventKind` is matched case-insensitive. For a list of valid `EventKinds` see
          the docs:
          <https://docs.rs/iai-callgrind/latest/iai_callgrind/enum.EventKind.html>

          Examples: --regression='ir=0.0' or --regression='ir=0,
          EstimatedCycles=10'

          [env: IAI_CALLGRIND_REGRESSION=]

      --regression-fail-fast[=<REGRESSION_FAIL_FAST>]
          If true, the first failed performance regression check fails the
          whole benchmark run

          This option requires `--regression=...` or
          `IAI_CALLGRIND_REGRESSION=...` to be present.

          [env: IAI_CALLGRIND_REGRESSION_FAIL_FAST=]
          [possible values: true, false]

      --save-baseline[=<SAVE_BASELINE>]
          Compare against this baseline if present and then overwrite it

          [env: IAI_CALLGRIND_SAVE_BASELINE=]

      --baseline[=<BASELINE>]
          Compare against this baseline if present but do not overwrite it

          [env: IAI_CALLGRIND_BASELINE=]

      --load-baseline[=<LOAD_BASELINE>]
          Load this baseline as the new data set instead of creating a new one

          [env: IAI_CALLGRIND_LOAD_BASELINE=]

      --output-format <OUTPUT_FORMAT>
          The terminal output format in default human-readable format or in
          machine-readable json format

          # The JSON Output Format

          The json terminal output schema is the same as the schema with the
          `--save-summary` argument when saving to a `summary.json` file. All other
          output than the json output goes to stderr and only the summary output goes to
          stdout. When not printing pretty json, each line is a dictionary summarizing a
          single benchmark. You can combine all lines (benchmarks) into an array for
          example with `jq`

          `cargo bench -- --output-format=json | jq -s`

          which transforms `{...}\n{...}` into `[{...},{...}]`

          [env: IAI_CALLGRIND_OUTPUT_FORMAT=]
          [default: default]
          [possible values: default, json, pretty-json]

      --separate-targets[=<SEPARATE_TARGETS>]
          Separate iai-callgrind benchmark output files by target

          The default output path for files created by iai-callgrind and
          valgrind during the benchmark is


          `target/iai/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID`.

          This can be problematic if you're running the benchmarks not only for
          a single target because you end up comparing the benchmark runs with the wrong
          targets. Setting this option changes the default output path to


          `target/iai/$TARGET/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/
              $BENCH_FUNCTION.$BENCH_ID`

          Although not as comfortable and strict, you could achieve a
          separation by target also with baselines and a combination of
          `--save-baseline=$TARGET` and `--baseline=$TARGET` if you prefer having all
          files of a single $BENCH in the same directory.

          [env: IAI_CALLGRIND_SEPARATE_TARGETS=]
          [default: false]
          [possible values: true, false]

      --home <HOME>
          Specify the home directory of iai-callgrind benchmark output files

          All output files are per default stored under the
          `$PROJECT_ROOT/target/iai` directory. This option lets you customize this
          home directory, and it will be created if it doesn't exist.

          [env: IAI_CALLGRIND_HOME=]

      --nocapture[=<NOCAPTURE>]
          Don't capture terminal output of benchmarks

          Possible values are one of [true, false, stdout, stderr].

          This option is currently restricted to the `callgrind` run of
          benchmarks. The output of additional tool runs like DHAT, Memcheck, ... is
          still captured, to prevent showing the same output of benchmarks multiple
          times. Use `IAI_CALLGRIND_LOG=info` to also show captured and logged output.

          If no value is given, the default missing value is `true` and doesn't
          capture stdout and stderr. Besides `true` or `false` you can specify the
          special values `stdout` or `stderr`. If `--nocapture=stdout` is given, the
          output to `stdout` won't be captured and the output to `stderr` will be
          discarded. Likewise, if `--nocapture=stderr` is specified, the output to
          `stderr` won't be captured and the output to `stdout` will be discarded.

          [env: IAI_CALLGRIND_NOCAPTURE=]
          [default: false]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Comparing with baselines

Usually, two consecutive benchmark runs let Iai-Callgrind compare these two runs. It's sometimes desirable to compare the current benchmark run against a static reference, instead. For example, if you're working longer on the implementation of a feature, you may wish to compare against a baseline from another branch or the commit from which you started off hacking on your new feature to make sure you haven't introduced performance regressions. Iai-Callgrind offers such custom baselines. If you are familiar with criterion.rs, the following command line arguments should also be very familiar to you:

  • --save-baseline=NAME (env: IAI_CALLGRIND_SAVE_BASELINE): Compare against the NAME baseline if present and then overwrite it.
  • --baseline=NAME (env: IAI_CALLGRIND_BASELINE): Compare against the NAME baseline without overwriting it
  • --load-baseline=NAME (env: IAI_CALLGRIND_LOAD_BASELINE): Load the NAME baseline as the new data set instead of creating a new one. This option needs also --baseline=NAME to be present.

If NAME is not present, NAME defaults to default.

For example to create a static reference from the main branch and compare it:

git checkout main
cargo bench --bench <benchmark> -- --save-baseline=main
git checkout feature
# ... HACK ... HACK
cargo bench --bench <benchmark> -- --baseline=main

Sticking to the above execution sequence,

cargo bench --bench my_benchmark -- --save-baseline=main

prints something like the following with an additional Baselines line in the output:

my_benchmark::my_group::bench_library
  Baselines:                   main|main
  Instructions:                 280|N/A             (*********)
  L1 Hits:                      374|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       6|N/A             (*********)
  Total read+write:             381|N/A             (*********)
  Estimated Cycles:             589|N/A             (*********)

After you've made some changes to your code, running

cargo bench --bench my_benchmark -- --baseline=main

prints something like the following:

my_benchmark::my_group::bench_library
  Baselines:                       |main
  Instructions:                 214|280             (-23.5714%) [-1.30841x]
  L1 Hits:                      287|374             (-23.2620%) [-1.30314x]
  L2 Hits:                        1|1               (No change)
  RAM Hits:                       6|6               (No change)
  Total read+write:             294|381             (-22.8346%) [-1.29592x]
  Estimated Cycles:             502|589             (-14.7708%) [-1.17331x]
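
To compare two stored baselines without taking a new measurement, --load-baseline can be combined with --baseline, for example like in the following sketch based on the option descriptions above:

# Measure the feature branch and store the result under the name `feature`
cargo bench --bench my_benchmark -- --save-baseline=feature
# Compare the stored `feature` baseline against the stored `main` baseline
# without running the benchmarks again
cargo bench --bench my_benchmark -- --load-baseline=feature --baseline=main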

Controlling the output of Iai-Callgrind

This section describes command-line options and environment variables which influence the terminal, file and logging output of Iai-Callgrind.

Customize the output directory

All output files of Iai-Callgrind are usually stored using the following scheme:

$WORKSPACE_ROOT/target/iai/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID

This directory structure can partly be changed with the following options.

Callgrind Home

Per default, all benchmark output files are stored under the $WORKSPACE_ROOT/target/iai directory tree. This home directory can be changed with the IAI_CALLGRIND_HOME environment variable or the command-line argument --home. The command-line argument overwrites the value of the environment variable. For example to store all files under the /tmp/iai-callgrind directory you can use IAI_CALLGRIND_HOME=/tmp/iai-callgrind or cargo bench -- --home=/tmp/iai-callgrind.

Separate targets

If you're running the benchmarks on different targets, it's necessary to separate the output files of the benchmark runs per target or else you could end up comparing the benchmarks with the wrong target, leading to strange results. You can achieve this with different baselines per target, but it's much less painful to separate the output files by target with the --separate-targets command-line argument or by setting the environment variable IAI_CALLGRIND_SEPARATE_TARGETS=yes. The output directory structure changes from

target/iai/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID

to

target/iai/$TARGET_TRIPLE/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID

For example, assuming the library benchmark file name is bench_file in the package my_package:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

#[library_benchmark]
#[bench::short(vec![4, 3, 2, 1])]
fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(values))
}

library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort);

fn main() {
main!(library_benchmark_groups = my_group);
}

Without --separate-targets:

target/iai/my_package/bench_file/my_group/bench_bubble_sort.short

and with --separate-targets assuming you're running the benchmark on the x86_64-unknown-linux-gnu target:

target/iai/x86_64-unknown-linux-gnu/my_package/bench_file/my_group/bench_bubble_sort.short

Machine-readable output

With --output-format=default|json|pretty-json (env: IAI_CALLGRIND_OUTPUT_FORMAT) you can change the terminal output format to the machine-readable json format. The json schema fully describing the json output is stored in summary.v2.schema.json. Each line of json output (if not pretty-json) is a summary of a single benchmark, and you may want to combine all benchmarks in an array. You can do so for example with jq

cargo bench -- --output-format=json | jq -s

which transforms {...}\n{...} into [{...},{...}].

Instead of, or in addition to changing the terminal output, it's possible to save a summary file for each benchmark with --save-summary=json|pretty-json (env: IAI_CALLGRIND_SAVE_SUMMARY). The summary.json files are stored next to the usual benchmark output files in the target/iai directory.
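
For example, to save a pretty printed summary.json for each benchmark in addition to the regular terminal output:

cargo bench -- --save-summary=pretty-json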

Showing terminal output of benchmarks

Per default, all terminal output of the benchmark function, setup and teardown is captured and therefore not shown during a benchmark run.

Using the log level

The most basic way to show any captured output is to use IAI_CALLGRIND_LOG=info. Note that this also includes a lot of other logging output.

Tell Iai-Callgrind to not capture the output

A nicer possibility is to tell Iai-Callgrind not to capture output with the --nocapture (env: IAI_CALLGRIND_NOCAPTURE) option. This is currently restricted to the callgrind run to prevent showing the same output multiple times. So, any terminal output of other tool runs is still captured.

The --nocapture flag takes the special values stdout and stderr in addition to true and false:

--nocapture=true|false|stdout|stderr

In the --nocapture=stdout case, terminal output to stdout is not captured and shown during the benchmark run but output to stderr is discarded. Likewise, --nocapture=stderr shows terminal output to stderr but discards output to stdout.

Let's take as an example a library benchmark benches/my_benchmark.rs:

extern crate iai_callgrind;
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

fn print_to_stderr(value: u64) {
    eprintln!("Error output during teardown: {value}");
}

fn add_10_and_print(value: u64) -> u64 {
    let value = value + 10;
    println!("Output to stdout: {value}");

    value
}

#[library_benchmark]
#[bench::some_id(args = (10), teardown = print_to_stderr)]
fn bench_library(value: u64) -> u64 {
    black_box(add_10_and_print(value))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);
fn main() {
main!(library_benchmark_groups = my_group);
}

If the above benchmark is run with cargo bench --bench my_benchmark -- --nocapture, the output of Iai-Callgrind will look like this:

my_benchmark::my_group::bench_library some_id:10
Output to stdout: 20
Error output during teardown: 20
- end of stdout/stderr
  Instructions:                 851|N/A             (*********)
  L1 Hits:                     1193|N/A             (*********)
  L2 Hits:                        5|N/A             (*********)
  RAM Hits:                      66|N/A             (*********)
  Total read+write:            1264|N/A             (*********)
  Estimated Cycles:            3528|N/A             (*********)

Everything between the headline and the - end of stdout/stderr line is output from your benchmark. The - end of stdout/stderr line changes depending on the options you have given. For example in the --nocapture=stdout case this line indicates your chosen option with - end of stdout.

Note that independently of the value of the --nocapture option, all logging output of a valgrind tool itself is stored in files in the output directory of the benchmark. Since Iai-Callgrind needs the logging output of valgrind tools stored in files, there is no option to disable the creation of these log files. But, if anything goes sideways you might be glad to have the log files around.

Changing the color output

The terminal output is colored per default but follows the value for the IAI_CALLGRIND_COLOR environment variable. If IAI_CALLGRIND_COLOR is not set, CARGO_TERM_COLOR is also tried. Accepted values are:

always, never, auto (default).

So, colors can be disabled by setting IAI_CALLGRIND_COLOR=never or CARGO_TERM_COLOR=never.
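
For example, to disable colors for a single benchmark run:

IAI_CALLGRIND_COLOR=never cargo bench --bench my_benchmark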

Changing the logging output

Iai-Callgrind uses env_logger and the default logging level WARN. To set the logging level to something different, set the environment variable IAI_CALLGRIND_LOG for example to IAI_CALLGRIND_LOG=DEBUG. Accepted values are:

error, warn (default), info, debug, trace.
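
For example, to run the benchmarks with debug level logging enabled:

IAI_CALLGRIND_LOG=debug cargo bench --bench my_benchmark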

The logging output is colored per default but follows the Color settings.

See also the documentation of env_logger.

I'm getting the error Sentinel ... not found

You've most likely disabled the creation of debug symbols in your cargo bench profile. This can originate from an option you've added to the release profile, since the bench profile inherits from the release profile. For example, if you've added strip = true to your release profile (which is perfectly fine), you need to disable this option in your bench profile to be able to run Iai-Callgrind benchmarks.
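
A minimal sketch of what this could look like in Cargo.toml, assuming you want to keep strip = true for releases while making debug symbols available to the benchmarks:

[profile.release]
strip = true

[profile.bench]
# Undo the `strip` setting inherited from the release profile and make sure
# debug symbols are available for the benchmarks
strip = false
debug = true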

See also the Debug Symbols section in Installation/Prerequisites.

Running cargo bench results in an "Unrecognized Option" error

For

cargo bench -- --some-valid-arg

to work you can either specify the benchmark with --bench BENCHMARK, for example

cargo bench --bench my_iai_benchmark -- --callgrind-args="--collect-bus=yes"

or add the following to your Cargo.toml:

[lib]
bench = false

and if you have binaries

[[bin]]
name = "my-binary"
path = "src/bin/my-binary.rs"
bench = false

Setting bench = false disables the creation of the implicit default libtest harness, which is added even if you haven't used #[bench] functions in your library or binary. Naturally, the default harness doesn't know about the Iai-Callgrind arguments and aborts execution, printing the Unrecognized Option error.

If you cannot or don't want to add bench = false to your Cargo.toml, you can alternatively use environment variables. There is a corresponding environment variable for every command-line argument.
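
For example, instead of passing --callgrind-args on the command line as above, the corresponding environment variable can be used:

IAI_CALLGRIND_CALLGRIND_ARGS="--collect-bus=yes" cargo bench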

Comparison of Iai-Callgrind with Criterion-rs

This is a comparison with Criterion-rs but some of the points in Pros and Cons also apply to other wall-clock time based benchmarking frameworks.

Iai-Callgrind Pros:

  • Iai-Callgrind can give answers that are repeatable to 7 or more significant digits. In comparison, actual (wall-clock) run times are scarcely repeatable beyond one significant digit.

    This allows you to implement and measure "microoptimizations". Typical microoptimizations reduce the number of CPU cycles by 0.1% or 0.05% or even less. Such improvements are impossible to measure with real-world timings. But hundreds or thousands of microoptimizations add up, resulting in measurable real-world performance gains.1

  • Iai-Callgrind can work reliably in noisy environments especially in CI environments from providers like GitHub Actions or Travis-CI, where Criterion-rs cannot.

  • The benchmark api of Iai-Callgrind is simple, intuitive and allows for a much more concise and clearer structure of benchmarks.

  • Iai-Callgrind can benchmark functions in binary crates.

  • Iai-Callgrind can benchmark private functions.

  • Although Callgrind adds runtime overhead, running each benchmark exactly once is still usually much faster than Criterion-rs' statistical measurements.

  • Criterion-rs creates plots and graphs about the averages, medians etc. which add considerable time to each benchmark run. Iai-Callgrind doesn't need any of these plots, since it can collect all its metrics in a single run.

  • Iai-Callgrind generates profile output from the benchmark without further effort.

  • With Iai-Callgrind you have native access to all the possibilities of all Valgrind tools, including Valgrind Client Requests.

Iai-Callgrind/Criterion-rs Mixed:

  • Although it is usually not significant, due to the high precision of the Iai-Callgrind measurements, changes in the benchmarks themselves, like adding a benchmark case, can have an effect on the other benchmarks. Iai-Callgrind can only try to reduce these effects to a minimum but never completely eliminate them. Criterion-rs does not have this problem because it cannot detect such small changes.

Iai-Callgrind Cons:

  • Iai-Callgrind's measurements merely correlate with wall-clock time. Wall-clock time is an obvious choice in many cases because it corresponds to what users perceive and Criterion-rs measures it directly.
  • Iai-Callgrind can only be used on platforms supported by Valgrind. Notably, this does not include Windows.
  • Iai-Callgrind needs additional binaries, valgrind and the iai-callgrind-runner. The version of the runner needs to be in sync with the iai-callgrind library. Criterion-rs is only a library and the installation is usually simpler.

Especially due to the first point in the Cons, I think it is still necessary to run wall-clock time benchmarks and use Criterion-rs in conjunction with Iai-Callgrind. But in CI and for performance regression checks, you shouldn't use Criterion-rs or other wall-clock time based benchmarks at all.

Comparison of Iai-Callgrind with Iai

This is a comparison with Iai, from which Iai-Callgrind was forked over a year ago.

Iai-Callgrind Pros:

  • Iai-Callgrind is actively maintained.

  • The benchmark api of Iai-Callgrind is simple, intuitive and allows for a much more concise and clearer structure of benchmarks.

  • More stable metrics, because the benchmark function is virtually encapsulated by Callgrind, which separates the benchmarked code from the surrounding code.

  • Iai-Callgrind excludes setup code from the metrics natively.

  • The Callgrind output files are much more focused on the benchmark function and the function under test than the Cachegrind output files that Iai produces. The calibration run of Iai only sanitized the visible summary output but not the metrics in the output files themselves. So, the output of cg_annotate was still cluttered with the initialization code, setup functions and their metrics.

  • Changes to the Iai-Callgrind library almost never have an influence on the benchmark metrics, since the actual runner (iai-callgrind-runner), and thus 99% of the code needed to run the benchmarks, is isolated from the benchmarks in an independent binary. This is in contrast to the library of Iai, which is compiled together with the benchmarks.

  • Iai-Callgrind has functionality in place that provides a more constant environment, like the Sandbox and clearing environment variables.

  • Supports running other Valgrind Tools, like DHAT, Massif etc.

  • Comparison of benchmark functions.

  • Iai-Callgrind can be configured to check for performance regressions.

  • A complete implementation of Valgrind Client Requests is available in Iai-Callgrind itself.

  • Comparison of benchmarks to baselines instead of only to .old files.

  • Iai-Callgrind natively supports benchmarking binaries.

  • Iai-Callgrind can print machine-readable output in .json format.

I don't see any downside in using Iai-Callgrind instead of Iai.