Introduction

This is the guide for Iai-Callgrind, a benchmarking framework/harness which uses Valgrind's Callgrind and other Valgrind tools like DHAT, Massif, ... to provide extremely accurate and consistent measurements of Rust code, making it perfectly suited to run in environments like a CI.

Iai-Callgrind is fully documented in this guide and in the API documentation at docs.rs.

Iai-Callgrind is

  • Precise: High-precision measurements of Instruction counts and many other metrics allow you to reliably detect very small optimizations and regressions of your code.
  • Consistent: Iai-Callgrind can take accurate measurements even in virtualized CI environments and make them comparable between different systems, completely negating the noise of the environment.
  • Fast: Each benchmark is only run once, which is usually much faster than benchmarks which measure execution and wall-clock time. Benchmarks measuring the wall-clock time have to be run many times to increase their accuracy, detect outliers, filter out noise, etc.
  • Visualizable: Iai-Callgrind generates a Callgrind (DHAT, ...) profile of the benchmarked code and can be configured to create flamegraph-like charts from Callgrind metrics. In general, all tools of the Valgrind ecosystem, like callgrind_annotate, kcachegrind or dh_view.html, are fully supported for analyzing the results in detail.
  • Easy: The API for setting up benchmarks is easy to use and allows you to quickly create concise and clear benchmarks. Focus more on profiling and your code than on the framework.

Design philosophy and goals

Iai-Callgrind benchmarks are designed to be runnable with cargo bench. The benchmark files are expanded to a benchmarking harness which replaces the native benchmark harness of Rust. Iai-Callgrind is a profiling framework that can quickly and reliably detect performance regressions and optimizations even in noisy environments with a precision that is impossible to achieve with wall-clock time based benchmarks. At the same time, we want to abstract the complicated parts and repetitive tasks away and provide an easy-to-use and intuitive API. Iai-Callgrind tries to stay out of your way so you can focus more on profiling and your code!

When not to use Iai-Callgrind

Although Iai-Callgrind is useful in many projects, there are cases where Iai-Callgrind is not a good fit.

  • If you need wall-clock times, Iai-Callgrind cannot help you much. The estimation of CPU cycles merely correlates with wall-clock time but is not a replacement for it. The cycles estimation is primarily designed as a relative metric to be used for comparison.
  • Iai-Callgrind cannot be run on Windows and platforms not supported by Valgrind.

Improving Iai-Callgrind

No one's perfect!

You want to share your experience with Iai-Callgrind and have a recipe that might be useful for others and fits into this guide? You have an idea for a new feature, are missing a functionality or have found a bug? We would love to hear about it. You want to contribute and hack on Iai-Callgrind?

Please don't hesitate to open an issue.

You want to hack on this guide? The source code of this book lives in the docs subdirectory.

Getting Help

Reach out to us on GitHub Discussions or open an issue in the Iai-Callgrind repository. Check the open and closed issues in the issue board; maybe you can already find a solution to your problem there.

The API documentation can be found on docs.rs, but you might also want to check out the Troubleshooting section in the sidebar of this guide.

Prerequisites

In order to use Iai-Callgrind, you must have Valgrind installed. This means that Iai-Callgrind cannot be used on platforms that are not supported by Valgrind.

Debug Symbols

Iai-Callgrind benchmarks need to be run with debug symbols switched on. For example, in your ~/.cargo/config or your project's Cargo.toml:

[profile.bench]
debug = true

Now, all benchmarks which are run with cargo bench include the debug symbols. (See also Cargo Profiles and Cargo Config).

Settings like strip = true or other configuration options that strip debug symbols need to be disabled explicitly for the bench profile if you have changed them for the release profile. For example:

[profile.release]
strip = true

[profile.bench]
debug = true
strip = false

Valgrind Client Requests

If you want to make use of the mighty Valgrind Client Request Mechanism shipped with Iai-Callgrind, you also need libclang (clang >= 5.0) installed. See also the requirements of bindgen and of cc.

More details on the usage and requirements of Valgrind Client Requests can be found in this chapter of the guide.

Installation of Valgrind

Iai-Callgrind is intentionally independent of a specific version of Valgrind. However, Iai-Callgrind has only been tested with Valgrind versions >= 3.20.0, so it is highly recommended to use a recent version: bugs get fixed, the supported platforms are expanded, and so on. Also, if you want or need to, building Valgrind from source is usually a straightforward process. Just make sure the valgrind binary is in your $PATH so that Iai-Callgrind can find it.
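
As a rough sketch (the exact steps can differ between Valgrind releases, so check the README shipped with the sources), building a release tarball usually boils down to:

# assuming a Valgrind release tarball has been downloaded and unpacked
cd valgrind-3.23.0
./configure --prefix="$HOME/.local"   # pick a prefix whose bin directory is in your $PATH
make
make install
valgrind --version                    # verify the binary is found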

Installation of Valgrind with your package manager

Alpine Linux

apk add valgrind

Arch Linux

pacman -Sy valgrind

Debian/Ubuntu

apt-get install valgrind

Fedora Linux

dnf install valgrind

FreeBSD

pkg install valgrind

Valgrind is packaged for many other distributions as well; see the Packaging status overview.

Iai-Callgrind

Iai-Callgrind is divided into the library iai-callgrind and the benchmark runner iai-callgrind-runner.

Installation of the library

To start with Iai-Callgrind, add the following to your Cargo.toml file:

[dev-dependencies]
iai-callgrind = "0.14.0"

or run

cargo add --dev iai-callgrind@0.14.0

Installation of the benchmark runner

To be able to run the benchmarks you'll also need the iai-callgrind-runner binary installed somewhere in your $PATH. Otherwise, there is no need to interact with iai-callgrind-runner as it is just an implementation detail.

From Source

cargo install --version 0.14.0 iai-callgrind-runner

You can also install the binary somewhere else and point the IAI_CALLGRIND_RUNNER environment variable to the absolute path of the iai-callgrind-runner binary, like so:

cargo install --version 0.14.0 --root /tmp iai-callgrind-runner
IAI_CALLGRIND_RUNNER=/tmp/bin/iai-callgrind-runner cargo bench --bench my-bench

Binstall

The iai-callgrind-runner binary is pre-built for most platforms supported by Valgrind and easily installable with binstall:

cargo binstall iai-callgrind-runner@0.14.0

Updating

When updating the iai-callgrind library, you'll also need to update iai-callgrind-runner and vice-versa or else the benchmark runner will exit with an error.

In the Github CI

Since the iai-callgrind-runner version must match the iai-callgrind library version, it's best to automate this step in the CI. A job step in the GitHub Actions CI could look like this:

- name: Install iai-callgrind-runner
  run: |
    version=$(cargo metadata --format-version=1 |\
      jq '.packages[] | select(.name == "iai-callgrind").version' |\
      tr -d '"'
    )
    cargo install iai-callgrind-runner --version $version

Or, speed up the overall installation time with binstall using the taiki-e/install-action:

- uses: taiki-e/install-action@cargo-binstall
- name: Install iai-callgrind-runner
  run: |
    version=$(cargo metadata --format-version=1 |\
      jq '.packages[] | select(.name == "iai-callgrind").version' |\
      tr -d '"'
    )
    cargo binstall --no-confirm iai-callgrind-runner --version $version

Overview

Iai-Callgrind can be used to benchmark the libraries and binaries of your project's crates. Library and binary benchmarks are treated differently by Iai-Callgrind and cannot be intermixed in the same benchmark file. This is intentional and helps keep things organized. Having multiple, separate benchmark files for library and binary benchmarks is no problem for Iai-Callgrind and is usually a good idea anyway. Benchmarking different binaries in the same benchmark file, however, is fully supported.
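
For example, a project with one library benchmark file and one binary benchmark file (the file names here are just placeholders) would register both in Cargo.toml like this:

[[bench]]
name = "library_benchmark"
harness = false

[[bench]]
name = "binary_benchmark"
harness = false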

Head over to the Quickstart section of library benchmarks if you want to start benchmarking your library functions or to the Quickstart section of binary benchmarks if you want to start benchmarking your crate's binary (binaries).

Binary Benchmarks vs Library Benchmarks

Almost all binary benchmarks can be written as library benchmarks. For example, if you have a main.rs file of your binary, which basically looks like this

mod my_lib { pub fn run() {} }
use my_lib::run;

fn main() {
    run();
}

you could also choose to benchmark the library function my_lib::run in a library benchmark instead of the binary in a binary benchmark. There's no real downside to either of the benchmark schemes and which scheme you want to use heavily depends on the structure of your binary. As a maybe obvious rule of thumb, micro-benchmarks of specific functions should go into library benchmarks and macro-benchmarks into binary benchmarks. Generally, choose the closest access point to the program point you actually want to benchmark.

You should always choose binary benchmarks over library benchmarks if you want to benchmark the behaviour of the executable when the input comes from a pipe, since this feature is exclusive to binary benchmarks. See The Command's stdin and simulating piped input for more.

Library Benchmarks

You want to dive into benchmarking your library? Best start with the Quickstart section and then go through the examples in the other sections of this guide. If you need more examples, see here.

Important default behaviour

The environment variables are cleared before running a library benchmark. Have a look at the Configuration section if you need to change that behavior. Iai-Callgrind sometimes deviates from the Valgrind defaults, which are:

Iai-Callgrind             Valgrind (v3.23)
--trace-children=yes      --trace-children=no
--fair-sched=try          --fair-sched=no
--separate-threads=yes    --separate-threads=no
--cache-sim=yes           --cache-sim=no

The thread- and subprocess-specific Valgrind options basically enable tracing threads and subprocesses, but there is usually some additional configuration necessary to actually collect the metrics of threads and subprocesses.

As shown in the table above, the benchmarks run with cache simulation switched on. This adds run time. If you don't need the cache metrics and the estimation of cycles, you can easily switch cache simulation off, for example with:

#![allow(unused)]
fn main() {
extern crate iai_callgrind;
use iai_callgrind::LibraryBenchmarkConfig;

LibraryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
}

To switch off cache simulation for all benchmarks in the same file:

extern crate iai_callgrind;
mod my_lib { pub fn fibonacci(a: u64) -> u64 { a } }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig
};
use std::hint::black_box;

#[library_benchmark]
fn bench_fibonacci() -> u64 {
    black_box(my_lib::fibonacci(10))
}

library_benchmark_group!(name = fibonacci_group; benchmarks = bench_fibonacci);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
    library_benchmark_groups = fibonacci_group
);
}

Quickstart

Create a file $WORKSPACE_ROOT/benches/library_benchmark.rs and add

[[bench]]
name = "library_benchmark"
harness = false

to your Cargo.toml. harness = false tells cargo not to use the default Rust benchmarking harness, which is important because Iai-Callgrind brings its own benchmarking harness.

Then copy the following content into this file:

extern crate iai_callgrind;
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

#[library_benchmark]
#[bench::short(10)]
#[bench::long(30)]
fn bench_fibonacci(value: u64) -> u64 {
    black_box(fibonacci(value))
}

library_benchmark_group!(
    name = bench_fibonacci_group;
    benchmarks = bench_fibonacci
);

fn main() {
main!(library_benchmark_groups = bench_fibonacci_group);
}

Now that your first library benchmark is set up, you can run it with

cargo bench

and should see something like the below

library_benchmark::bench_fibonacci_group::bench_fibonacci short:10
  Instructions:                1734|N/A             (*********)
  L1 Hits:                     2359|N/A             (*********)
  L2 Hits:                        0|N/A             (*********)
  RAM Hits:                       3|N/A             (*********)
  Total read+write:            2362|N/A             (*********)
  Estimated Cycles:            2464|N/A             (*********)
library_benchmark::bench_fibonacci_group::bench_fibonacci long:30
  Instructions:            26214734|N/A             (*********)
  L1 Hits:                 35638616|N/A             (*********)
  L2 Hits:                        2|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:        35638622|N/A             (*********)
  Estimated Cycles:        35638766|N/A             (*********)
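
The Estimated Cycles metric is derived from the cache metrics. As the numbers above suggest, it follows the common formula Estimated Cycles = L1 Hits + 5 * L2 Hits + 35 * RAM Hits; for the short benchmark that is 2359 + 5 * 0 + 35 * 3 = 2464.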

In addition, you'll find the callgrind output files and the output of other Valgrind tools in target/iai, if you want to investigate further with a tool like callgrind_annotate.
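
For example, to annotate the short benchmark from above (the path below is illustrative; the actual layout follows the target/iai/<package>/<benchmark file>/<group>/<function.id> scheme and the package name depends on your project):

callgrind_annotate \
    target/iai/my_package/library_benchmark/bench_fibonacci_group/bench_fibonacci.short/callgrind.bench_fibonacci.short.out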

When running the same benchmark again, the output will report the differences between the current and the previous run. Say you've made a change to the fibonacci function; then you may see something like this:

library_benchmark::bench_fibonacci_group::bench_fibonacci short:10
  Instructions:                2805|1734            (+61.7647%) [+1.61765x]
  L1 Hits:                     3815|2359            (+61.7211%) [+1.61721x]
  L2 Hits:                        0|0               (No change)
  RAM Hits:                       3|3               (No change)
  Total read+write:            3818|2362            (+61.6427%) [+1.61643x]
  Estimated Cycles:            3920|2464            (+59.0909%) [+1.59091x]
library_benchmark::bench_fibonacci_group::bench_fibonacci long:30
  Instructions:            16201597|26214734        (-38.1966%) [-1.61803x]
  L1 Hits:                 22025876|35638616        (-38.1966%) [-1.61803x]
  L2 Hits:                        2|2               (No change)
  RAM Hits:                       4|4               (No change)
  Total read+write:        22025882|35638622        (-38.1966%) [-1.61803x]
  Estimated Cycles:        22026026|35638766        (-38.1964%) [-1.61803x]

Anatomy of a library benchmark

We're reusing our example from the Quickstart section.

extern crate iai_callgrind;
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

#[library_benchmark]
#[bench::short(10)]
#[bench::long(30)]
fn bench_fibonacci(value: u64) -> u64 {
    black_box(fibonacci(value))
}

library_benchmark_group!(
    name = bench_fibonacci_group;
    benchmarks = bench_fibonacci
);

fn main() {
main!(library_benchmark_groups = bench_fibonacci_group);
}

First of all, you need a public function in your library which you want to benchmark. In this example this is the fibonacci function which, for the sake of simplicity, lives in the benchmark file itself but doesn't have to. If it were located in my_lib::fibonacci, you would simply import that function with use my_lib::fibonacci and go on as shown above. Next, you need a library_benchmark_group! in which you specify the names of the benchmark functions. Finally, the benchmark harness is created by the main! macro.

The benchmark function

The benchmark function has to be annotated with the #[library_benchmark] attribute. The #[bench] attribute is an inner attribute of the #[library_benchmark] attribute. It consists of a mandatory id (the ID part in #[bench::ID(/* ... */)]) and, in its most basic form, an optional list of arguments which are passed to the benchmark function as parameters. Naturally, the parameters of the benchmark function must match the argument list of the #[bench] attribute. It is always a good idea to return something from the benchmark function; here it is the computed u64 value from the fibonacci function wrapped in a black_box. See the docs of std::hint::black_box for more information about its usage. Simply put, all values and variables in the benchmark function (but not in your library function) need to be wrapped in a black_box, except for the input parameters (here value), because Iai-Callgrind already does that for them. But it is not an error to black_box the value again.

The #[bench] attribute accepts any expression, including function calls. The following would have worked too and is one way to avoid the cost of the setup code being attributed to the benchmarked function:

extern crate iai_callgrind;
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

fn some_setup_func(value: u64) -> u64 {
    value + 10
}

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

#[library_benchmark]
#[bench::short(10)]
// Note the usage of the `some_setup_func` in the argument list of this #[bench]
#[bench::long(some_setup_func(20))]
fn bench_fibonacci(value: u64) -> u64 {
    black_box(fibonacci(value))
}

library_benchmark_group!(
   name = bench_fibonacci_group;
   benchmarks = bench_fibonacci
);

fn main() {
main!(library_benchmark_groups = bench_fibonacci_group);
}

Perhaps the most crucial part of setting up library benchmarks is keeping the body of the benchmark functions free of any setup or teardown code. There are other ways to avoid setup and teardown code in the benchmark function, which are discussed in full detail in the setup and teardown section.

The group

The names of the benchmark functions which should be benchmarked, here the single benchmark function bench_fibonacci, need to be specified in the benchmarks parameter of a library_benchmark_group!. You can create as many groups as you like and use them to organize related benchmarks. Each group needs a unique name.

The main macro

Each group you want to be benchmarked needs to be specified in the library_benchmark_groups parameter of the main! macro and you're all set.

The macros in more detail

This section is a brief reference to all the macros available in library benchmarks. Feel free to come back here from other sections if you need a reference. For the complete documentation of each macro see the API documentation.

For the following examples it is assumed that there is a file lib.rs in a crate named my_lib with the following content:

#![allow(unused)]
fn main() {
pub fn bubble_sort(mut array: Vec<i32>) -> Vec<i32> {
    for i in 0..array.len() {
        for j in 0..array.len() - i - 1 {
            if array[j + 1] < array[j] {
                array.swap(j, j + 1);
            }
        }
    }
    array
}
}

The #[library_benchmark] attribute

This attribute needs to be present on all benchmark functions specified in the library_benchmark_group. The benchmark function can then be further annotated with the inner #[bench] or #[benches] attributes.

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

#[library_benchmark]
#[bench::one(vec![1])]
#[benches::multiple(vec![1, 2], vec![1, 2, 3], vec![1, 2, 3, 4])]
fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(values))
}

library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort);
fn main() {
main!(library_benchmark_groups = bubble_sort_group);
}

The following parameters are accepted:

  • config: Takes a LibraryBenchmarkConfig
  • setup: A global setup function which is applied to all following #[bench] and #[benches] attributes if not overwritten by a setup parameter of these attributes.
  • teardown: Similar to setup but takes a global teardown function.
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    OutputFormat
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
           .truncate_description(None)
        )
)]
#[bench::one(vec![1])]
fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(values))
}

library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort);
fn main() {
main!(library_benchmark_groups = bubble_sort_group);
}

The #[bench] attribute

The basic structure is #[bench::some_id(/* parameters */)]. The part after the :: must be an id unique within the same #[library_benchmark]. This attribute accepts the following parameters:

  • args: A tuple with a list of arguments which are passed to the benchmark function. The parentheses also need to be present if there is only a single argument (#[bench::my_id(args = (10))]).
  • config: Accepts a LibraryBenchmarkConfig
  • setup: A function which takes the arguments specified in the args parameter and passes its return value to the benchmark function.
  • teardown: A function which takes the return value of the benchmark function.

If no other parameters besides args are present you can simply pass the arguments as a list of values. So, instead of #[bench::my_id(args = (10, 20))], you could also use the shorter #[bench::my_id(10, 20)].

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig};
use std::hint::black_box;

// This function is used to create a worst case array we want to sort with our implementation of
// bubble sort
pub fn worst_case(start: i32) -> Vec<i32> {
    if start.is_negative() {
        (start..0).rev().collect()
    } else {
        (0..start).rev().collect()
    }
}

#[library_benchmark]
#[bench::one(vec![1])]
#[bench::worst_two(args = (vec![2, 1]))]
#[bench::worst_four(args = (4), setup = worst_case)]
fn bench_bubble_sort(value: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(value))
}

library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort);
fn main() {
main!(library_benchmark_groups = bubble_sort_group);
}

The #[benches] attribute

This attribute is used to specify multiple benchmarks at once. It accepts the same parameters as the #[bench] attribute: args, config, setup and teardown, and additionally the file parameter which is explained in detail here. In contrast to the args parameter of #[bench], args takes an array of arguments.

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig};
use std::hint::black_box;

pub fn worst_case(start: i32) -> Vec<i32> {
    if start.is_negative() {
        (start..0).rev().collect()
    } else {
        (0..start).rev().collect()
    }
}

#[library_benchmark]
#[benches::worst_two_and_three(args = [vec![2, 1], vec![3, 2, 1]])]
#[benches::worst_four_to_nine(args = [4, 5, 6, 7, 8, 9], setup = worst_case)]
fn bench_bubble_sort(value: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(value))
}

library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort);
fn main() {
main!(library_benchmark_groups = bubble_sort_group);
}

The library_benchmark_group! macro

The library_benchmark_group macro accepts the following parameters (in this order and separated by a semicolon):

  • name (mandatory): A unique name used to identify the group for the main! macro
  • config (optional): A LibraryBenchmarkConfig which is applied to all benchmarks within the same group.
  • compare_by_id (optional): The default is false. If true, all benches in the benchmark functions specified in the benchmarks parameter are compared with each other as long as the ids (the part after the :: in #[bench::id(...)]) match. See also Comparing benchmark functions
  • setup (optional): A setup function or any valid expression which is run before all benchmarks of this group
  • teardown (optional): A teardown function or any valid expression which is run after all benchmarks of this group
  • benchmarks (mandatory): A list of comma separated paths of benchmark functions which are annotated with #[library_benchmark]

Note the setup and teardown parameters are different from the ones of #[library_benchmark], #[bench] and #[benches]. They accept an expression or function call as in setup = group_setup_function(). Also, these setup and teardown functions are not overridden by the ones from any of the aforementioned attributes. A group using these parameters is sketched below.
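
Here is a sketch of a group using all of these parameters in the order listed above, assuming my_bench is an existing benchmark function and using simple println! expressions as the group setup and teardown:

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig
};

#[library_benchmark]
fn my_bench() { /* ... */ }

library_benchmark_group!(
    name = my_group;
    config = LibraryBenchmarkConfig::default();
    compare_by_id = false;
    setup = println!("runs before all benchmarks of this group");
    teardown = println!("runs after all benchmarks of this group");
    benchmarks = my_bench
);

fn main() {
main!(library_benchmark_groups = my_group);
}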

The main! macro

This macro is the entry point for Iai-Callgrind and creates the benchmark harness. It accepts the following top-level arguments in this order (separated by a semicolon):

  • config (optional): Optionally specify a LibraryBenchmarkConfig
  • setup (optional): A setup function or any valid expression which is run before all benchmarks
  • teardown (optional): A teardown function or any valid expression which is run after all benchmarks
  • library_benchmark_groups (mandatory): The name of one or more library benchmark groups. Multiple names are separated by a comma.

Like the setup and teardown of the library_benchmark_group, these parameters accept an expression and are not overridden by the setup and teardown of the library_benchmark_group, #[library_benchmark], #[bench] or #[benches] attribute.
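
And a sketch of the main! macro using all top-level arguments in the order given above (my_group is assumed to be an existing library benchmark group):

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig
};

#[library_benchmark]
fn bench() {}

library_benchmark_group!(name = my_group; benchmarks = bench);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default();
    setup = println!("runs before all benchmarks");
    teardown = println!("runs after all benchmarks");
    library_benchmark_groups = my_group
);
}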

setup and teardown

setup and teardown are your bread and butter in library benchmarks. The benchmark functions need to be as clean as possible and almost always only contain the function call to the function of your library which you want to benchmark.

Setup

In an ideal world you don't need any setup code, and you can pass arguments to the function as they are.

But if, for example, a function expects a File and not a &str with the path to the file, you need setup code. Iai-Callgrind has an easy-to-use system in place to allow you to run any setup code before the function is executed, and this setup code is not attributed to the metrics of the benchmark.

If the setup parameter is specified, the setup function takes the arguments from the #[bench] (or #[benches]) attributes and the benchmark function receives the return value of the setup function as parameter. This is a small indirection with great effect. The effect is best shown with an example:

extern crate iai_callgrind;
mod my_lib { pub fn count_bytes_fast(_file: std::fs::File) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;
use std::fs::File;

fn open_file(path: &str) -> File {
    File::open(path).unwrap()
}

#[library_benchmark]
#[bench::first(args = ("path/to/file"), setup = open_file)]
fn count_bytes_fast(file: File) -> u64 {
    black_box(my_lib::count_bytes_fast(file))
}

library_benchmark_group!(name = my_group; benchmarks = count_bytes_fast);
fn main() {
main!(library_benchmark_groups = my_group);
}

You can actually see the effect of using a setup function in the output of the benchmark. Let's assume the above benchmark is in a file benches/my_benchmark.rs, then running

IAI_CALLGRIND_NOCAPTURE=true cargo bench

results in benchmark output like the below.

my_benchmark::my_group::count_bytes_fast first:open_file("path/to/file")
  Instructions:             1630162|N/A             (*********)
  L1 Hits:                  2507933|N/A             (*********)
  L2 Hits:                        2|N/A             (*********)
  RAM Hits:                      11|N/A             (*********)
  Total read+write:         2507946|N/A             (*********)
  Estimated Cycles:         2508328|N/A             (*********)

The description in the headline contains open_file("path/to/file"), your setup function open_file with the value of the parameter it is called with.

If you need to specify the same setup function for all (or almost all) #[bench] and #[benches] in a #[library_benchmark] you can use the setup parameter of the #[library_benchmark]:

extern crate iai_callgrind;
mod my_lib { pub fn count_bytes_fast(_file: std::fs::File) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;
use std::fs::File;
use std::io::{Seek, SeekFrom};

fn open_file(path: &str) -> File {
    File::open(path).unwrap()
}

fn open_file_with_offset(path: &str, offset: u64) -> File {
    let mut file = File::open(path).unwrap();
    file.seek(SeekFrom::Start(offset)).unwrap();
    file
}

#[library_benchmark(setup = open_file)]
#[bench::small("path/to/small")]
#[bench::big("path/to/big")]
#[bench::with_offset(args = ("path/to/big", 100), setup = open_file_with_offset)]
fn count_bytes_fast(file: File) -> u64 {
    black_box(my_lib::count_bytes_fast(file))
}

library_benchmark_group!(name = my_group; benchmarks = count_bytes_fast);
fn main() {
main!(library_benchmark_groups = my_group);
}

The above will use the open_file function in the small and big benchmarks and the open_file_with_offset function in the with_offset benchmark.

Teardown

What about teardown, and why should you use it? Usually the teardown isn't needed, but if, for example, you intend to make the result of the benchmark visible in the benchmark output, the teardown is the perfect place to do so.

The teardown function takes the return value of the benchmark function as its argument:

extern crate iai_callgrind;
mod my_lib { pub fn count_bytes_fast(_file: std::fs::File) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;
use std::fs::File;

fn open_file(path: &str) -> File {
    File::open(path).unwrap()
}

fn print_bytes_read(num_bytes: u64) {
    println!("bytes read: {num_bytes}");
}

#[library_benchmark]
#[bench::first(
    args = ("path/to/big"),
    setup = open_file,
    teardown = print_bytes_read
)]
fn count_bytes_fast(file: File) -> u64 {
    black_box(my_lib::count_bytes_fast(file))
}

library_benchmark_group!(name = my_group; benchmarks = count_bytes_fast);
fn main() {
main!(library_benchmark_groups = my_group);
}

Note that Iai-Callgrind captures all output by default. In order to actually see the output of the benchmark, setup and teardown functions, the benchmarks need to be run with the --nocapture flag or with the environment variable IAI_CALLGRIND_NOCAPTURE=true set. Let's assume the above benchmark is in a file benches/my_benchmark.rs; then running

IAI_CALLGRIND_NOCAPTURE=true cargo bench

results in output like the below

my_benchmark::my_group::count_bytes_fast first:open_file("path/to/big")
bytes read: 25078
- end of stdout/stderr
  Instructions:             1630162|N/A             (*********)
  L1 Hits:                  2507931|N/A             (*********)
  L2 Hits:                        2|N/A             (*********)
  RAM Hits:                      13|N/A             (*********)
  Total read+write:         2507946|N/A             (*********)
  Estimated Cycles:         2508396|N/A             (*********)

The output of the teardown function is now visible in the benchmark output above the - end of stdout/stderr line.

Specifying multiple benches at once

Multiple benches can be specified at once with the #[benches] attribute.

The #[benches] attribute in more detail

Let's start with an example:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;
use my_lib::bubble_sort;

fn setup_worst_case_array(start: i32) -> Vec<i32> {
    if start.is_negative() {
        (start..0).rev().collect()
    } else {
        (0..start).rev().collect()
    }
}

#[library_benchmark]
#[benches::multiple(vec![1], vec![5])]
#[benches::with_setup(args = [1, 5], setup = setup_worst_case_array)]
fn bench_bubble_sort_with_benches_attribute(input: Vec<i32>) -> Vec<i32> {
    black_box(bubble_sort(input))
}

library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort_with_benches_attribute);
fn main () {
main!(library_benchmark_groups = my_group);
}

Usually, the arguments are passed directly to the benchmark function, as can be seen in the #[benches::multiple(/* arguments */)] case. In #[benches::with_setup(/* ... */)], the arguments are passed to the setup function instead. The above #[library_benchmark] is pretty much the same as

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;
use my_lib::bubble_sort;

fn setup_worst_case_array(start: i32) -> Vec<i32> {
    if start.is_negative() {
        (start..0).rev().collect()
    } else {
        (0..start).rev().collect()
    }
}

#[library_benchmark]
#[bench::multiple_0(vec![1])]
#[bench::multiple_1(vec![5])]
#[bench::with_setup_0(setup_worst_case_array(1))]
#[bench::with_setup_1(setup_worst_case_array(5))]
fn bench_bubble_sort_with_benches_attribute(input: Vec<i32>) -> Vec<i32> {
    black_box(bubble_sort(input))
}

library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort_with_benches_attribute);
fn main () {
main!(library_benchmark_groups = my_group);
}

but a lot more concise, especially if a lot of values are passed to the same setup function.

The file parameter

Reading inputs from a file allows, for example, sharing the same inputs between different benchmarking frameworks like criterion. Or, if you simply have a long list of inputs, you might find it more convenient to read them from a file.

The file parameter, exclusive to the #[benches] attribute, does exactly that: it reads the specified file line by line, creating a benchmark from each line. Each line is passed to the benchmark function as a String or, if the setup parameter is also present, to the setup function. A small example, assuming you have a file benches/inputs (relative paths are interpreted relative to the workspace root) with the following content

1
11
111

then

extern crate iai_callgrind;
mod my_lib { pub fn string_to_u64(value: String) -> Result<u64, String> { Ok(1) } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

#[library_benchmark]
#[benches::from_file(file = "benches/inputs")]
fn some_bench(line: String) -> Result<u64, String> {
    black_box(my_lib::string_to_u64(line))
}

library_benchmark_group!(name = my_group; benchmarks = some_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

The above is roughly equivalent to the following, but with the args parameter:

extern crate iai_callgrind;
mod my_lib { pub fn string_to_u64(value: String) -> Result<u64, String> { Ok(1) } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

#[library_benchmark]
#[benches::from_args(args = [1.to_string(), 11.to_string(), 111.to_string()])]
fn some_bench(line: String) -> Result<u64, String> {
    black_box(my_lib::string_to_u64(line))
}

library_benchmark_group!(name = my_group; benchmarks = some_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

The true power of the file parameter comes in combination with the setup function, because you can format the lines in the file as you like and convert each line in the setup function to the format you need in the benchmark. For example, suppose you decided to go with a CSV-like format in the file benches/inputs

255;255;255
0;0;0

and your library has a function which converts from RGB to HSV color space:

extern crate iai_callgrind;
mod my_lib { pub fn rgb_to_hsv(a: u8, b: u8, c:u8) -> (u16, u8, u8) { (a.into(), b, c) } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

fn decode_line(line: String) -> (u8, u8, u8) {
    if let &[a, b, c] = line.split(";")
        .map(|s| s.parse::<u8>().unwrap())
        .collect::<Vec<u8>>()
        .as_slice() 
    {
        (a, b, c)
    } else {
        panic!("Wrong input format in line '{line}'");
    }
}

#[library_benchmark]
#[benches::from_file(file = "benches/inputs", setup = decode_line)]
fn some_bench((a, b, c): (u8, u8, u8)) -> (u16, u8, u8) {
    black_box(my_lib::rgb_to_hsv(black_box(a), black_box(b), black_box(c)))
}

library_benchmark_group!(name = my_group; benchmarks = some_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

Generic benchmark functions

Benchmark functions can be generic, and so can setup and teardown functions. There's actually not much more to say about it, since generic benchmark (setup and teardown) functions behave exactly the same way as you would expect from any other generic function.

However, there is a common pitfall. Suppose you have a function count_lines_in_file_fast which expects a PathBuf as parameter. Although the following is convenient, especially when you have to specify many paths, don't do this:

extern crate iai_callgrind;
mod my_lib { pub fn count_lines_in_file_fast(_path: std::path::PathBuf) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;

#[library_benchmark]
#[bench::first("path/to/file")]
fn generic_bench<T>(path: T) -> u64 where T: Into<PathBuf> {
    black_box(my_lib::count_lines_in_file_fast(black_box(path.into())))
}

library_benchmark_group!(name = my_group; benchmarks = generic_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

Since path.into() is called in the benchmark function itself, the conversion from a &str to a PathBuf is attributed to the benchmark metrics. This is almost never what you intend. You should instead convert the argument to a PathBuf in a generic setup function, like this:

extern crate iai_callgrind;
mod my_lib { pub fn count_lines_in_file_fast(_path: std::path::PathBuf) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

use std::hint::black_box;
use std::path::PathBuf;

fn convert_to_pathbuf<T>(path: T) -> PathBuf where T: Into<PathBuf> {
    path.into()
}

#[library_benchmark]
#[bench::first(args = ("path/to/file"), setup = convert_to_pathbuf)]
fn not_generic_anymore(path: PathBuf) -> u64 {
    black_box(my_lib::count_lines_in_file_fast(path))
}

library_benchmark_group!(name = my_group; benchmarks = not_generic_anymore);
fn main() {
main!(library_benchmark_groups = my_group);
}

That way you can still enjoy the convenience of using string literals instead of PathBuf in your #[bench] (or #[benches]) arguments and still get clean benchmark metrics.

Comparing benchmark functions

Comparing benchmark functions is supported via the optional library_benchmark_group! argument compare_by_id (the default value for compare_by_id is false). Only benches with the same id are compared, which allows you to single out cases which don't need to be compared. In the following example, the case_3 and multiple benches are compared with each other in addition to the usual comparison with the previous run:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

#[library_benchmark]
#[bench::case_3(vec![1, 2, 3])]
#[benches::multiple(args = [vec![1, 2], vec![1, 2, 3, 4]])]
fn bench_bubble_sort_best_case(input: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(input))
}

#[library_benchmark]
#[bench::case_3(vec![3, 2, 1])]
#[benches::multiple(args = [vec![2, 1], vec![4, 3, 2, 1]])]
fn bench_bubble_sort_worst_case(input: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(input))
}

library_benchmark_group!(
    name = bench_bubble_sort;
    compare_by_id = true;
    benchmarks = bench_bubble_sort_best_case, bench_bubble_sort_worst_case
);

fn main() {
main!(library_benchmark_groups = bench_bubble_sort);
}

Note that if compare_by_id is true, all benchmark functions are compared with each other, so you are not limited to two benchmark functions per comparison group.

Here's the benchmark output of the above example to see what is happening:

my_benchmark::bubble_sort_group::bubble_sort_best_case case_2:vec! [1, 2]
  Instructions:                  63|N/A             (*********)
  L1 Hits:                       86|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:              91|N/A             (*********)
  Estimated Cycles:             231|N/A             (*********)
my_benchmark::bubble_sort_group::bubble_sort_best_case multiple_0:vec! [1, 2, 3]
  Instructions:                  94|N/A             (*********)
  L1 Hits:                      123|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:             128|N/A             (*********)
  Estimated Cycles:             268|N/A             (*********)
my_benchmark::bubble_sort_group::bubble_sort_best_case multiple_1:vec! [1, 2, 3, 4]
  Instructions:                 136|N/A             (*********)
  L1 Hits:                      174|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:             179|N/A             (*********)
  Estimated Cycles:             319|N/A             (*********)
my_benchmark::bubble_sort_group::bubble_sort_worst_case case_2:vec! [2, 1]
  Instructions:                  66|N/A             (*********)
  L1 Hits:                       91|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:              96|N/A             (*********)
  Estimated Cycles:             236|N/A             (*********)
  Comparison with bubble_sort_best_case case_2:vec! [1, 2]
  Instructions:                  63|66              (-4.54545%) [-1.04762x]
  L1 Hits:                       86|91              (-5.49451%) [-1.05814x]
  L2 Hits:                        1|1               (No change)
  RAM Hits:                       4|4               (No change)
  Total read+write:              91|96              (-5.20833%) [-1.05495x]
  Estimated Cycles:             231|236             (-2.11864%) [-1.02165x]
my_benchmark::bubble_sort_group::bubble_sort_worst_case multiple_0:vec! [3, 2, 1]
  Instructions:                 103|N/A             (*********)
  L1 Hits:                      138|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:             143|N/A             (*********)
  Estimated Cycles:             283|N/A             (*********)
  Comparison with bubble_sort_best_case multiple_0:vec! [1, 2, 3]
  Instructions:                  94|103             (-8.73786%) [-1.09574x]
  L1 Hits:                      123|138             (-10.8696%) [-1.12195x]
  L2 Hits:                        1|1               (No change)
  RAM Hits:                       4|4               (No change)
  Total read+write:             128|143             (-10.4895%) [-1.11719x]
  Estimated Cycles:             268|283             (-5.30035%) [-1.05597x]
my_benchmark::bubble_sort_group::bubble_sort_worst_case multiple_1:vec! [4, 3, 2, 1]
  Instructions:                 154|N/A             (*********)
  L1 Hits:                      204|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       4|N/A             (*********)
  Total read+write:             209|N/A             (*********)
  Estimated Cycles:             349|N/A             (*********)
  Comparison with bubble_sort_best_case multiple_1:vec! [1, 2, 3, 4]
  Instructions:                 136|154             (-11.6883%) [-1.13235x]
  L1 Hits:                      174|204             (-14.7059%) [-1.17241x]
  L2 Hits:                        1|1               (No change)
  RAM Hits:                       4|4               (No change)
  Total read+write:             179|209             (-14.3541%) [-1.16760x]
  Estimated Cycles:             319|349             (-8.59599%) [-1.09404x]

The procedure of the comparison algorithm:

  1. Run all benches in the first benchmark function
  2. Run the first bench in the second benchmark function and if there is a bench in the first benchmark function with the same id compare them
  3. Run the second bench in the second benchmark function ...
  4. ...
  5. Run the first bench in the third benchmark function and if there is a bench in the first benchmark function with the same id compare them. If there is a bench with the same id in the second benchmark function compare them.
  6. Run the second bench in the third benchmark function ...
  7. and so on ... until all benches are compared with each other

Neither the order nor the number of benches within the benchmark functions matters, so it is not strictly necessary to mirror the bench ids of the first benchmark function in the second, third, etc. benchmark function.

Configuration

Library benchmarks can be configured with the LibraryBenchmarkConfig and with Command-line arguments and Environment variables.

The LibraryBenchmarkConfig can be specified at different levels and sets the configuration values for the same and lower levels. The values of the LibraryBenchmarkConfig at higher levels can be overridden at a lower level. Note that some values are additive rather than substitutive. Please see the docs of the respective functions in LibraryBenchmarkConfig for more details.

The different levels where a LibraryBenchmarkConfig can be specified are:

  • At top-level with the main! macro
extern crate iai_callgrind;
use iai_callgrind::{library_benchmark, library_benchmark_group};
use iai_callgrind::{main, LibraryBenchmarkConfig};

#[library_benchmark] fn bench() {}
library_benchmark_group!(name = my_group; benchmarks = bench);
fn main() {
main!(
    config = LibraryBenchmarkConfig::default();
    library_benchmark_groups = my_group
);
}
  • At group-level in the library_benchmark_group! macro
extern crate iai_callgrind;
use iai_callgrind::library_benchmark;
use iai_callgrind::{main, LibraryBenchmarkConfig, library_benchmark_group};

#[library_benchmark] fn bench() {}
library_benchmark_group!(
    name = my_group;
    config = LibraryBenchmarkConfig::default();
    benchmarks = bench
);

fn main() {
main!(library_benchmark_groups = my_group);
}
  • At #[library_benchmark] level
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    main, LibraryBenchmarkConfig, library_benchmark_group, library_benchmark
};
use std::hint::black_box;

#[library_benchmark(config = LibraryBenchmarkConfig::default())] 
fn bench() {
    /* ... */
}

library_benchmark_group!(
    name = my_group;
    config = LibraryBenchmarkConfig::default();
    benchmarks = bench
);

fn main() {
main!(library_benchmark_groups = my_group);
}
  • and at #[bench], #[benches] level
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    main, LibraryBenchmarkConfig, library_benchmark_group, library_benchmark
};
use std::hint::black_box;

#[library_benchmark] 
#[bench::some_id(args = (1, 2), config = LibraryBenchmarkConfig::default())]
#[benches::multiple(
    args = [(3, 4), (5, 6)], 
    config = LibraryBenchmarkConfig::default()
)]
fn bench(a: u8, b: u8) {
    /* ... */
    _ = (a, b);
}

library_benchmark_group!(
    name = my_group;
    config = LibraryBenchmarkConfig::default();
    benchmarks = bench
);

fn main() {
main!(library_benchmark_groups = my_group);
}

Custom entry points

The EntryPoint can be set to EntryPoint::None, which disables the entry point, EntryPoint::Default, which uses the benchmark function as the entry point, or EntryPoint::Custom, which will be discussed in more detail in this chapter.

To understand custom entry points let's take a small detour into how Callgrind and Iai-Callgrind work under the hood.

Iai-Callgrind under the hood

Callgrind collects metrics and associates them with a function. This happens based on the compiled code, not the source code, so it is possible to hook into any function, not only public functions. Callgrind can be configured to switch collection on and off based on a function name with --toggle-collect. By default, Iai-Callgrind sets this toggle (which we call the EntryPoint) to the benchmark function. Setting the toggle implies --collect-atstart=no, so all events before the benchmark function (in the setup) and after it (in the teardown) are not collected. Somewhat simplified, but conveying the basic idea, here is a commented example:

// <-- collect-at-start=no

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{main,library_benchmark_group, library_benchmark};
use std::hint::black_box;

#[library_benchmark]
fn bench() -> Vec<i32> { // <-- DEFAULT ENTRY POINT starts collecting events
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
} // <-- stop collecting events

library_benchmark_group!( name = my_group; benchmarks = bench);
fn main() {
main!(library_benchmark_groups = my_group);
}

Pitfall: Inlined functions

The fact that Callgrind acts on the compiled code harbors a pitfall. With compile-time optimizations switched on (which is usually the case when compiling benchmarks), the compiler inlines functions if it sees an advantage in doing so. Iai-Callgrind takes care that this doesn't happen with the benchmark function, so Callgrind can find and hook into it. But in your production code you usually don't want to stop the compiler from doing its job just to be able to benchmark a function. So, be cautious with benchmarking private functions and only choose functions which are known not to be inlined.

Hook into private functions

The basic idea is to choose a public function in your library acting as access point to the actual function you want to benchmark. As outlined before, this only works reliably for functions which are not inlined by the compiler.

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

mod my_lib {
     #[inline(never)]
     fn bubble_sort(input: Vec<i32>) -> Vec<i32> {
         // The algorithm
       input
     }

     pub fn access_point(input: Vec<i32>) -> Vec<i32> {
         println!("Doing something before the function call");
         bubble_sort(input)
     }
}

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::Custom("*::my_lib::bubble_sort".to_owned()))
)]
#[bench::small(vec![3, 2, 1])]
#[bench::bigger(vec![5, 4, 3, 2, 1])]
fn bench_private(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::access_point(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_private);
fn main() {
main!(library_benchmark_groups = my_group);
}

Note the #[inline(never)] we use in this example to make sure the bubble_sort function is not getting inlined.

We use the wildcard pattern *::my_lib::bubble_sort for EntryPoint::Custom for demonstration purposes. You might want to tighten this pattern. If you don't know what the pattern looks like, use EntryPoint::None first and then run the benchmark. Now, investigate the callgrind output file. This output file is pretty low-level, but all you need to do is search for the entries which start with fn=.... In the example above this entry might look like fn=algorithms::my_lib::bubble_sort if my_lib were part of a top-level algorithms module. Or, using grep:

grep '^fn=.*::bubble_sort$' target/iai/the_package/benchmark_file_name/my_group/bench_private.bigger/callgrind.bench_private.bigger.out

Having found the pattern, you can eventually use EntryPoint::Custom.
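
For the investigation step, temporarily disabling the entry point is a small change to the example above. A sketch:

extern crate iai_callgrind;
mod my_lib { pub fn access_point(input: Vec<i32>) -> Vec<i32> { input } }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

// Run once with `EntryPoint::None`, inspect the `fn=...` entries in the
// callgrind output file and then switch back to `EntryPoint::Custom`.
#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
)]
#[bench::small(vec![3, 2, 1])]
fn bench_private(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::access_point(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_private);
fn main() {
main!(library_benchmark_groups = my_group);
}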

Multi-threaded and multi-process applications

The default is to run Iai-Callgrind benchmarks with --separate-threads=yes and --trace-children=yes switched on. This enables Iai-Callgrind to trace threads and subprocesses, respectively. Note that --separate-threads=yes is not strictly necessary to be able to trace threads, but if the threads are separated, Iai-Callgrind can collect and display the metrics for each thread. Due to the way Callgrind applies data collection options like --toggle-collect, --collect-atstart, ..., further configuration is needed in library benchmarks.

To actually see the collected metrics in the terminal output for all threads and/or subprocesses you can switch on OutputFormat::show_intermediate:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: u64) -> Vec<u64> { vec![]} }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    OutputFormat
};
use std::hint::black_box;

#[library_benchmark]
fn bench_threads() -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(2))
}

library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
            .show_intermediate(true)
        );
    library_benchmark_groups = my_group
);
}

The best method for benchmarking threads and subprocesses depends heavily on your code. So, rather than suggesting a single "best" method, this chapter will run through various possible approaches and try to highlight the pros and cons of each.

Multi-threaded applications

Callgrind treats each thread and process as a separate unit and applies the data collection options to each unit. In library benchmarks, the entry point (or default toggle) for Callgrind is by default set to the benchmark function with the help of the --toggle-collect option. Setting --toggle-collect also automatically sets --collect-atstart=no. If not further customized for a benchmarked multi-threaded function, these options cause the metrics for the spawned threads to be zero. This happens since each thread is a separate unit with --collect-atstart=no and the default toggle applied to it; the default toggle is set to the benchmark function and does not hook into any function in the thread, so the metrics are zero.

There are multiple ways to customize the default behaviour and actually measure the threads. For the following examples, we're using the benchmark and library code below to show the different customization options, assuming this code lives in a benchmark file benches/lib_bench_threads.rs:

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    OutputFormat
};
use std::hint::black_box;

/// Suppose this is your library
pub mod my_lib {
    /// Return true if `num` is a prime number
    pub fn is_prime(num: u64) -> bool {
        if num <= 1 {
            return false;
        }

        for i in 2..=(num as f64).sqrt() as u64 {
            if num % i == 0 {
                return false;
            }
        }

        true
    }

    /// Find and return all prime numbers in the inclusive range `low` to `high`
    pub fn find_primes(low: u64, high: u64) -> Vec<u64> {
        (low..=high).filter(|n| is_prime(*n)).collect()
    }

    /// Return the prime numbers in the range `0..(num_threads * 10000)`
    pub fn find_primes_multi_thread(num_threads: usize) -> Vec<u64> {
        let mut handles = vec![];
        let mut low = 0;
        for _ in 0..num_threads {
            let handle = std::thread::spawn(move || find_primes(low, low + 10000));
            handles.push(handle);

            low += 10000;
        }

        let mut primes = vec![];
        for handle in handles {
            let result = handle.join();
            primes.extend(result.unwrap())
        }

        primes
    }
}

#[library_benchmark]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}

library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
            .show_intermediate(true)
        );
    library_benchmark_groups = my_group
);
}

Running this benchmark with cargo bench will present you with the following terminal output:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2097219 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27305|N/A                  (*********)
  L1 Hits:                            66353|N/A                  (*********)
  L2 Hits:                              341|N/A                  (*********)
  RAM Hits:                             539|N/A                  (*********)
  Total read+write:                   67233|N/A                  (*********)
  Estimated Cycles:                   86923|N/A                  (*********)
  ## pid: 2097219 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## pid: 2097219 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## Total
  Instructions:                       27305|N/A                  (*********)
  L1 Hits:                            66353|N/A                  (*********)
  L2 Hits:                              341|N/A                  (*********)
  RAM Hits:                             539|N/A                  (*********)
  Total read+write:                   67233|N/A                  (*********)
  Estimated Cycles:                   86923|N/A                  (*********)

As you can see, the counts for threads 2 and 3 (our spawned threads) are all zero.

Measuring threads using toggles

At first glance, setting a toggle for the function running in the thread seems to be the easiest way and can be done like so:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }}
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--toggle-collect=lib_bench_threads::my_lib::find_primes"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}
library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(library_benchmark_groups = my_group);
}

This approach may or may not work, depending on whether the compiler inlines the target function of the --toggle-collect argument or not. This is the same problem as with custom entry points. As can be seen below, the compiler has chosen to inline find_primes and the metrics for the threads are still zero:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2620776 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27372|N/A                  (*********)
  L1 Hits:                            66431|N/A                  (*********)
  L2 Hits:                              343|N/A                  (*********)
  RAM Hits:                             538|N/A                  (*********)
  Total read+write:                   67312|N/A                  (*********)
  Estimated Cycles:                   86976|N/A                  (*********)
  ## pid: 2620776 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## pid: 2620776 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## Total
  Instructions:                       27372|N/A                  (*********)
  L1 Hits:                            66431|N/A                  (*********)
  L2 Hits:                              343|N/A                  (*********)
  RAM Hits:                             538|N/A                  (*********)
  Total read+write:                   67312|N/A                  (*********)
  Estimated Cycles:                   86976|N/A                  (*********)

Just to show what would happen if the compiler did not inline the find_primes function, we temporarily annotate it with #[inline(never)]:

#![allow(unused)]
fn main() {
fn is_prime(_: u64) -> bool { true }
/// Find and return all prime numbers in the inclusive range `low` to `high`
#[inline(never)]
pub fn find_primes(low: u64, high: u64) -> Vec<u64> {
    (low..=high).filter(|n| is_prime(*n)).collect()
}
}

Now, running the benchmark does show the desired metrics:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2661917 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27372|N/A                  (*********)
  L1 Hits:                            66431|N/A                  (*********)
  L2 Hits:                              343|N/A                  (*********)
  RAM Hits:                             538|N/A                  (*********)
  Total read+write:                   67312|N/A                  (*********)
  Estimated Cycles:                   86976|N/A                  (*********)
  ## pid: 2661917 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2460503|N/A                  (*********)
  L1 Hits:                          2534938|N/A                  (*********)
  L2 Hits:                               12|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                 2535136|N/A                  (*********)
  Estimated Cycles:                 2541508|N/A                  (*********)
  ## pid: 2661917 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3650410|N/A                  (*********)
  L1 Hits:                          3724286|N/A                  (*********)
  L2 Hits:                                4|N/A                  (*********)
  RAM Hits:                             130|N/A                  (*********)
  Total read+write:                 3724420|N/A                  (*********)
  Estimated Cycles:                 3728856|N/A                  (*********)
  ## Total
  Instructions:                     6138285|N/A                  (*********)
  L1 Hits:                          6325655|N/A                  (*********)
  L2 Hits:                              359|N/A                  (*********)
  RAM Hits:                             854|N/A                  (*********)
  Total read+write:                 6326868|N/A                  (*********)
  Estimated Cycles:                 6357340|N/A                  (*********)

But annotating functions with #[inline(never)] in production code is usually not an option, and preventing the compiler from doing its job is not the preferred way to make a benchmark work. The truth is, there is no way to make the --toggle-collect argument work in all cases; whether it works depends heavily on the inlining choices the compiler makes for your code.

Another way to get the thread metrics is to set --collect-atstart=yes and turn off the EntryPoint:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }}
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
        .callgrind_args(["--collect-atstart=yes"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}
library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(library_benchmark_groups = my_group);
}

But, the metrics of the main thread will include all the setup (and teardown) code from the benchmark executable (so the instructions of the main thread go up from 27372 to 404425):

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2697019 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                      404425|N/A                  (*********)
  L1 Hits:                           570186|N/A                  (*********)
  L2 Hits:                             1307|N/A                  (*********)
  RAM Hits:                            4856|N/A                  (*********)
  Total read+write:                  576349|N/A                  (*********)
  Estimated Cycles:                  746681|N/A                  (*********)
  ## pid: 2697019 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2466864|N/A                  (*********)
  L1 Hits:                          2543314|N/A                  (*********)
  L2 Hits:                               81|N/A                  (*********)
  RAM Hits:                             409|N/A                  (*********)
  Total read+write:                 2543804|N/A                  (*********)
  Estimated Cycles:                 2558034|N/A                  (*********)
  ## pid: 2697019 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3656729|N/A                  (*********)
  L1 Hits:                          3732802|N/A                  (*********)
  L2 Hits:                               31|N/A                  (*********)
  RAM Hits:                             201|N/A                  (*********)
  Total read+write:                 3733034|N/A                  (*********)
  Estimated Cycles:                 3739992|N/A                  (*********)
  ## Total
  Instructions:                     6528018|N/A                  (*********)
  L1 Hits:                          6846302|N/A                  (*********)
  L2 Hits:                             1419|N/A                  (*********)
  RAM Hits:                            5466|N/A                  (*********)
  Total read+write:                 6853187|N/A                  (*********)
  Estimated Cycles:                 7044707|N/A                  (*********)

Additionally, expect a lot of metric changes whenever the benchmarks themselves are changed. However, if the metrics of the main thread are insignificant compared to the total, this might still be an acceptable last resort.

A more reliable way is shown in the next section.

Measuring threads using client requests

Perhaps the most reliable and flexible way to measure threads is to use client requests. The downside is that you have to put some benchmark code into your production code. But if you followed the installation instructions for client requests, this additional code is only compiled in benchmarks, not in your final production-ready library.

Using the callgrind client request, we adjust the threads in the find_primes_multi_thread function like so:

#![allow(unused)]
fn main() {
fn find_primes(_a: u64, _b: u64) -> Vec<u64> { vec![] }
extern crate iai_callgrind;
use iai_callgrind::client_requests::callgrind;

/// Return the prime numbers in the range `0..(num_threads * 10000)`
pub fn find_primes_multi_thread(num_threads: usize) -> Vec<u64> {
    let mut handles = vec![];
    let mut low = 0;
    for _ in 0..num_threads {
        let handle = std::thread::spawn(move || {
            callgrind::toggle_collect();
            let result = find_primes(low, low + 10000);
            callgrind::toggle_collect();
            result
        });
        handles.push(handle);

        low += 10000;
    }

    let mut primes = vec![];
    for handle in handles {
        let result = handle.join();
        primes.extend(result.unwrap())
    }

    primes
}
}

and running the same benchmark now will show the collected metrics of the threads:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2149242 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27305|N/A                  (*********)
  L1 Hits:                            66352|N/A                  (*********)
  L2 Hits:                              344|N/A                  (*********)
  RAM Hits:                             537|N/A                  (*********)
  Total read+write:                   67233|N/A                  (*********)
  Estimated Cycles:                   86867|N/A                  (*********)
  ## pid: 2149242 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2460501|N/A                  (*********)
  L1 Hits:                          2534935|N/A                  (*********)
  L2 Hits:                               13|N/A                  (*********)
  RAM Hits:                             185|N/A                  (*********)
  Total read+write:                 2535133|N/A                  (*********)
  Estimated Cycles:                 2541475|N/A                  (*********)
  ## pid: 2149242 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3650408|N/A                  (*********)
  L1 Hits:                          3724285|N/A                  (*********)
  L2 Hits:                                1|N/A                  (*********)
  RAM Hits:                             131|N/A                  (*********)
  Total read+write:                 3724417|N/A                  (*********)
  Estimated Cycles:                 3728875|N/A                  (*********)
  ## Total
  Instructions:                     6138214|N/A                  (*********)
  L1 Hits:                          6325572|N/A                  (*********)
  L2 Hits:                              358|N/A                  (*********)
  RAM Hits:                             853|N/A                  (*********)
  Total read+write:                 6326783|N/A                  (*********)
  Estimated Cycles:                 6357217|N/A                  (*********)

Using the client request toggles is very flexible since you can put the iai_callgrind::client_requests::callgrind::toggle_collect instructions anywhere in the threads. In this example, we just have a single function in the thread, but if your threads consist of more than just a single function, you can easily exclude uninteresting parts from the final measurements.
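
For example, if a thread first builds up some input data that should not be measured, only the interesting part can be wrapped in the toggles. This is just a sketch with placeholder functions (prepare_input, compute), not taken from the example above:

#![allow(unused)]
fn main() {
extern crate iai_callgrind;
use iai_callgrind::client_requests::callgrind;

fn prepare_input() -> Vec<u64> { (0..10_000).collect() }
fn compute(input: &[u64]) -> u64 { input.iter().sum() }

fn spawn_worker() -> std::thread::JoinHandle<u64> {
    std::thread::spawn(|| {
        // Not measured: the setup part of the thread
        let input = prepare_input();

        // Only the interesting computation is wrapped in the toggles
        callgrind::toggle_collect();
        let result = compute(&input);
        callgrind::toggle_collect();

        result
    })
}
}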

If you want to prevent the code of the main thread from being measured, you can use the following:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }}
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
        .callgrind_args(["--collect-atstart=no"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}
library_benchmark_group!(name = my_group; benchmarks = bench_threads);
fn main() {
main!(library_benchmark_groups = my_group);
}

Setting EntryPoint::None disables the default toggle, but it also disables the automatically applied --collect-atstart=no, which is why we have to set that option manually. Altogether, running the benchmark will show:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2251257 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## pid: 2251257 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2460501|N/A                  (*********)
  L1 Hits:                          2534935|N/A                  (*********)
  L2 Hits:                               11|N/A                  (*********)
  RAM Hits:                             187|N/A                  (*********)
  Total read+write:                 2535133|N/A                  (*********)
  Estimated Cycles:                 2541535|N/A                  (*********)
  ## pid: 2251257 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3650408|N/A                  (*********)
  L1 Hits:                          3724282|N/A                  (*********)
  L2 Hits:                                4|N/A                  (*********)
  RAM Hits:                             131|N/A                  (*********)
  Total read+write:                 3724417|N/A                  (*********)
  Estimated Cycles:                 3728887|N/A                  (*********)
  ## Total
  Instructions:                     6110909|N/A                  (*********)
  L1 Hits:                          6259217|N/A                  (*********)
  L2 Hits:                               15|N/A                  (*********)
  RAM Hits:                             318|N/A                  (*********)
  Total read+write:                 6259550|N/A                  (*********)
  Estimated Cycles:                 6270422|N/A                  (*********)

Multi-process applications

Measuring multi-process applications is in principle not that different from measuring multi-threaded applications, since subprocesses, just like threads, are separate units. As for threads, the data collection options are applied to subprocesses separately from the main process.

Note that there are multiple valgrind command-line arguments which can disable the collection of metrics for uninteresting subprocesses, for example subprocesses that are spawned by your library function but are not part of your library/binary crate.
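
For example, Valgrind's --trace-children-skip option takes a comma-separated list of patterns and skips tracing of executables matching one of them. The following is a minimal sketch (my_lib::run_external_tools is a placeholder for a function which shells out to programs we don't want to measure):

extern crate iai_callgrind;
mod my_lib { pub fn run_external_tools() {} }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        // Don't trace `/bin/sh` and anything installed under `/usr/bin`
        .callgrind_args(["--trace-children-skip=/bin/sh,/usr/bin/*"])
)]
fn bench_with_subprocesses() {
    black_box(my_lib::run_external_tools())
}

library_benchmark_group!(name = my_group; benchmarks = bench_with_subprocesses);
fn main() {
main!(library_benchmark_groups = my_group);
}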

For the following examples, suppose the code below is the cat binary and part of a crate (so we can use env!("CARGO_BIN_EXE_cat")):

use std::fs::File;
use std::io::{copy, stdout, BufReader, BufWriter, Write};

fn main() {
fn main() {
    let mut args_iter = std::env::args().skip(1);
    let file_arg = args_iter.next().expect("File argument should be present");

    let file = File::open(file_arg).expect("Opening file should succeed");
    let stdout = stdout().lock();

    let mut writer = BufWriter::new(stdout);
    copy(&mut BufReader::new(file), &mut writer)
        .expect("Printing file to stdout should succeed");

    writer.flush().expect("Flushing writer should succeed");
}
}

The above binary is a very simple version of cat, taking a single file argument. The file content is read and dumped to stdout. The following is the benchmark and library code used to show the different options, assuming this code is stored in the benchmark file benches/lib_bench_subprocess.rs:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;

use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    OutputFormat,
};

/// Suppose this is your library
pub mod my_lib {
    use std::io;
    use std::path::Path;
    use std::process::ExitStatus;

    /// A function executing the crate's binary `cat`
    pub fn cat(file: &Path) -> io::Result<ExitStatus> {
        std::process::Command::new(env!("CARGO_BIN_EXE_cat"))
            .arg(file)
            .status()
    }
}

/// Create a file `/tmp/foo.txt` with some content
fn create_file() -> PathBuf {
    let path = PathBuf::from("/tmp/foo.txt");
    std::fs::write(&path, "some content").unwrap();
    path
}

#[library_benchmark]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}

library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);
fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
            .show_intermediate(true)
        );
    library_benchmark_groups = my_group
);
}

Running the above benchmark with cargo bench results in the following terminal output:

lib_bench_subprocess::my_group::bench_subprocess some:create_file()
  ## pid: 3141785 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
  Instructions:                        4467|N/A                  (*********)
  L1 Hits:                             6102|N/A                  (*********)
  L2 Hits:                               17|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                    6305|N/A                  (*********)
  Estimated Cycles:                   12697|N/A                  (*********)
  ## pid: 3141786 thread: 1 part: 1        |N/A
  Command:             target/release/cat /tmp/foo.txt
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## Total
  Instructions:                        4467|N/A                  (*********)
  L1 Hits:                             6102|N/A                  (*********)
  L2 Hits:                               17|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                    6305|N/A                  (*********)
  Estimated Cycles:                   12697|N/A                  (*********)

As expected, the cat subprocess is not measured and its metrics are zero, for the same reasons as in the initial thread measurement.

Measuring subprocesses using toggles

The great advantage over measuring threads is that each process has a main function which is not inlined by the compiler and can serve as a reliable hook for the --toggle-collect argument, so the following adaptation of the above benchmark will just work:

extern crate iai_callgrind;
mod my_lib {
use std::{io, path::Path, process::ExitStatus};
pub fn cat(_: &Path) -> io::Result<ExitStatus> {
   std::process::Command::new("some").status()
}}
fn create_file() -> PathBuf { PathBuf::from("some") }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;
use iai_callgrind::{
   library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
   OutputFormat,
};
#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--toggle-collect=cat::main"])
)]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}
library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);
fn main() {
main!(library_benchmark_groups = my_group);
}

producing the desired output:

lib_bench_subprocess::my_group::bench_subprocess some:create_file()
  ## pid: 3324117 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
  Instructions:                        4475|N/A                  (*********)
  L1 Hits:                             6112|N/A                  (*********)
  L2 Hits:                               14|N/A                  (*********)
  RAM Hits:                             187|N/A                  (*********)
  Total read+write:                    6313|N/A                  (*********)
  Estimated Cycles:                   12727|N/A                  (*********)
  ## pid: 3324119 thread: 1 part: 1        |N/A
  Command:             target/release/cat /tmp/foo.txt
  Instructions:                        4019|N/A                  (*********)
  L1 Hits:                             5575|N/A                  (*********)
  L2 Hits:                               12|N/A                  (*********)
  RAM Hits:                             167|N/A                  (*********)
  Total read+write:                    5754|N/A                  (*********)
  Estimated Cycles:                   11480|N/A                  (*********)
  ## Total
  Instructions:                        8494|N/A                  (*********)
  L1 Hits:                            11687|N/A                  (*********)
  L2 Hits:                               26|N/A                  (*********)
  RAM Hits:                             354|N/A                  (*********)
  Total read+write:                   12067|N/A                  (*********)
  Estimated Cycles:                   24207|N/A                  (*********)

Measuring subprocesses using client requests

Naturally, client requests can also be used to measure subprocesses. The callgrind client requests are added to the code of the cat binary:

extern crate iai_callgrind;
use std::fs::File;
use std::io::{copy, stdout, BufReader, BufWriter, Write};
use iai_callgrind::client_requests::callgrind;

fn main() {
fn main() {
    let mut args_iter = std::env::args().skip(1);
    let file_arg = args_iter.next().expect("File argument should be present");

    callgrind::toggle_collect();
    let file = File::open(file_arg).expect("Opening file should succeed");
    let stdout = stdout().lock();

    let mut writer = BufWriter::new(stdout);
    copy(&mut BufReader::new(file), &mut writer)
        .expect("Printing file to stdout should succeed");

    writer.flush().expect("Flushing writer should succeed");
    callgrind::toggle_collect();
}
}

For the purposes of this example, we decided that measuring the parsing of the command-line arguments is not interesting and excluded it from the collected metrics. The benchmark itself is reverted to its original state without the toggle argument:

extern crate iai_callgrind;
mod my_lib {
use std::{io, path::Path, process::ExitStatus};
pub fn cat(_: &Path) -> io::Result<ExitStatus> {
   std::process::Command::new("some").status()
}}
fn create_file() -> PathBuf { PathBuf::from("some") }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;
use iai_callgrind::{
   library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
   OutputFormat,
};
#[library_benchmark]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}
library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);
fn main() {
main!(library_benchmark_groups = my_group);
}

Now, running the benchmark shows

lib_bench_subprocess::my_group::bench_subprocess some:create_file()
  ## pid: 3421822 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
  Instructions:                        4467|N/A                  (*********)
  L1 Hits:                             6102|N/A                  (*********)
  L2 Hits:                               17|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                    6305|N/A                  (*********)
  Estimated Cycles:                   12697|N/A                  (*********)
  ## pid: 3421823 thread: 1 part: 1        |N/A
  Command:             target/release/cat /tmp/foo.txt
  Instructions:                        2429|N/A                  (*********)
  L1 Hits:                             3406|N/A                  (*********)
  L2 Hits:                                8|N/A                  (*********)
  RAM Hits:                             138|N/A                  (*********)
  Total read+write:                    3552|N/A                  (*********)
  Estimated Cycles:                    8276|N/A                  (*********)
  ## Total
  Instructions:                        6896|N/A                  (*********)
  L1 Hits:                             9508|N/A                  (*********)
  L2 Hits:                               25|N/A                  (*********)
  RAM Hits:                             324|N/A                  (*********)
  Total read+write:                    9857|N/A                  (*********)
  Estimated Cycles:                   20973|N/A                  (*********)

As expected, the metrics for the cat binary are a little bit lower since we skipped measuring the parsing of the command-line arguments.

Even more Examples

Have a look at the github repository. We test the library benchmark functionality of Iai-Callgrind with system tests in the private benchmark-tests package.

Each system test there can serve as an example, but for a fully documented and commented one, see here.

Binary Benchmarks

You want to start benchmarking your crate's binary? Best start with the Quickstart section.

Setting up binary benchmarks is very similar to library benchmarks, and it's a good idea to have a look at the library benchmark section of this guide, too.

You may then come back to the binary benchmarks section and go through the differences to library benchmarks.

If you need more examples, see here.

Important default behaviour

As in library benchmarks, the environment variables are cleared before running a binary benchmark. Have a look at the Configuration section if you want to change this behavior. Iai-Callgrind sometimes deviates from the Valgrind defaults, as shown below:

Iai-Callgrind            Valgrind (v3.23)
--trace-children=yes     --trace-children=no
--fair-sched=try         --fair-sched=no
--separate-threads=yes   --separate-threads=no
--cache-sim=yes          --cache-sim=no

As shown in the table above, the benchmarks run with cache simulation switched on. This adds run time to each benchmark. If you don't need the cache metrics and the estimation of cycles, you can easily switch off cache simulation, for example with:

#![allow(unused)]
fn main() {
extern crate iai_callgrind;
use iai_callgrind::BinaryBenchmarkConfig;

BinaryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
}

To switch off cache simulation for all benchmarks in the same file:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig
};

#[binary_benchmark]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(
    config = BinaryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
    binary_benchmark_groups = my_group
);
}

Quickstart

Suppose the crate's binary is called my-foo and this binary takes a file path as a positional argument. This first example shows the basic usage of the high-level api with the #[binary_benchmark] attribute:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};

#[binary_benchmark]
#[bench::some_id("foo.txt")]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(
    name = my_group;
    benchmarks = bench_binary
);

fn main() {
main!(binary_benchmark_groups = my_group);
}

If you want to try out this example with your crate's binary, put the above code into a file in $WORKSPACE_ROOT/benches/binary_benchmark.rs. Next, replace my-foo in env!("CARGO_BIN_EXE_my-foo") with the name of a binary of your crate.

Note that the env! macro is a Rust builtin macro and CARGO_BIN_EXE_<name> is documented here.

You should always use env!("CARGO_BIN_EXE_<name>") to determine the path to the binary of your crate. Do not use relative paths like target/release/my-foo since this might break your benchmarks in many ways. The environment variable does exactly the right thing and the usage is short and simple.

Lastly, adjust the argument of the Command and add the following to your Cargo.toml:

[[bench]]
name = "binary_benchmark"
harness = false

Running

cargo bench

presents you with something like the following:

binary_benchmark::my_group::bench_binary some_id:("foo.txt") -> target/release/my-foo foo.txt
  Instructions:              342129|N/A             (*********)
  L1 Hits:                   457370|N/A             (*********)
  L2 Hits:                      734|N/A             (*********)
  RAM Hits:                    4096|N/A             (*********)
  Total read+write:          462200|N/A             (*********)
  Estimated Cycles:          604400|N/A             (*********)

As opposed to library benchmarks, binary benchmarks have access to a low-level api. Here is pretty much the same benchmark as the high-level usage above, but written in the low-level api:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{BinaryBenchmark, BinaryBenchmarkGroup, Bench, binary_benchmark_group, main};

binary_benchmark_group!(
    name = my_group;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group.binary_benchmark(BinaryBenchmark::new("bench_binary")
            .bench(Bench::new("some_id")
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("foo.txt")
                    .build()
                )
            )
        )
    }
);

fn main() {
main!(binary_benchmark_groups = my_group);
}

If in doubt, use the high-level api. You can still migrate to the low-level api very easily if you really need to. The other way around is more involved.

Differences to library benchmarks

In this section we're going through the differences to library benchmarks. This assumes that you already know how to set up library benchmarks, and it is recommended to learn the very basics about library benchmarks, starting with Quickstart, Anatomy of a library benchmark and The macros in more detail. Then come back to this section.

Name changes

Coming from library benchmarks, names containing library change to the same name with library replaced by binary: the #[library_benchmark] attribute becomes #[binary_benchmark], library_benchmark_group! becomes binary_benchmark_group!, the config arguments take a BinaryBenchmarkConfig instead of a LibraryBenchmarkConfig, and so on.

A quick reference of available macros in binary benchmarks:

  • #[binary_benchmark] and its inner attributes #[bench] and #[benches]: The exact counterpart to the #[library_benchmark] attribute macro.
  • binary_benchmark_group!: Just the name of the macro has changed.
  • binary_benchmark_attribute!: An additional macro if you intend to migrate from the high-level to the low-level api.
  • main!: The same macro as in library benchmarks but the name of the library_benchmark_groups parameter changed to binary_benchmark_groups.

To see all macros in action have a look at the example below.

The return value of the benchmark function

Perhaps the most important difference is that the #[binary_benchmark] annotated function always needs to return an iai_callgrind::Command. Note that this function only builds the command which is going to be benchmarked but doesn't execute it yet. So, the code in this function does not contribute to the event counts of the actual benchmark.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};
use std::path::PathBuf;

#[binary_benchmark]
#[bench::foo("foo.txt")]
#[bench::bar("bar.json")]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    // We can put any code in this function which is needed to configure and
    // build the `Command`.
    let path = PathBuf::from(path);

    // Here, if the `path` ends with `.txt` we want to see
    // the `Stdout` output of the `Command` in the benchmark output. In all other 
    // cases, the `Stdout` of the `Command` is redirected to a `File` with the
    // same name as the input `path` but with the extension `out`.
    let stdout = if path.extension().unwrap() == "txt" {
        iai_callgrind::Stdio::Inherit
    } else {
        iai_callgrind::Stdio::File(path.with_extension("out"))
    };
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .stdout(stdout)
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

setup and teardown

Since we can put any code needed to build the Command into the function itself, the setup and teardown parameters of #[binary_benchmark], #[bench] and #[benches] work differently than in library benchmarks.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};

fn create_file() {
    std::fs::write("foo.txt", "some content").unwrap();
}

#[binary_benchmark]
#[bench::foo(args = ("foo.txt"), setup = create_file())]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

setup, which here is the expression create_file(), is not evaluated right away, and the return value of setup is not used as input for the function! Instead, the expression in setup is evaluated and executed just before the benchmarked Command is executed. Similarly, the teardown expression is executed after the Command has finished.

In the example above, setup always creates the same file and is pretty static. As you're used to from library benchmarks, it's possible to pass the same arguments to setup (or teardown) and the benchmark function by specifying just the path to a function instead of a function call:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};

fn create_file(path: &str) {
    std::fs::write(path, "some content").unwrap();
}

fn delete_file(path: &str) {
    std::fs::remove_file(path).unwrap();
}

#[binary_benchmark]
// Note the missing parentheses for `setup` of the function `create_file` which
// tells Iai-Callgrind to pass the `args` to the `setup` function AND the
// function `bench_binary`
#[bench::foo(args = ("foo.txt"), setup = create_file)]
// Same for `teardown`
#[bench::bar(args = ("bar.txt"), setup = create_file, teardown = delete_file)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

The Command's stdin and simulating piped input

The behaviour of the Stdin of the Command can be changed in almost the same way as the stdin of a std::process::Command, with the only difference that we use the enums iai_callgrind::Stdin and iai_callgrind::Stdio. These enums provide the variants Inherit (the equivalent of std::process::Stdio::inherit), Pipe (the equivalent of std::process::Stdio::piped) and so on. There's also File, which takes a PathBuf to the file which is used as Stdin for the Command. This corresponds to a redirection in the shell as in my-foo < path/to/file.
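
For example, to simulate the shell redirection my-foo < foo.txt, the Command's Stdin can be set to a file. A minimal sketch, assuming a crate binary named my-foo and an existing file foo.txt:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main, Stdin};
use std::path::PathBuf;

#[binary_benchmark]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        // The equivalent of the shell redirection `my-foo < foo.txt`
        .stdin(Stdin::File(PathBuf::from("foo.txt")))
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}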

Moreover, iai_callgrind::Stdin provides the Stdin::Setup variant specific to Iai-Callgrind:

Applications may change their behaviour if the input or the Stdin of the Command is coming from a pipe as in echo "some content" | my-foo. To be able to benchmark such cases, it is possible to use the output of setup to Stdout or Stderr as Stdin for the Command.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main, Stdin, Pipe};

fn setup_pipe() {
    println!(
        "The output to `Stdout` here will be the input or `Stdin` of the `Command`"
    );
}

#[binary_benchmark]
#[bench::foo(setup = setup_pipe())]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .stdin(Stdin::Setup(Pipe::Stdout))
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

Usually, setup, then the Command, and then teardown are executed sequentially, each waiting for the previous process to exit successfully (see also Configure the exit code of the Command). If Command::stdin is changed to Stdin::Setup, setup and the Command are executed in parallel, and Iai-Callgrind waits first for the Command to exit, then for setup. After the successful exit of setup, teardown is executed.

Since setup and Command are run in parallel if Stdin::Setup is used, it is sometimes necessary to delay the execution of the Command. Please see the delay chapter for more details.

Configuration

The configuration of binary benchmarks works the same way as in library benchmarks, with the name changing from LibraryBenchmarkConfig to BinaryBenchmarkConfig. Please see there for the basics. However, binary benchmarks have some additional configuration possibilities:

Delay the Command

Delaying the execution of the Command with Command::delay might be necessary if the setup is executed in parallel either with Command::setup_parallel or Command::stdin set to Stdin::Setup.

For example, if you have a server which needs to be started in the setup to be able to benchmark a client (in our example a crate's binary simply named client):

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use std::net::{SocketAddr, TcpListener};
use std::time::Duration;
use std::thread;

use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, Delay, DelayKind
};

const ADDRESS: &str = "127.0.0.1:31000";

fn setup_tcp_server() {
    println!("Waiting to start server...");
    thread::sleep(Duration::from_millis(300));

    println!("Starting server...");
    let listener = TcpListener::bind(
            ADDRESS.parse::<SocketAddr>().unwrap()
        ).unwrap();

    thread::sleep(Duration::from_secs(1));

    drop(listener);
    println!("Stopped server...");
}

#[binary_benchmark(setup = setup_tcp_server())]
fn bench_client() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_client"))
        .setup_parallel(true)
        .delay(
            Delay::new(DelayKind::TcpConnect(
                ADDRESS.parse::<SocketAddr>().unwrap(),
            ))
            .timeout(Duration::from_millis(500)),
        )
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_client);
fn main() {
main!(binary_benchmark_groups = my_group);
}

The server is started in the parallel setup function setup_tcp_server since Command::setup_parallel is set to true. The delay of the Command is configured with a Delay in Command::delay to wait for the tcp connection to become available. We also applied a timeout of 500 milliseconds with Delay::timeout, so if something goes wrong in the server and the tcp connection cannot be established, the benchmark exits with an error after 500 milliseconds instead of hanging forever. After the successful delay, the actual client is executed and benchmarked. After the exit of the client, Iai-Callgrind waits for the setup to exit successfully. Then, if present, the teardown function is executed.

Please see the library documentation for all possible DelayKinds and more details on the Delay.

Sandbox

The Sandbox is a temporary directory which is created before the execution of the setup and deleted after the teardown. setup, the Command and teardown are executed inside this temporary directory. This simply describes the order of execution; neither setup nor teardown needs to be present.

Why use a Sandbox?

A Sandbox can help mitigate differences in benchmark results on different machines. As long as $TMP_DIR is unset or set to /tmp, the temporary directory has a constant length on unix machines (except android, which uses /data/local/tmp). The directory itself is created with a constant-length but random name like /tmp/.a23sr8fk.

It is not implausible that an executable has different event counts just because the directory it is executed in has a path of different length. For example, if a member of your project has set up the project in /home/bob/workspace/our-project and runs the benchmarks in this directory, while the CI runs the benchmarks in /runner/our-project, the event counts might differ. If possible, the benchmarks should be run in a constant environment; clearing the environment variables is another such measure.

Other good reasons for using a Sandbox are convenience, e.g. if you create files during the setup and the Command run and do not want to delete them all manually. Or, maybe more importantly, if the Command is destructive and deletes files, it is usually safer to run such a Command in a temporary directory where it cannot cause damage to your or other file systems.

The Sandbox is deleted after the benchmark, regardless of whether the benchmark run was successful or not. Such cleanup is not guaranteed if you only rely on teardown, since teardown is only executed if the Command returns without error.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, Sandbox
};

fn create_file(path: &str) {
    std::fs::write(path, "some content").unwrap();
}

#[binary_benchmark]
#[bench::foo(
    args = ("foo.txt"),
    config = BinaryBenchmarkConfig::default().sandbox(Sandbox::new(true)),
    setup = create_file
)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

In this example, as part of the setup, the create_file function with the argument foo.txt is executed in the Sandbox before the Command is executed. The Command is executed in the same Sandbox and therefore the file foo.txt with the content some content exists thanks to the setup. After the execution of the Command, the Sandbox is completely removed, deleting all files created during setup, the Command execution (and teardown if it had been present in this example).

Since setup is run in the sandbox, you can't copy fixtures from your project's workspace into the sandbox that easily anymore. The Sandbox can be configured to copy fixtures into the temporary directory with Sandbox::fixtures:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, Sandbox
};

#[binary_benchmark]
#[bench::foo(
    args = ("foo.txt"),
    config = BinaryBenchmarkConfig::default()
        .sandbox(Sandbox::new(true)
            .fixtures(["benches/foo.txt"])),
)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

The above will copy the fixture file foo.txt in the benches directory into the sandbox root as foo.txt. Relative paths in Sandbox::fixtures are interpreted relative to the workspace root. In a multi-crate workspace this is the directory with the top-level Cargo.toml file. Paths in Sandbox::fixtures are not limited to files, they can be directories, too.

If you have more complex demands, you can access the workspace root via the environment variable _WORKSPACE_ROOT in setup and teardown. Suppose there is a fixture located in /home/the_project/foo_crate/benches/fixtures/foo.txt, with the_project being the workspace root and foo_crate a workspace member with the my-foo executable. If the command is expected to create a file bar.json which needs further inspection after the benchmarks have run, let's copy it into a temporary directory tmp (which may or may not exist) in foo_crate:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, Sandbox
};
use std::path::PathBuf;

fn copy_fixture(path: &str) {
    let workspace_root = PathBuf::from(std::env::var_os("_WORKSPACE_ROOT").unwrap());
    std::fs::copy(
        workspace_root.join("foo_crate").join("benches").join("fixtures").join(path),
        path
    ).unwrap();
}

// This function will fail if `bar.json` does not exist, which is fine as this
// file is expected to be created by `my-foo`. So, if this file does not exist,
// an error will occur and the benchmark will fail. Although benchmarks are not
// expected to test the correctness of the application, the `teardown` can be
// used to check postconditions for a successful command run.
fn copy_back(path: &str) {
    let workspace_root = PathBuf::from(std::env::var_os("_WORKSPACE_ROOT").unwrap());
    let dest_dir = workspace_root.join("foo_crate").join("tmp");
    if !dest_dir.exists() {
        std::fs::create_dir(&dest_dir).unwrap();
    }
    std::fs::copy(path, dest_dir.join(path)).unwrap();
}

#[binary_benchmark]
#[bench::foo(
    args = ("foo.txt"),
    config = BinaryBenchmarkConfig::default().sandbox(Sandbox::new(true)),
    setup = copy_fixture,
    teardown = copy_back("bar.json")
)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

Configure the exit code of the Command

Usually, if a Command exits with a non-zero exit code, the whole benchmark run fails and stops. If the benchmarked Command is expected to exit with a code different from 0, the expected exit code can be set with BinaryBenchmarkConfig::exit_with or Command::exit_with:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
     binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, ExitWith
};

#[binary_benchmark]
// Here, we set the expected exit code of `my-foo` to 2
#[bench::exit_with_2(
    config = BinaryBenchmarkConfig::default().exit_with(ExitWith::Code(2))
)]
// Here, we don't know the exact exit code but know it is different from 0 (=success)
#[bench::exit_with_failure(
    config = BinaryBenchmarkConfig::default().exit_with(ExitWith::Failure)
)]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);
fn main() {
main!(binary_benchmark_groups = my_group);
}

Low-level api

I'm not going into full detail of the low-level api here since it is fully documented in the api Documentation.

The basic structure

The entry point of the low-level api is the binary_benchmark_group! macro:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
     binary_benchmark, binary_benchmark_attribute, binary_benchmark_group, main,
     BinaryBenchmark, BinaryBenchmarkGroup, Bench
};

binary_benchmark_group!(
    name = my_group;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group.binary_benchmark(BinaryBenchmark::new("bench_binary")
            .bench(Bench::new("some_id")
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("foo.txt")
                    .build()
                )
            )
        )
    }
);

fn main() {
main!(binary_benchmark_groups = my_group);
}

The low-level api mirrors the high-level api, "structifying" the macros.

The binary_benchmark_group! macro now has a struct counterpart, the BinaryBenchmarkGroup. It cannot be instantiated directly. Instead, it is passed as an argument to the expression of the benchmarks parameter in a binary_benchmark_group!. You can choose any name instead of group; we just used group throughout the examples.

There's the shorter benchmarks = |group| /* ... */ instead of benchmarks = |group: &mut BinaryBenchmarkGroup| /* ... */. We use the more verbose variant in the examples because it is more informative for benchmarking starters.

Furthermore, the #[binary_benchmark] macro correlates with iai_callgrind::BinaryBenchmark and #[bench] with iai_callgrind::Bench. The parameters of the macros are now methods of the respective structs. The return value of the benchmark function, the iai_callgrind::Command, now has its counterpart in the iai_callgrind::Bench::command method.

Note there is no iai_callgrind::Benches struct, since specifying multiple commands with iai_callgrind::Bench::command behaves exactly the same way as the #[benches] attribute. So, the file parameter of #[benches] is part of iai_callgrind::Bench and can be used with the iai_callgrind::Bench::file function.
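
For example, a Bench with two commands corresponds to a #[benches] attribute with two arguments in the high-level api. This is just a sketch with the placeholder binary my-foo:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark_group, main, BinaryBenchmark, BinaryBenchmarkGroup, Bench
};

binary_benchmark_group!(
    name = my_group;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group.binary_benchmark(BinaryBenchmark::new("bench_binary")
            .bench(Bench::new("multiple")
                // Multiple commands in a single `Bench` behave like the
                // `#[benches]` attribute in the high-level api
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("foo.txt")
                    .build())
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("bar.txt")
                    .build())
            )
        )
    }
);

fn main() {
main!(binary_benchmark_groups = my_group);
}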

Intermixing high-level and low-level api

It is recommended to start with the high-level api using the #[binary_benchmark] attribute, since you can fall back to the low-level api in a few steps with the binary_benchmark_attribute! macro as shown below. The other way around is much more involved.

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
     binary_benchmark, binary_benchmark_attribute, binary_benchmark_group, main,
     BinaryBenchmark, BinaryBenchmarkGroup, Bench
};

#[binary_benchmark]
#[bench::some_id("foo")]
fn attribute_benchmark(arg: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-binary"))
        .arg(arg)
        .build()
}

binary_benchmark_group!(
    name = low_level;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group
            .binary_benchmark(binary_benchmark_attribute!(attribute_benchmark))
            .binary_benchmark(
                BinaryBenchmark::new("low_level_benchmark")
                    .bench(
                        Bench::new("some_id").command(
                            iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-binary"))
                                .arg("bar")
                                .build()
                        )
                    )
            )
    }
);

fn main() {
main!(binary_benchmark_groups = low_level);
}

As shown above, there's no need to transcribe the function attribute_benchmark with the #[binary_benchmark] attribute into the low-level api structures. Just keep it as it is and add it to the group with group.binary_benchmark(binary_benchmark_attribute!(attribute_benchmark)). That's it! You can continue hacking on your benchmarks in the low-level api.

More examples needed?

As with library benchmarks, I refer you here to the github repository. The binary benchmark functionality of Iai-Callgrind is tested with system tests in the private benchmark-tests package.

Each system test there can serve as an example, but for a fully documented and commented one see here.

Performance Regressions

With Iai-Callgrind you can define limits for each event kind over which a performance regression is assumed. Per default, Iai-Callgrind does not perform any regression checks; you have to opt in with a RegressionConfig at benchmark level with a LibraryBenchmarkConfig or BinaryBenchmarkConfig, or at a global level with Command-line arguments or Environment variables.

Define a performance regression

A performance regression check consists of an EventKind and a percentage limit. If the limit is positive, a regression is assumed if the metric increases by more than this percentage compared to the previous run. If the limit is negative, a regression is assumed if the metric decreases by more than this percentage.

The default EventKind is EventKind::Ir with a value of +10%.
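
As a small sketch of both cases (assuming the EstimatedCycles event kind listed in the command-line reference further below), a positive limit fails the check when the metric increases by more than the given percentage, a negative limit when it decreases by more than it:

extern crate iai_callgrind;
use iai_callgrind::{EventKind, RegressionConfig};

fn main() {
// Fail if instructions increase by more than 5% or if estimated cycles
// decrease by more than 10% compared to the previous run
let _regression_config = RegressionConfig::default()
    .limits([(EventKind::Ir, 5.0), (EventKind::EstimatedCycles, -10.0)]);
}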

For example, in a Library Benchmark, define a limit of +5% for the total instructions executed (the Ir event kind) in all benchmarks of this file:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    RegressionConfig, EventKind
};
use std::hint::black_box;

#[library_benchmark]
fn bench_library() -> Vec<i32> {
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .regression(
            RegressionConfig::default()
                .limits([(EventKind::Ir, 5.0)])
        );
    library_benchmark_groups = my_group
);
}

Now, if the comparison of the Ir events of the current bench_library benchmark run with the previous run results in an increase of over 5%, the benchmark fails. Please also have a look at the api docs for further configuration options.

Running the benchmark from above the first time results in the following output:

my_benchmark::my_group::bench_library
  Instructions:                 215|N/A             (*********)
  L1 Hits:                      288|N/A             (*********)
  L2 Hits:                        0|N/A             (*********)
  RAM Hits:                       7|N/A             (*********)
  Total read+write:             295|N/A             (*********)
  Estimated Cycles:             533|N/A             (*********)

Let's assume there's a change in my_lib::bubble_sort which has increased the instruction counts. Running the benchmark again then results in output similar to this:

my_benchmark::my_group::bench_library
  Instructions:                 281|215             (+30.6977%) [+1.30698x]
  L1 Hits:                      374|288             (+29.8611%) [+1.29861x]
  L2 Hits:                        0|0               (No change)
  RAM Hits:                       8|7               (+14.2857%) [+1.14286x]
  Total read+write:             382|295             (+29.4915%) [+1.29492x]
  Estimated Cycles:             654|533             (+22.7017%) [+1.22702x]
Performance has regressed: Instructions (281 > 215) regressed by +30.6977% (>+5.00000)
iai_callgrind_runner: Error: Performance has regressed.
error: bench failed, to rerun pass `-p the-crate --bench my_benchmark`

Caused by:
  process didn't exit successfully: `/path/to/your/project/target/release/deps/my_benchmark-a9b36fec444944bd --bench` (exit status: 1)
error: Recipe `bench-test` failed on line 175 with exit code 1

Which event to choose to measure performance regressions?

If in doubt, the definite answer is Ir (instructions executed). If the Ir event counts decrease noticeably, the function (binary) runs faster. The inverse statement is also true: if the Ir counts increase noticeably, there's a slowdown of the function (binary).

These statements are not so easy to transfer to Estimated Cycles and the other event counts. But, depending on the scenario and the function (binary) under test, it can be reasonable to define more regression checks.
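
For example, such additional limits can also be set on the command line (see the command-line reference further below) without touching the benchmark code:

cargo bench -- --regression='ir=5, EstimatedCycles=10'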

Who actually uses instructions to measure performance?

The ones known to the author of this humble guide are

  • SQLite: They mainly use cpu instructions to measure performance improvements (and regressions).
  • The benchmarks of the rustc compiler also rely heavily on instruction counts, but they use cache metrics and cycles as well.

If you know of others, please feel free to add them to this list.

Other Valgrind Tools

In addition to the default Callgrind benchmarks, you can use the Iai-Callgrind framework to run other Valgrind profiling tools like DHAT, Massif and the experimental BBV, but also Memcheck, Helgrind and DRD if you need to check the memory and thread safety of the benchmarked code. See also the Valgrind User Manual for more details and command-line arguments. The additional tools can be specified in a LibraryBenchmarkConfig or BinaryBenchmarkConfig. For example, to run DHAT for all library benchmarks in addition to Callgrind:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, 
    Tool, ValgrindTool
};
use std::hint::black_box;

#[library_benchmark]
fn bench_library() -> Vec<i32> {
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .tool(Tool::new(ValgrindTool::DHAT));
    library_benchmark_groups = my_group
);
}

All tools which produce an ERROR SUMMARY (Memcheck, DRD, Helgrind) have --error-exitcode=201 set, so if there are any errors, the benchmark run fails with 201. You can overwrite this default with

#![allow(unused)]
fn main() {
extern crate iai_callgrind;
use iai_callgrind::{Tool, ValgrindTool};

Tool::new(ValgrindTool::Memcheck).args(["--error-exitcode=0"]);
}

which would restore the default of 0 from valgrind.

Valgrind Client Requests

Iai-Callgrind ships with its own interface to Valgrind's Client Request Mechanism. Iai-Callgrind's client requests have zero overhead (relative to the "C" implementation of Valgrind) on many targets which are also natively supported by valgrind. In short, Iai-Callgrind provides a complete and performant implementation of Valgrind Client Requests.

Installation

Client requests are deactivated by default but can be activated with the client_requests feature.

[dev-dependencies]
iai-callgrind = { version = "0.14.0", features = ["client_requests"] }

If you need the client requests in your production code, you don't want them to do anything when the code is not running under valgrind in an Iai-Callgrind benchmark. You can achieve that by adding Iai-Callgrind with the client_requests_defs feature to your runtime dependencies and with the client_requests feature to your dev-dependencies like so:

[dependencies]
iai-callgrind = { version = "0.14.0", default-features = false, features = [
    "client_requests_defs"
] }

[dev-dependencies]
iai-callgrind = { version = "0.14.0", features = ["client_requests"] }

With just the client_requests_defs feature activated, the client requests compile down to nothing and don't add any overhead to your production code. This feature simply provides the "definitions", i.e. the method signatures and macros without a body. Only with the client_requests feature activated are they actually executed. Note that the client requests do not depend on any other part of Iai-Callgrind, so you could even use the client requests without the rest of Iai-Callgrind.

When building Iai-Callgrind with client requests, the valgrind header files must exist in your standard include path (most of the time /usr/include). This is usually the case if you've installed valgrind with your distribution's package manager. If not, you can point the IAI_CALLGRIND_VALGRIND_INCLUDE or IAI_CALLGRIND_<triple>_VALGRIND_INCLUDE environment variables to the include path. So, if the headers can be found in /home/foo/repo/valgrind/{valgrind.h, callgrind.h, ...}, the correct include path would be IAI_CALLGRIND_VALGRIND_INCLUDE=/home/foo/repo (not /home/foo/repo/valgrind).
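
For example, given the header location from above, building and running the benchmarks could then look like this:

IAI_CALLGRIND_VALGRIND_INCLUDE=/home/foo/repo cargo bench --bench my_benchmark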

Usage

Use them in your code for example like so:

extern crate iai_callgrind;
use iai_callgrind::client_requests;

fn main() {
    // Start callgrind event counting if not already started earlier
    client_requests::callgrind::start_instrumentation();

    // do something important

    // Switch event counting off
    client_requests::callgrind::stop_instrumentation();
}

Library Benchmarks

In library benchmarks you might need to use EntryPoint::None in order to make the client requests work as expected:

extern crate iai_callgrind;
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

pub mod my_lib {
     #[inline(never)]
     fn bubble_sort(input: Vec<i32>) -> Vec<i32> {
         // The algorithm
         input
     }

     pub fn pre_bubble_sort(input: Vec<i32>) -> Vec<i32> {
         println!("Doing something before the function call");
         iai_callgrind::client_requests::callgrind::start_instrumentation();

         let result = bubble_sort(input);

         iai_callgrind::client_requests::callgrind::stop_instrumentation();
         result
     }
}

#[library_benchmark]
#[bench::small(vec![3, 2, 1])]
#[bench::bigger(vec![5, 4, 3, 2, 1])]
fn bench_function(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::pre_bubble_sort(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_function);
fn main() {
main!(library_benchmark_groups = my_group);
}

The default EntryPoint sets --toggle-collect to the benchmark function (here bench_function) and --collect-at-start=no. So, Callgrind starts collecting events when entering the benchmark function, not at the moment start_instrumentation is called. This behaviour can be remedied with EntryPoint::None:

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    client_requests, EntryPoint
};
use std::hint::black_box;

pub mod my_lib {
     #[inline(never)]
     fn bubble_sort(input: Vec<i32>) -> Vec<i32> {
         // The algorithm
         input
     }

     pub fn pre_bubble_sort(input: Vec<i32>) -> Vec<i32> {
         println!("Doing something before the function call");
         iai_callgrind::client_requests::callgrind::start_instrumentation();

         let result = bubble_sort(input);

         iai_callgrind::client_requests::callgrind::stop_instrumentation();
         result
     }
}

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--collect-at-start=no"])
        .entry_point(EntryPoint::None)
)]
#[bench::small(vec![3, 2, 1])]
#[bench::bigger(vec![5, 4, 3, 2, 1])]
fn bench_function(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::pre_bubble_sort(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_function);
fn main() {
main!(library_benchmark_groups = my_group);
}

As the standard toggle is now switched off and the automatically added option --collect-at-start=no is omitted, too, you must specify --collect-at-start=no yourself, as done above with LibraryBenchmarkConfig::callgrind_args.

Please see the docs for more details!

Callgrind Flamegraphs

Flamegraphs are opt-in and can be created if you pass a FlamegraphConfig to the BinaryBenchmarkConfig or LibraryBenchmarkConfig. Callgrind flamegraphs are meant as a complement to valgrind's visualization tools callgrind_annotate and kcachegrind.

For example, to create all kinds of flamegraphs for all benchmarks in a library benchmark file:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    FlamegraphConfig
};
use std::hint::black_box;

#[library_benchmark]
fn bench_library() -> Vec<i32> {
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
main!(
    config = LibraryBenchmarkConfig::default()
        .flamegraph(FlamegraphConfig::default());
    library_benchmark_groups = my_group
);
}

The produced flamegraph *.svg files are located next to the respective callgrind output file in the target/iai directory.

Regular Flamegraphs

Regular callgrind flamegraphs show the inclusive costs for functions and a single EventKind (default is EventKind::Ir), similar to callgrind_annotate. Suppose the example from above is stored in a benchmark iai_callgrind_benchmark:

Regular Flamegraph

If you open this image in a new tab, you can play around with the svg.

Differential Flamegraphs

Differential flamegraphs facilitate a deeper understanding of the code sections which cause a bottleneck or a performance regression.

Differential Flamegraph

We simulated a small change in bubble_sort and in the differential flamegraph you can spot fairly easily where the increase of Instructions is happening.

(Experimental) Create flamegraphs for multi-threaded/multi-process benchmarks

Note the following only affects flamegraphs of multi-threaded/multi-process benchmarks and benchmarks which produce multiple parts with a total over all sub-metrics.

Currently, Iai-Callgrind creates the flamegraphs only for the total over all threads/parts and subprocesses. This leads to complications since the call graph cannot be fully recovered just by examining each thread/subprocess separately. So, the total metrics in the flamegraphs might not be the same as the total metrics shown in the terminal output. If in doubt, the terminal output shows the correct metrics.

Basic usage

It's possible to pass arguments to Iai-Callgrind separated by -- (cargo bench -- ARGS). If you're running into the error Unrecognized Option, see Troubleshooting. For a complete rundown of possible arguments, execute cargo bench --bench <benchmark> -- --help. Almost all command-line arguments have a corresponding environment variable. The environment variables which don't have a corresponding command-line argument are IAI_CALLGRIND_COLOR and IAI_CALLGRIND_LOG, both described in the section about controlling the output of Iai-Callgrind below.

The command-line arguments

High-precision and consistent benchmarking framework/harness for Rust

Boolish command line arguments take also one of `y`, `yes`, `t`, `true`, `on`,
`1`
instead of `true` and one of `n`, `no`, `f`, `false`, `off`, and `0` instead of
`false`

Usage: cargo bench ... [BENCHNAME] -- [OPTIONS]

Arguments:
  [BENCHNAME]
          If specified, only run benches containing this string in their names

          Note that a benchmark name might differ from the benchmark file name.

          [env: IAI_CALLGRIND_FILTER=]

Options:
      --callgrind-args <CALLGRIND_ARGS>
          The raw arguments to pass through to Callgrind

          This is a space separated list of command-line-arguments specified as
          if they were
          passed directly to valgrind.

          Examples:
            * --callgrind-args=--dump-instr=yes
            * --callgrind-args='--dump-instr=yes --collect-systime=yes'

          [env: IAI_CALLGRIND_CALLGRIND_ARGS=]

      --save-summary[=<SAVE_SUMMARY>]
          Save a machine-readable summary of each benchmark run in json format
          next to the usual benchmark output

          [env: IAI_CALLGRIND_SAVE_SUMMARY=]

          Possible values:
          - json:        The format in a space optimal json representation
          without newlines
          - pretty-json: The format in pretty printed json

      --allow-aslr[=<ALLOW_ASLR>]
          Allow ASLR (Address Space Layout Randomization)

          If possible, ASLR is disabled on platforms that support it (linux,
          freebsd) because ASLR could noise up the callgrind cache simulation results a
          bit. Setting this option to true runs all benchmarks with ASLR enabled.

          See also
          <https://docs.kernel.org/admin-guide/sysctl/kernel.html?
          highlight=randomize_va_space#randomize-va-space>

          [env: IAI_CALLGRIND_ALLOW_ASLR=]
          [possible values: true, false]

      --regression <REGRESSION>
          Set performance regression limits for specific `EventKinds`

          This is a `,` separate list of EventKind=limit (key=value) pairs with
          the limit being a positive or negative percentage. If positive, a performance
          regression check for this `EventKind` fails if the limit is exceeded. If
          negative, the regression check fails if the value comes below the limit. The
          `EventKind` is matched case-insensitive. For a list of valid `EventKinds` see
          the docs:
          <https://docs.rs/iai-callgrind/latest/iai_callgrind/enum.EventKind.html>

          Examples: --regression='ir=0.0' or --regression='ir=0,
          EstimatedCycles=10'

          [env: IAI_CALLGRIND_REGRESSION=]

      --regression-fail-fast[=<REGRESSION_FAIL_FAST>]
          If true, the first failed performance regression check fails the
          whole benchmark run

          This option requires `--regression=...` or
          `IAI_CALLGRIND_REGRESSION=...` to be present.

          [env: IAI_CALLGRIND_REGRESSION_FAIL_FAST=]
          [possible values: true, false]

      --save-baseline[=<SAVE_BASELINE>]
          Compare against this baseline if present and then overwrite it

          [env: IAI_CALLGRIND_SAVE_BASELINE=]

      --baseline[=<BASELINE>]
          Compare against this baseline if present but do not overwrite it

          [env: IAI_CALLGRIND_BASELINE=]

      --load-baseline[=<LOAD_BASELINE>]
          Load this baseline as the new data set instead of creating a new one

          [env: IAI_CALLGRIND_LOAD_BASELINE=]

      --output-format <OUTPUT_FORMAT>
          The terminal output format in default human-readable format or in
          machine-readable json format

          # The JSON Output Format

          The json terminal output schema is the same as the schema with the
          `--save-summary` argument when saving to a `summary.json` file. All other
          output than the json output goes to stderr and only the summary output goes to
          stdout. When not printing pretty json, each line is a dictionary summarizing a
          single benchmark. You can combine all lines (benchmarks) into an array for
          example with `jq`

          `cargo bench -- --output-format=json | jq -s`

          which transforms `{...}\n{...}` into `[{...},{...}]`

          [env: IAI_CALLGRIND_OUTPUT_FORMAT=]
          [default: default]
          [possible values: default, json, pretty-json]

      --separate-targets[=<SEPARATE_TARGETS>]
          Separate iai-callgrind benchmark output files by target

          The default output path for files created by iai-callgrind and
          valgrind during the benchmark is


          `target/iai/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID`.

          This can be problematic if you're running the benchmarks not only for
          a single target because you end up comparing the benchmark runs with the wrong
          targets. Setting this option changes the default output path to


          `target/iai/$TARGET/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/
              $BENCH_FUNCTION.$BENCH_ID`

          Although not as comfortable and strict, you could achieve a
          separation by target also with baselines and a combination of
          `--save-baseline=$TARGET` and `--baseline=$TARGET` if you prefer having all
          files of a single $BENCH in the same directory.

          [env: IAI_CALLGRIND_SEPARATE_TARGETS=]
          [default: false]
          [possible values: true, false]

      --home <HOME>
          Specify the home directory of iai-callgrind benchmark output files

          All output files are per default stored under the
          `$PROJECT_ROOT/target/iai` directory. This option lets you customize this
          home directory, and it will be created if it doesn't exist.

          [env: IAI_CALLGRIND_HOME=]

      --nocapture[=<NOCAPTURE>]
          Don't capture terminal output of benchmarks

          Possible values are one of [true, false, stdout, stderr].

          This option is currently restricted to the `callgrind` run of
          benchmarks. The output of additional tool runs like DHAT, Memcheck, ... is
          still captured, to prevent showing the same output of benchmarks multiple
          times. Use `IAI_CALLGRIND_LOG=info` to also show captured and logged output.

          If no value is given, the default missing value is `true` and doesn't
          capture stdout and stderr. Besides `true` or `false` you can specify the
          special values `stdout` or `stderr`. If `--nocapture=stdout` is given, the
          output to `stdout` won't be captured and the output to `stderr` will be
          discarded. Likewise, if `--nocapture=stderr` is specified, the output to
          `stderr` won't be captured and the output to `stdout` will be discarded.

          [env: IAI_CALLGRIND_NOCAPTURE=]
          [default: false]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Comparing with baselines

Usually, two consecutive benchmark runs let Iai-Callgrind compare these two runs. It's sometimes desirable to compare the current benchmark run against a static reference, instead. For example, if you're working longer on the implementation of a feature, you may wish to compare against a baseline from another branch or the commit from which you started off hacking on your new feature to make sure you haven't introduced performance regressions. Iai-Callgrind offers such custom baselines. If you are familiar with criterion.rs, the following command line arguments should also be very familiar to you:

  • --save-baseline=NAME (env: IAI_CALLGRIND_SAVE_BASELINE): Compare against the NAME baseline if present and then overwrite it.
  • --baseline=NAME (env: IAI_CALLGRIND_BASELINE): Compare against the NAME baseline without overwriting it
  • --load-baseline=NAME (env: IAI_CALLGRIND_LOAD_BASELINE): Load the NAME baseline as the new data set instead of creating a new one. This option needs also --baseline=NAME to be present.

If NAME is not present, NAME defaults to default.

For example to create a static reference from the main branch and compare it:

git checkout main
cargo bench --bench <benchmark> -- --save-baseline=main
git checkout feature
# ... HACK ... HACK
cargo bench --bench <benchmark> -- --baseline=main

Sticking to the above execution sequence,

cargo bench --bench my_benchmark -- --save-baseline=main

prints something like the following with an additional Baselines line in the output:

my_benchmark::my_group::bench_library
  Baselines:                   main|main
  Instructions:                 280|N/A             (*********)
  L1 Hits:                      374|N/A             (*********)
  L2 Hits:                        1|N/A             (*********)
  RAM Hits:                       6|N/A             (*********)
  Total read+write:             381|N/A             (*********)
  Estimated Cycles:             589|N/A             (*********)

After you've made some changes to your code, running

cargo bench --bench my_benchmark -- --baseline=main

prints something like the following:

my_benchmark::my_group::bench_library
  Baselines:                       |main
  Instructions:                 214|280             (-23.5714%) [-1.30841x]
  L1 Hits:                      287|374             (-23.2620%) [-1.30314x]
  L2 Hits:                        1|1               (No change)
  RAM Hits:                       6|6               (No change)
  Total read+write:             294|381             (-22.8346%) [-1.29592x]
  Estimated Cycles:             502|589             (-14.7708%) [-1.17331x]
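
To compare two stored baselines without taking a new measurement, --load-baseline can be combined with --baseline, for example like in the following sketch based on the option descriptions above:

# Measure the feature branch and store the result under the name `feature`
cargo bench --bench my_benchmark -- --save-baseline=feature
# Compare the stored `feature` baseline against the stored `main` baseline
# without running the benchmarks again
cargo bench --bench my_benchmark -- --load-baseline=feature --baseline=main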

Controlling the output of Iai-Callgrind

This section describes command-line options and environment variables which influence the terminal, file and logging output of Iai-Callgrind.

Customize the output directory

All output files of Iai-Callgrind are usually stored using the following scheme:

$WORKSPACE_ROOT/target/iai/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID

This directory structure can partly be changed with the following options.

Callgrind Home

Per default, all benchmark output files are stored under the $WORKSPACE_ROOT/target/iai directory tree. This home directory can be changed with the IAI_CALLGRIND_HOME environment variable or the command-line argument --home. The command-line argument overwrites the value of the environment variable. For example to store all files under the /tmp/iai-callgrind directory you can use IAI_CALLGRIND_HOME=/tmp/iai-callgrind or cargo bench -- --home=/tmp/iai-callgrind.

Separate targets

If you're running the benchmarks on different targets, it's necessary to separate the output files of the benchmark runs per target or else you could end up comparing the benchmarks with the wrong target, leading to strange results. You can achieve this with different baselines per target, but it's much less painful to separate the output files by target with the --separate-targets command-line argument or by setting the environment variable IAI_CALLGRIND_SEPARATE_TARGETS=yes. The output directory structure changes from

target/iai/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID

to

target/iai/$TARGET_TRIPLE/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID

For example, assuming the library benchmark file name is bench_file in the package my_package:

extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

#[library_benchmark]
#[bench::short(vec![4, 3, 2, 1])]
fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(values))
}

library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort);

fn main() {
main!(library_benchmark_groups = my_group);
}

Without --separate-targets:

target/iai/my_package/bench_file/my_group/bench_bubble_sort.short

and with --separate-targets assuming you're running the benchmark on the x86_64-unknown-linux-gnu target:

target/iai/x86_64-unknown-linux-gnu/my_package/bench_file/my_group/bench_bubble_sort.short

Machine-readable output

With --output-format=default|json|pretty-json (env: IAI_CALLGRIND_OUTPUT_FORMAT) you can change the terminal output format to the machine-readable json format. The json schema fully describing the json output is stored in summary.v2.schema.json. Each line of json output (if not pretty-json) is a summary of a single benchmark, and you may want to combine all benchmarks in an array. You can do so for example with jq

cargo bench -- --output-format=json | jq -s

which transforms {...}\n{...} into [{...},{...}].

Instead of, or in addition to changing the terminal output, it's possible to save a summary file for each benchmark with --save-summary=json|pretty-json (env: IAI_CALLGRIND_SAVE_SUMMARY). The summary.json files are stored next to the usual benchmark output files in the target/iai directory.
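
For example, to save a pretty printed summary.json for each benchmark in addition to the regular terminal output:

cargo bench -- --save-summary=pretty-json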

Showing terminal output of benchmarks

Per default, all terminal output of the benchmark function, setup and teardown is captured and therefore not shown during a benchmark run.

Using the log level

The most basic way to show any captured output is to use IAI_CALLGRIND_LOG=info. Note that this also includes a lot of other logging output.

Tell Iai-Callgrind to not capture the output

A nicer possibility is to tell Iai-Callgrind not to capture output with the --nocapture (env: IAI_CALLGRIND_NOCAPTURE) option. This is currently restricted to the callgrind run to prevent showing the same output multiple times. So, any terminal output of other tool runs is still captured.

The --nocapture flag takes the special values stdout and stderr in addition to true and false:

--nocapture=true|false|stdout|stderr

In the --nocapture=stdout case, terminal output to stdout is not captured and shown during the benchmark run but output to stderr is discarded. Likewise, --nocapture=stderr shows terminal output to stderr but discards output to stdout.

Let's take as an example a library benchmark benches/my_benchmark.rs:

extern crate iai_callgrind;
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

fn print_to_stderr(value: u64) {
    eprintln!("Error output during teardown: {value}");
}

fn add_10_and_print(value: u64) -> u64 {
    let value = value + 10;
    println!("Output to stdout: {value}");

    value
}

#[library_benchmark]
#[bench::some_id(args = (10), teardown = print_to_stderr)]
fn bench_library(value: u64) -> u64 {
    black_box(add_10_and_print(value))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);
fn main() {
main!(library_benchmark_groups = my_group);
}

If the above benchmark is run with cargo bench --bench my_benchmark -- --nocapture, the output of Iai-Callgrind will look like this:

my_benchmark::my_group::bench_library some_id:10
Output to stdout: 20
Error output during teardown: 20
- end of stdout/stderr
  Instructions:                 851|N/A             (*********)
  L1 Hits:                     1193|N/A             (*********)
  L2 Hits:                        5|N/A             (*********)
  RAM Hits:                      66|N/A             (*********)
  Total read+write:            1264|N/A             (*********)
  Estimated Cycles:            3528|N/A             (*********)

Everything between the headline and the - end of stdout/stderr line is output from your benchmark. The - end of stdout/stderr line changes depending on the options you have given. For example in the --nocapture=stdout case this line indicates your chosen option with - end of stdout.

Note that independently of the value of the --nocapture option, all logging output of a valgrind tool itself is stored in files in the output directory of the benchmark. Since Iai-Callgrind needs the logging output of valgrind tools stored in files, there is no option to disable the creation of these log files. But, if anything goes sideways you might be glad to have the log files around.

Changing the color output

The terminal output is colored per default but follows the value for the IAI_CALLGRIND_COLOR environment variable. If IAI_CALLGRIND_COLOR is not set, CARGO_TERM_COLOR is also tried. Accepted values are:

always, never, auto (default).

So, colors can be disabled by setting IAI_CALLGRIND_COLOR=never or CARGO_TERM_COLOR=never.
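
For example, to disable colors for a single benchmark run:

IAI_CALLGRIND_COLOR=never cargo bench --bench my_benchmark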

Changing the logging output

Iai-Callgrind uses env_logger and the default logging level WARN. To set the logging level to something different, set the environment variable IAI_CALLGRIND_LOG for example to IAI_CALLGRIND_LOG=DEBUG. Accepted values are:

error, warn (default), info, debug, trace.
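
For example, to run the benchmarks with debug level logging enabled:

IAI_CALLGRIND_LOG=debug cargo bench --bench my_benchmark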

The logging output is colored per default but follows the Color settings.

See also the documentation of env_logger.

I'm getting the error Sentinel ... not found

You've most likely disabled the creation of debug symbols in your cargo bench profile. This can originate from an option you've added to the release profile, since the bench profile inherits from the release profile. For example, if you've added strip = true to your release profile (which is perfectly fine), you need to disable this option in your bench profile to be able to run Iai-Callgrind benchmarks.
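
A minimal sketch of what this could look like in Cargo.toml, assuming you want to keep strip = true for releases while making debug symbols available to the benchmarks:

[profile.release]
strip = true

[profile.bench]
# Undo the `strip` setting inherited from the release profile and make sure
# debug symbols are available for the benchmarks
strip = false
debug = true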

See also the Debug Symbols section in Installation/Prerequisites.

Running cargo bench results in an "Unrecognized Option" error

For

cargo bench -- --some-valid-arg

to work you can either specify the benchmark with --bench BENCHMARK, for example

cargo bench --bench my_iai_benchmark -- --callgrind-args="--collect-bus=yes"

or add the following to your Cargo.toml:

[lib]
bench = false

and if you have binaries

[[bin]]
name = "my-binary"
path = "src/bin/my-binary.rs"
bench = false

Setting bench = false disables the creation of the implicit default libtest harness, which is added even if you haven't used #[bench] functions in your library or binary. Naturally, the default harness doesn't know about the Iai-Callgrind arguments and aborts execution, printing the Unrecognized Option error.

If you cannot or don't want to add bench = false to your Cargo.toml, you can alternatively use environment variables. There is a corresponding environment variable for every command-line argument.
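
For example, instead of passing --callgrind-args on the command line as above, the corresponding environment variable can be used:

IAI_CALLGRIND_CALLGRIND_ARGS="--collect-bus=yes" cargo bench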

Comparison of Iai-Callgrind with Criterion-rs

This is a comparison with Criterion-rs but some of the points in Pros and Cons also apply to other wall-clock time based benchmarking frameworks.

Iai-Callgrind Pros:

  • Iai-Callgrind can give answers that are repeatable to 7 or more significant digits. In comparison, actual (wall-clock) run times are scarcely repeatable beyond one significant digit.

    This allows you to implement and measure "microoptimizations". Typical microoptimizations reduce the number of CPU cycles by 0.1% or 0.05% or even less. Such improvements are impossible to measure with real-world timings. But hundreds or thousands of microoptimizations add up, resulting in measurable real-world performance gains.1

  • Iai-Callgrind can work reliably in noisy environments especially in CI environments from providers like GitHub Actions or Travis-CI, where Criterion-rs cannot.

  • The benchmark api of Iai-Callgrind is simple, intuitive and allows for a much more concise and clearer structure of benchmarks.

  • Iai-Callgrind can benchmark functions in binary crates.

  • Iai-Callgrind can benchmark private functions.

  • Although Callgrind adds runtime overhead, running each benchmark exactly once is still usually much faster than Criterion-rs' statistical measurements.

  • Criterion-rs creates plots and graphs about the averages, medians etc. which add considerable time to each benchmark run. Iai-Callgrind doesn't need any of these plots, since it can collect all its metrics in a single run.

  • Iai-Callgrind generates profile output from the benchmark without further effort.

  • With Iai-Callgrind you have native access to all the possibilities of all Valgrind tools, including Valgrind Client Requests.

Iai-Callgrind/Criterion-rs Mixed:

  • Although it is usually not significant, due to the high precision of the Iai-Callgrind measurements, changes in the benchmarks themselves, like adding a benchmark case, can have an effect on the other benchmarks. Iai-Callgrind can only try to reduce these effects to a minimum but never completely eliminate them. Criterion-rs does not have this problem because it cannot detect such small changes.

Iai-Callgrind Cons:

  • Iai-Callgrind's measurements merely correlate with wall-clock time. Wall-clock time is an obvious choice in many cases because it corresponds to what users perceive and Criterion-rs measures it directly.
  • Iai-Callgrind can only be used on platforms supported by Valgrind. Notably, this does not include Windows.
  • Iai-Callgrind needs additional binaries, valgrind and the iai-callgrind-runner. The version of the runner needs to be in sync with the iai-callgrind library. Criterion-rs is only a library and the installation is usually simpler.

Especially due to the first point in the Cons, I think it is still necessary to run wall-clock time benchmarks and use Criterion-rs in conjunction with Iai-Callgrind. But in CI and for performance regression checks, you shouldn't use Criterion-rs or other wall-clock time based benchmarks at all.

Comparison of Iai-Callgrind with Iai

This is a comparison with Iai, from which Iai-Callgrind was forked over a year ago.

Iai-Callgrind Pros:

  • Iai-Callgrind is actively maintained.

  • The benchmark api of Iai-Callgrind is simple, intuitive and allows for a much more concise and clearer structure of benchmarks.

  • More stable metrics, because the benchmark function is virtually encapsulated by Callgrind, which separates the benchmarked code from the surrounding code.

  • Iai-Callgrind excludes setup code from the metrics natively.

  • The Callgrind output files are much more focused on the benchmark function and the function under test than the Cachegrind output files that Iai produces. The calibration run of Iai only sanitized the visible summary output but not the metrics in the output files themselves. So, the output of cg_annotate was still cluttered with the initialization code, setup functions and their metrics.

  • Changes to the Iai-Callgrind library almost never have an influence on the benchmark metrics, since the actual runner (iai-callgrind-runner), and thus 99% of the code needed to run the benchmarks, is isolated from the benchmarks in an independent binary. This is in contrast to the library of Iai, which is compiled together with the benchmarks.

  • Iai-Callgrind has functionality in place that provides a more constant environment, like the Sandbox and clearing environment variables.

  • Supports running other Valgrind Tools, like DHAT, Massif etc.

  • Comparison of benchmark functions.

  • Iai-Callgrind can be configured to check for performance regressions.

  • A complete implementation of Valgrind Client Requests is available in Iai-Callgrind itself.

  • Comparison of benchmarks to baselines instead of only to .old files.

  • Iai-Callgrind natively supports benchmarking binaries.

  • Iai-Callgrind can print machine-readable output in .json format.

I don't see any downside in using Iai-Callgrind instead of Iai.