Introduction
This is the guide for Iai-Callgrind, a benchmarking framework/harness which uses Valgrind's Callgrind and other Valgrind tools like DHAT, Massif, ... to provide extremely accurate and consistent measurements of Rust code, making it perfectly suited to run in environments like a CI.
Iai-Callgrind is fully documented in this guide and in the api documentation at docs.rs.
Iai-Callgrind is
- Precise: High-precision measurements of Instruction counts and many other metrics allow you to reliably detect very small optimizations and regressions of your code.
- Consistent: Iai-Callgrind can take accurate measurements even in virtualized CI environments and make them comparable between different systems, completely negating the noise of the environment.
- Fast: Each benchmark is only run once, which is usually much faster than benchmarks which measure execution and wall-clock time. Benchmarks measuring the wall-clock time have to be run many times to increase their accuracy, detect outliers, filter out noise, etc.
- Visualizable: Iai-Callgrind generates a Callgrind (DHAT, ...) profile of the benchmarked code and can be configured to create flamegraph-like charts from Callgrind metrics. In general, all Valgrind-compatible tools like callgrind_annotate, kcachegrind or dh_view.html are fully supported for analyzing the results in detail.
- Easy: The API for setting up benchmarks is easy to use and allows you to quickly create concise and clear benchmarks. Focus more on profiling and your code than on the framework.
Design philosophy and goals
Iai-Callgrind benchmarks are designed to be runnable with cargo bench. The benchmark files are expanded to a benchmarking harness which replaces the native benchmark harness of Rust. Iai-Callgrind is a profiling framework that can quickly and reliably detect performance regressions and optimizations even in noisy environments with a precision that is impossible to achieve with wall-clock time based benchmarks. At the same time, we want to abstract the complicated parts and repetitive tasks away and provide an easy-to-use and intuitive api. Iai-Callgrind tries to stay out of your way so you can focus more on profiling and your code!
When not to use Iai-Callgrind
Although Iai-Callgrind is useful in many projects, there are cases where Iai-Callgrind is not a good fit.
- If you need wall-clock times, Iai-Callgrind cannot help you much. The estimation of CPU cycles merely correlates with wall-clock time but is not a replacement for it. The cycles estimation is primarily designed as a relative metric to be used for comparisons.
- Iai-Callgrind cannot be run on Windows and platforms not supported by Valgrind.
Improving Iai-Callgrind
No one's perfect!
You want to share your experience with Iai-Callgrind and have a recipe that might be useful for others and fits into this guide? You have an idea for a new feature, are missing some functionality or have found a bug? We would love to hear about it. You want to contribute and hack on Iai-Callgrind?
Please don't hesitate to open an issue.
You want to hack on this guide? The source code of this book lives in the docs subdirectory.
Getting Help
Reach out to us on Github Discussions or open an Issue in the Iai-Callgrind Repository. Check the open and closed issues in the issue board, maybe you can already find a solution to your problem there.
The api documentation can be found on docs.rs but you might also want to check out the Troubleshooting section in the sidebar of this guide.
Prerequisites
In order to use Iai-Callgrind, you must have Valgrind installed. This means that Iai-Callgrind cannot be used on platforms that are not supported by Valgrind.
Debug Symbols
It's required to run the Iai-Callgrind benchmarks with debugging symbols switched on. For example, in your ~/.cargo/config or your project's Cargo.toml:
[profile.bench]
debug = true
Now, all benchmarks which are run with cargo bench include the debug symbols. (See also Cargo Profiles and Cargo Config).
Settings like strip = true or other configuration options which strip the debug symbols need to be disabled explicitly for the bench profile if you have changed them for the release profile. For example:
[profile.release]
strip = true
[profile.bench]
debug = true
strip = false
Valgrind Client Requests
If you want to make use of the mighty Valgrind Client Request Mechanism shipped with Iai-Callgrind, you also need libclang (clang >= 5.0) installed. See also the requirements of bindgen and of cc.
More details on the usage and requirements of Valgrind Client Requests can be found in this chapter of the guide.
Installation of Valgrind
Iai-Callgrind is intentionally independent of a specific version of valgrind. However, Iai-Callgrind was only tested with versions of valgrind >= 3.20.0. It is therefore highly recommended to use a recent version of valgrind. Bugs get fixed, the supported platforms are expanded, and so on. Also, if you want or need to, building valgrind from source is usually a straightforward process. Just make sure the valgrind binary is in your $PATH so that Iai-Callgrind can find it.
Installation of valgrind with your package manager
Alpine Linux
apk add valgrind
Arch Linux
pacman -Sy valgrind
Debian/Ubuntu
apt-get install valgrind
Fedora Linux
dnf install valgrind
FreeBSD
pkg install valgrind
Valgrind packages are also available for many other distributions.
Iai-Callgrind
Iai-Callgrind is divided into the library iai-callgrind and the benchmark runner iai-callgrind-runner.
Installation of the library
To start with Iai-Callgrind, add the following to your Cargo.toml file:
[dev-dependencies]
iai-callgrind = "0.14.0"
or run
cargo add --dev iai-callgrind@0.14.0
Installation of the benchmark runner
To be able to run the benchmarks you'll also need the iai-callgrind-runner binary installed somewhere in your $PATH. Otherwise, there is no need to interact with iai-callgrind-runner as it is just an implementation detail.
From Source
cargo install --version 0.14.0 iai-callgrind-runner
There's also the possibility to install the binary somewhere else and point the IAI_CALLGRIND_RUNNER environment variable to the absolute path of the iai-callgrind-runner binary like so:
cargo install --version 0.14.0 --root /tmp iai-callgrind-runner
IAI_CALLGRIND_RUNNER=/tmp/bin/iai-callgrind-runner cargo bench --bench my-bench
Binstall
The iai-callgrind-runner binary is pre-built for most platforms supported by valgrind and easily installable with binstall:
cargo binstall iai-callgrind-runner@0.14.0
Updating
When updating the iai-callgrind library, you'll also need to update iai-callgrind-runner and vice versa, or else the benchmark runner will exit with an error.
In the Github CI
Since the iai-callgrind-runner version must match the iai-callgrind library version, it's best to automate this step in the CI. A job step in the GitHub Actions CI could look like this:
- name: Install iai-callgrind-runner
run: |
version=$(cargo metadata --format-version=1 |\
jq '.packages[] | select(.name == "iai-callgrind").version' |\
tr -d '"'
)
cargo install iai-callgrind-runner --version $version
Or, speed up the overall installation time with binstall using the taiki-e/install-action:
- uses: taiki-e/install-action@cargo-binstall
- name: Install iai-callgrind-runner
run: |
version=$(cargo metadata --format-version=1 |\
jq '.packages[] | select(.name == "iai-callgrind").version' |\
tr -d '"'
)
cargo binstall --no-confirm iai-callgrind-runner --version $version
Overview
Iai-Callgrind can be used to benchmark the library and binary of your project's crates. Library and binary benchmarks are treated differently by Iai-Callgrind and cannot be intermixed in the same benchmark file. This is indeed a feature and helps keep things organized. Having different and multiple benchmark files for library and binary benchmarks is no problem for Iai-Callgrind and is usually a good idea anyway. Having benchmarks for different binaries in the same benchmark file, however, is fully supported.
Head over to the Quickstart section of library benchmarks if you want to start benchmarking your library functions or to the Quickstart section of binary benchmarks if you want to start benchmarking your crate's binary (binaries).
Binary Benchmarks vs Library Benchmarks
Almost all binary benchmarks can be written as library benchmarks. For example, if you have a main.rs file of your binary which basically looks like this
mod my_lib { pub fn run() {} } use my_lib::run; fn main() { run(); }
you could also choose to benchmark the library function my_lib::run in a library benchmark instead of the binary in a binary benchmark. There's no real downside to either of the benchmark schemes and which scheme you want to use heavily depends on the structure of your binary. As a maybe obvious rule of thumb, micro-benchmarks of specific functions should go into library benchmarks and macro-benchmarks into binary benchmarks. Generally, choose the closest access point to the program point you actually want to benchmark.
You should always choose binary benchmarks over library benchmarks if you want to benchmark the behaviour of the executable when the input comes from a pipe, since this feature is exclusive to binary benchmarks. See The Command's stdin and simulating piped input for more.
Library Benchmarks
You want to dive into benchmarking your library? Best start with the Quickstart section and then go through the examples in the other sections of this guide. If you need more examples, see here.
Important default behaviour
The environment variables are cleared before running a library benchmark. Have a look into the Configuration section if you need to change that behavior. Iai-Callgrind sometimes deviates from the valgrind defaults which are:

| Iai-Callgrind | Valgrind (v3.23) |
|---|---|
| --trace-children=yes | --trace-children=no |
| --fair-sched=try | --fair-sched=no |
| --separate-threads=yes | --separate-threads=no |
| --cache-sim=yes | --cache-sim=no |
The thread- and subprocess-specific valgrind options basically enable the tracing of threads and subprocesses, but usually some additional configuration is necessary to actually collect the metrics of threads and subprocesses.
As shown in the table above, the benchmarks run with cache simulation switched on. This adds run time. If you don't need the cache metrics and the estimation of cycles, you can easily switch cache simulation off, for example with:
#![allow(unused)] fn main() { extern crate iai_callgrind; use iai_callgrind::LibraryBenchmarkConfig; LibraryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]); }
To switch off cache simulation for all benchmarks in the same file:
extern crate iai_callgrind; mod my_lib { pub fn fibonacci(a: u64) -> u64 { a } } use iai_callgrind::{ main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig }; use std::hint::black_box; #[library_benchmark] fn bench_fibonacci() -> u64 { black_box(my_lib::fibonacci(10)) } library_benchmark_group!(name = fibonacci_group; benchmarks = bench_fibonacci); fn main() { main!( config = LibraryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]); library_benchmark_groups = fibonacci_group ); }
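As mentioned above, environment variables are cleared by default. A minimal, hedged sketch of how that could be changed, assuming LibraryBenchmarkConfig provides an env_clear method (check the Configuration section and the api documentation for the authoritative name and signature):

```rust
extern crate iai_callgrind;
use iai_callgrind::LibraryBenchmarkConfig;

fn main() {
    // Assumption: `env_clear(false)` keeps the environment variables of the
    // benchmark process instead of clearing them before the run.
    let _config = LibraryBenchmarkConfig::default().env_clear(false);
}
```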
Quickstart
Create a file $WORKSPACE_ROOT/benches/library_benchmark.rs and add
[[bench]]
name = "library_benchmark"
harness = false
to your Cargo.toml. harness = false tells cargo not to use the default Rust benchmarking harness, which is important because Iai-Callgrind has its own benchmarking harness.
Then copy the following content into this file:
extern crate iai_callgrind; use iai_callgrind::{main, library_benchmark_group, library_benchmark}; use std::hint::black_box; fn fibonacci(n: u64) -> u64 { match n { 0 => 1, 1 => 1, n => fibonacci(n - 1) + fibonacci(n - 2), } } #[library_benchmark] #[bench::short(10)] #[bench::long(30)] fn bench_fibonacci(value: u64) -> u64 { black_box(fibonacci(value)) } library_benchmark_group!( name = bench_fibonacci_group; benchmarks = bench_fibonacci ); fn main() { main!(library_benchmark_groups = bench_fibonacci_group); }
Now that your first library benchmark is set up, you can run it with cargo bench and should see something like the output below
library_benchmark::bench_fibonacci_group::bench_fibonacci short:10
Instructions: 1734|N/A (*********)
L1 Hits: 2359|N/A (*********)
L2 Hits: 0|N/A (*********)
RAM Hits: 3|N/A (*********)
Total read+write: 2362|N/A (*********)
Estimated Cycles: 2464|N/A (*********)
library_benchmark::bench_fibonacci_group::bench_fibonacci long:30
Instructions: 26214734|N/A (*********)
L1 Hits: 35638616|N/A (*********)
L2 Hits: 2|N/A (*********)
RAM Hits: 4|N/A (*********)
Total read+write: 35638622|N/A (*********)
Estimated Cycles: 35638766|N/A (*********)
In addition, you'll find the callgrind output and the output of other valgrind tools in target/iai, if you want to investigate further with a tool like callgrind_annotate etc.
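The Estimated Cycles line is not measured directly but derived from the cache metrics. Judging from the numbers above, it follows the familiar Iai-style cost model of 5 cycles per L2 hit and 35 cycles per RAM hit (see the metrics chapter of this guide for the authoritative definition); a small sketch that reproduces the `short:10` run:

```rust
// Hedged sketch of the cost model behind "Estimated Cycles":
// one cycle per L1 hit, 5 per L2 hit, 35 per RAM hit.
fn estimated_cycles(l1_hits: u64, l2_hits: u64, ram_hits: u64) -> u64 {
    l1_hits + 5 * l2_hits + 35 * ram_hits
}

fn main() {
    // Matches the `short:10` output above: 2359 + 5 * 0 + 35 * 3 = 2464
    assert_eq!(estimated_cycles(2359, 0, 3), 2464);
}
```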
When running the same benchmark again, the output will report the differences between the current and the previous run. Say you've made a change to the fibonacci function, then you may see something like this:
library_benchmark::bench_fibonacci_group::bench_fibonacci short:10
Instructions: 2805|1734 (+61.7647%) [+1.61765x]
L1 Hits: 3815|2359 (+61.7211%) [+1.61721x]
L2 Hits: 0|0 (No change)
RAM Hits: 3|3 (No change)
Total read+write: 3818|2362 (+61.6427%) [+1.61643x]
Estimated Cycles: 3920|2464 (+59.0909%) [+1.59091x]
library_benchmark::bench_fibonacci_group::bench_fibonacci long:30
Instructions: 16201597|26214734 (-38.1966%) [-1.61803x]
L1 Hits: 22025876|35638616 (-38.1966%) [-1.61803x]
L2 Hits: 2|2 (No change)
RAM Hits: 4|4 (No change)
Total read+write: 22025882|35638622 (-38.1966%) [-1.61803x]
Estimated Cycles: 22026026|35638766 (-38.1964%) [-1.61803x]
Anatomy of a library benchmark
We're reusing our example from the Quickstart section.
extern crate iai_callgrind; use iai_callgrind::{main, library_benchmark_group, library_benchmark}; use std::hint::black_box; fn fibonacci(n: u64) -> u64 { match n { 0 => 1, 1 => 1, n => fibonacci(n - 1) + fibonacci(n - 2), } } #[library_benchmark] #[bench::short(10)] #[bench::long(30)] fn bench_fibonacci(value: u64) -> u64 { black_box(fibonacci(value)) } library_benchmark_group!( name = bench_fibonacci_group; benchmarks = bench_fibonacci ); fn main() { main!(library_benchmark_groups = bench_fibonacci_group); }
First of all, you need a public function in your library which you want to benchmark. In this example this is the fibonacci function which, for the sake of simplicity, lives in the benchmark file itself but doesn't have to. If it had been located in my_lib::fibonacci, you simply import that function with use my_lib::fibonacci and go on as shown above. Next, you need a library_benchmark_group! in which you specify the names of the benchmark functions. Finally, the benchmark harness is created by the main! macro.
The benchmark function
The benchmark function has to be annotated with the #[library_benchmark] attribute. The #[bench] attribute is an inner attribute of the #[library_benchmark] attribute. It consists of a mandatory id (the ID part in #[bench::ID(/* ... */)]) and, in its most basic form, an optional list of arguments which are passed to the benchmark function as parameters. Naturally, the parameters of the benchmark function must match the argument list of the #[bench] attribute. It is always a good idea to return something from the benchmark function, here it is the computed u64 value from the fibonacci function wrapped in a black_box. See the docs of std::hint::black_box for more information about its usage. Simply put, all values and variables in the benchmarking function (but not in your library function) need to be wrapped in a black_box, except for the input parameters (here value), because Iai-Callgrind already does that. But, it is no error to black_box the value again.
The #[bench] attribute accepts any expression, which includes function calls. The following would have worked too and is one way to avoid the costs of the setup code being attributed to the benchmarked function.
extern crate iai_callgrind; use iai_callgrind::{main, library_benchmark_group, library_benchmark}; use std::hint::black_box; fn some_setup_func(value: u64) -> u64 { value + 10 } fn fibonacci(n: u64) -> u64 { match n { 0 => 1, 1 => 1, n => fibonacci(n - 1) + fibonacci(n - 2), } } #[library_benchmark] #[bench::short(10)] // Note the usage of the `some_setup_func` in the argument list of this #[bench] #[bench::long(some_setup_func(20))] fn bench_fibonacci(value: u64) -> u64 { black_box(fibonacci(value)) } library_benchmark_group!( name = bench_fibonacci_group; benchmarks = bench_fibonacci ); fn main() { main!(library_benchmark_groups = bench_fibonacci_group); }
Perhaps the most crucial part in setting up library benchmarks is to keep the body of benchmark functions clean from any setup or teardown code. There are other ways to avoid setup and teardown code in the benchmark function, which are discussed in full detail in the setup and teardown section.
The group
The names of the benchmark functions which should be benchmarked, here the only benchmark function bench_fibonacci, need to be specified in the benchmarks parameter of a library_benchmark_group!. You can create as many groups as you like, and you can use them to organize related benchmarks. Each group needs a unique name.
The main macro
Each group you want to be benchmarked needs to be specified in the library_benchmark_groups parameter of the main! macro and you're all set.
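For example, a minimal sketch with two groups (the group and function names are made up for illustration) could look like this:

```rust
extern crate iai_callgrind;
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

#[library_benchmark]
fn bench_foo() -> u64 {
    black_box(1 + 1)
}

#[library_benchmark]
fn bench_bar() -> u64 {
    black_box(2 + 2)
}

library_benchmark_group!(name = foo_group; benchmarks = bench_foo);
library_benchmark_group!(name = bar_group; benchmarks = bench_bar);

fn main() {
    // Multiple groups are separated by a comma
    main!(library_benchmark_groups = foo_group, bar_group);
}
```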
The macros in more detail
This section is a brief reference to all the macros available in library benchmarks. Feel free to come back here from other sections if you need a reference. For the complete documentation of each macro see the api Documentation.
For the following examples it is assumed that there is a file lib.rs in a crate named my_lib with the following content:
#![allow(unused)] fn main() { pub fn bubble_sort(mut array: Vec<i32>) -> Vec<i32> { for i in 0..array.len() { for j in 0..array.len() - i - 1 { if array[j + 1] < array[j] { array.swap(j, j + 1); } } } array } }
The #[library_benchmark] attribute
This attribute needs to be present on all benchmark functions specified in the library_benchmark_group. The benchmark function can then be further annotated with the inner #[bench] or #[benches] attributes.
extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; #[library_benchmark] #[bench::one(vec![1])] #[benches::multiple(vec![1, 2], vec![1, 2, 3], vec![1, 2, 3, 4])] fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> { black_box(my_lib::bubble_sort(values)) } library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort); fn main() { main!(library_benchmark_groups = bubble_sort_group); }
The following parameters are accepted:
- config: Takes a LibraryBenchmarkConfig
- setup: A global setup function which is applied to all following #[bench] and #[benches] attributes if not overwritten by a setup parameter of these attributes.
- teardown: Similar to setup but takes a global teardown function.
extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } } use iai_callgrind::{ library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, OutputFormat }; use std::hint::black_box; #[library_benchmark( config = LibraryBenchmarkConfig::default() .output_format(OutputFormat::default() .truncate_description(None) ) )] #[bench::one(vec![1])] fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> { black_box(my_lib::bubble_sort(values)) } library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort); fn main() { main!(library_benchmark_groups = bubble_sort_group); }
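The global setup and teardown parameters can be combined, as in the following hedged sketch (the helper functions are made up for illustration; see also the setup and teardown section):

```rust
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

// Made-up helper: builds a reversed vector, the worst case for bubble sort
fn create_worst_case(len: i32) -> Vec<i32> {
    (0..len).rev().collect()
}

// Made-up helper: prints the length of the sorted result
fn print_result(sorted: Vec<i32>) {
    println!("sorted {} elements", sorted.len());
}

// `setup` converts the `#[bench]` arguments, `teardown` receives the return value
#[library_benchmark(setup = create_worst_case, teardown = print_result)]
#[bench::small(10)]
#[bench::big(100)]
fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(values))
}

library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort);
fn main() { main!(library_benchmark_groups = bubble_sort_group); }
```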
The #[bench] attribute
The basic structure is #[bench::some_id(/* parameters */)]. The part after the :: must be an id unique within the same #[library_benchmark]. This attribute accepts the following parameters:
- args: A tuple with a list of arguments which are passed to the benchmark function. The parentheses also need to be present if there is only a single argument (#[bench::my_id(args = (10))]).
- config: Accepts a LibraryBenchmarkConfig
- setup: A function which takes the arguments specified in the args parameter and passes its return value to the benchmark function.
- teardown: A function which takes the return value of the benchmark function.
If no other parameters besides args are present you can simply pass the arguments as a list of values. So, instead of #[bench::my_id(args = (10, 20))], you could also use the shorter #[bench::my_id(10, 20)].
extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } } use iai_callgrind::{library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig}; use std::hint::black_box; // This function is used to create a worst case array we want to sort with our implementation of // bubble sort pub fn worst_case(start: i32) -> Vec<i32> { if start.is_negative() { (start..0).rev().collect() } else { (0..start).rev().collect() } } #[library_benchmark] #[bench::one(vec![1])] #[bench::worst_two(args = (vec![2, 1]))] #[bench::worst_four(args = (4), setup = worst_case)] fn bench_bubble_sort(value: Vec<i32>) -> Vec<i32> { black_box(my_lib::bubble_sort(value)) } library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort); fn main() { main!(library_benchmark_groups = bubble_sort_group); }
The #[benches] attribute
This attribute is used to specify multiple benchmarks at once. It accepts the same parameters as the #[bench] attribute: args, config, setup and teardown, and additionally the file parameter which is explained in detail here. In contrast to the args parameter in #[bench], args takes an array of arguments.
extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } } use iai_callgrind::{library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig}; use std::hint::black_box; pub fn worst_case(start: i32) -> Vec<i32> { if start.is_negative() { (start..0).rev().collect() } else { (0..start).rev().collect() } } #[library_benchmark] #[benches::worst_two_and_three(args = [vec![2, 1], vec![3, 2, 1]])] #[benches::worst_four_to_nine(args = [4, 5, 6, 7, 8, 9], setup = worst_case)] fn bench_bubble_sort(value: Vec<i32>) -> Vec<i32> { black_box(my_lib::bubble_sort(value)) } library_benchmark_group!(name = bubble_sort_group; benchmarks = bench_bubble_sort); fn main() { main!(library_benchmark_groups = bubble_sort_group); }
The library_benchmark_group! macro
The library_benchmark_group macro accepts the following parameters (in this order and separated by a semicolon):
- name (mandatory): A unique name used to identify the group for the main! macro
- config (optional): A LibraryBenchmarkConfig which is applied to all benchmarks within the same group.
- compare_by_id (optional): The default is false. If true, all benches in the benchmark functions specified in the benchmarks parameter are compared with each other as long as the ids (the part after the :: in #[bench::id(...)]) match. See also Comparing benchmark functions.
- setup (optional): A setup function or any valid expression which is run before all benchmarks of this group
- teardown (optional): A teardown function or any valid expression which is run after all benchmarks of this group
- benchmarks (mandatory): A list of comma separated paths of benchmark functions which are annotated with #[library_benchmark]
Note the setup and teardown parameters are different from the ones of #[library_benchmark], #[bench] and #[benches]. They accept an expression or function call as in setup = group_setup_function(). Also, these setup and teardown functions are not overridden by the ones from any of the aforementioned attributes.
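For illustration, a hedged sketch of a group using setup and teardown expressions (the helper functions and the fixture path are made up):

```rust
extern crate iai_callgrind;
mod my_lib { pub fn count_lines(_path: &str) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

// Made-up helpers which create and remove a temporary input file for the group
fn create_fixture() { std::fs::write("/tmp/fixture.txt", "line\n").unwrap(); }
fn delete_fixture() { std::fs::remove_file("/tmp/fixture.txt").unwrap(); }

#[library_benchmark]
fn bench_count_lines() -> u64 {
    black_box(my_lib::count_lines("/tmp/fixture.txt"))
}

library_benchmark_group!(
    name = my_group;
    // Run before/after all benchmarks of this group; note these are
    // expressions (function calls), not bare function names
    setup = create_fixture();
    teardown = delete_fixture();
    benchmarks = bench_count_lines
);

fn main() { main!(library_benchmark_groups = my_group); }
```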
The main! macro
This macro is the entry point for Iai-Callgrind and creates the benchmark harness. It accepts the following top-level arguments in this order (separated by a semicolon):
- config (optional): Optionally specify a LibraryBenchmarkConfig
- setup (optional): A setup function or any valid expression which is run before all benchmarks
- teardown (optional): A teardown function or any valid expression which is run after all benchmarks
- library_benchmark_groups (mandatory): The name of one or more library benchmark groups. Multiple names are separated by a comma.
Like the setup and teardown of the library_benchmark_group, these parameters accept an expression and are not overridden by the setup and teardown of the library_benchmark_group, #[library_benchmark], #[bench] or #[benches] attribute.
setup and teardown
setup and teardown are your bread and butter in library benchmarks. The benchmark functions need to be as clean as possible and almost always only contain the function call to the function of your library which you want to benchmark.
Setup
In an ideal world you don't need any setup code, and you can pass arguments to the function as they are.
But, for example if a function expects a File and not a &str with the path to the file, you need setup code. Iai-Callgrind has an easy-to-use system in place to allow you to run any setup code before the function is executed, and this setup code is not attributed to the metrics of the benchmark.
If the setup parameter is specified, the setup function takes the arguments from the #[bench] (or #[benches]) attributes and the benchmark function receives the return value of the setup function as parameter. This is a small indirection with great effect. The effect is best shown with an example:
extern crate iai_callgrind; mod my_lib { pub fn count_bytes_fast(_file: std::fs::File) -> u64 { 1 } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; use std::path::PathBuf; use std::fs::File; fn open_file(path: &str) -> File { File::open(path).unwrap() } #[library_benchmark] #[bench::first(args = ("path/to/file"), setup = open_file)] fn count_bytes_fast(file: File) -> u64 { black_box(my_lib::count_bytes_fast(file)) } library_benchmark_group!(name = my_group; benchmarks = count_bytes_fast); fn main() { main!(library_benchmark_groups = my_group); }
You can actually see the effect of using a setup function in the output of the benchmark. Let's assume the above benchmark is in a file benches/my_benchmark.rs, then running
IAI_CALLGRIND_NOCAPTURE=true cargo bench
results in benchmark output like the one below.
my_benchmark::my_group::count_bytes_fast first:open_file("path/to/file")
Instructions: 1630162|N/A (*********)
L1 Hits: 2507933|N/A (*********)
L2 Hits: 2|N/A (*********)
RAM Hits: 11|N/A (*********)
Total read+write: 2507946|N/A (*********)
Estimated Cycles: 2508328|N/A (*********)
The description in the headline contains open_file("path/to/file"), your setup function open_file with the value of the parameter it is called with.
If you need to specify the same setup function for all (or almost all) #[bench] and #[benches] in a #[library_benchmark], you can use the setup parameter of the #[library_benchmark]:
extern crate iai_callgrind; mod my_lib { pub fn count_bytes_fast(_file: std::fs::File) -> u64 { 1 } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; use std::path::PathBuf; use std::fs::File; use std::io::{Seek, SeekFrom}; fn open_file(path: &str) -> File { File::open(path).unwrap() } fn open_file_with_offset(path: &str, offset: u64) -> File { let mut file = File::open(path).unwrap(); file.seek(SeekFrom::Start(offset)).unwrap(); file } #[library_benchmark(setup = open_file)] #[bench::small("path/to/small")] #[bench::big("path/to/big")] #[bench::with_offset(args = ("path/to/big", 100), setup = open_file_with_offset)] fn count_bytes_fast(file: File) -> u64 { black_box(my_lib::count_bytes_fast(file)) } library_benchmark_group!(name = my_group; benchmarks = count_bytes_fast); fn main() { main!(library_benchmark_groups = my_group); }
The above will use the open_file function in the small and big benchmarks and the open_file_with_offset function in the with_offset benchmark.
Teardown
What about teardown and why should you use it? Usually, teardown isn't needed, but if you intend, for example, to make the result of the benchmark visible in the benchmark output, teardown is the perfect place to do so. The teardown function takes the return value of the benchmark function as its argument:
extern crate iai_callgrind; mod my_lib { pub fn count_bytes_fast(_file: std::fs::File) -> u64 { 1 } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; use std::path::PathBuf; use std::fs::File; fn open_file(path: &str) -> File { File::open(path).unwrap() } fn print_bytes_read(num_bytes: u64) { println!("bytes read: {num_bytes}"); } #[library_benchmark] #[bench::first( args = ("path/to/big"), setup = open_file, teardown = print_bytes_read )] fn count_bytes_fast(file: File) -> u64 { black_box(my_lib::count_bytes_fast(file)) } library_benchmark_group!(name = my_group; benchmarks = count_bytes_fast); fn main() { main!(library_benchmark_groups = my_group); }
Note Iai-Callgrind captures all output by default. In order to actually see the output of the benchmark, setup and teardown functions, it is required to run the benchmarks with the flag --nocapture or set the environment variable IAI_CALLGRIND_NOCAPTURE=true. Let's assume the above benchmark is in a file benches/my_benchmark.rs, then running
IAI_CALLGRIND_NOCAPTURE=true cargo bench
results in output like the one below
my_benchmark::my_group::count_bytes_fast first:open_file("path/to/big")
bytes read: 25078
- end of stdout/stderr
Instructions: 1630162|N/A (*********)
L1 Hits: 2507931|N/A (*********)
L2 Hits: 2|N/A (*********)
RAM Hits: 13|N/A (*********)
Total read+write: 2507946|N/A (*********)
Estimated Cycles: 2508396|N/A (*********)
The output of the teardown function is now visible in the benchmark output above the - end of stdout/stderr line.
Specifying multiple benches at once
Multiple benches can be specified at once with the #[benches] attribute.
The #[benches] attribute in more detail
Let's start with an example:
extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; use my_lib::bubble_sort; fn setup_worst_case_array(start: i32) -> Vec<i32> { if start.is_negative() { (start..0).rev().collect() } else { (0..start).rev().collect() } } #[library_benchmark] #[benches::multiple(vec![1], vec![5])] #[benches::with_setup(args = [1, 5], setup = setup_worst_case_array)] fn bench_bubble_sort_with_benches_attribute(input: Vec<i32>) -> Vec<i32> { black_box(bubble_sort(input)) } library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort_with_benches_attribute); fn main () { main!(library_benchmark_groups = my_group); }
Usually the arguments are passed directly to the benchmarking function, as can be seen in the #[benches::multiple(/* arguments */)] case. In #[benches::with_setup(/* ... */)], the arguments are passed to the setup function instead. The above #[library_benchmark] is pretty much the same as
extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(value: Vec<i32>) -> Vec<i32> { value } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; use my_lib::bubble_sort; fn setup_worst_case_array(start: i32) -> Vec<i32> { if start.is_negative() { (start..0).rev().collect() } else { (0..start).rev().collect() } } #[library_benchmark] #[bench::multiple_0(vec![1])] #[bench::multiple_1(vec![5])] #[bench::with_setup_0(setup_worst_case_array(1))] #[bench::with_setup_1(setup_worst_case_array(5))] fn bench_bubble_sort_with_benches_attribute(input: Vec<i32>) -> Vec<i32> { black_box(bubble_sort(input)) } library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort_with_benches_attribute); fn main () { main!(library_benchmark_groups = my_group); }
but a lot more concise, especially if a lot of values are passed to the same setup function.
The file parameter
Reading inputs from a file allows, for example, sharing the same inputs between different benchmarking frameworks like criterion. Or, if you simply have a long list of inputs, you might find it more convenient to read them from a file.
The file parameter, exclusive to the #[benches] attribute, does exactly that and reads the specified file line by line, creating a benchmark from each line. The line is passed to the benchmark function as a String or, if the setup parameter is also present, to the setup function. A small example, assuming you have a file benches/inputs (relative paths are interpreted relative to the workspace root) with the following content
1
11
111
then
extern crate iai_callgrind;
mod my_lib { pub fn string_to_u64(value: String) -> Result<u64, String> { Ok(1) } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;
#[library_benchmark]
#[benches::from_file(file = "benches/inputs")]
fn some_bench(line: String) -> Result<u64, String> {
black_box(my_lib::string_to_u64(line))
}
library_benchmark_group!(name = my_group; benchmarks = some_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}
The above is roughly equivalent to the following but with the args parameter
extern crate iai_callgrind; mod my_lib { pub fn string_to_u64(value: String) -> Result<u64, String> { Ok(1) } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; #[library_benchmark] #[benches::from_args(args = [1.to_string(), 11.to_string(), 111.to_string()])] fn some_bench(line: String) -> Result<u64, String> { black_box(my_lib::string_to_u64(line)) } library_benchmark_group!(name = my_group; benchmarks = some_bench); fn main() { main!(library_benchmark_groups = my_group); }
The true power of the file parameter comes with the setup function because you can format the lines in the file as you like and convert each line in the setup function to the format you need in the benchmark. For example, if you decided to go with a csv-like format in the file benches/inputs
255;255;255
0;0;0
and your library has a function which converts from RGB to HSV color space:
extern crate iai_callgrind;
mod my_lib { pub fn rgb_to_hsv(a: u8, b: u8, c:u8) -> (u16, u8, u8) { (a.into(), b, c) } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;
fn decode_line(line: String) -> (u8, u8, u8) {
if let &[a, b, c] = line.split(";")
.map(|s| s.parse::<u8>().unwrap())
.collect::<Vec<u8>>()
.as_slice()
{
(a, b, c)
} else {
panic!("Wrong input format in line '{line}'");
}
}
#[library_benchmark]
#[benches::from_file(file = "benches/inputs", setup = decode_line)]
fn some_bench((a, b, c): (u8, u8, u8)) -> (u16, u8, u8) {
black_box(my_lib::rgb_to_hsv(black_box(a), black_box(b), black_box(c)))
}
library_benchmark_group!(name = my_group; benchmarks = some_bench);
fn main() {
main!(library_benchmark_groups = my_group);
}
Generic benchmark functions
Benchmark functions can be generic, and so can setup and teardown functions. There's actually not much more to say about it since generic benchmark (and setup and teardown) functions behave exactly the same way as you would expect from any other generic function.
However, there is a common pitfall. Suppose you have a function count_lines_in_file_fast which expects a PathBuf as parameter. Making the benchmark function generic over Into<PathBuf> is convenient, especially when you have to specify many paths, but don't do this:
extern crate iai_callgrind; mod my_lib { pub fn count_lines_in_file_fast(_path: std::path::PathBuf) -> u64 { 1 } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; use std::path::PathBuf; #[library_benchmark] #[bench::first("path/to/file")] fn generic_bench<T>(path: T) -> u64 where T: Into<PathBuf> { black_box(my_lib::count_lines_in_file_fast(black_box(path.into()))) } library_benchmark_group!(name = my_group; benchmarks = generic_bench); fn main() { main!(library_benchmark_groups = my_group); }
Since path.into() is called in the benchmark function itself, the conversion from a &str to a PathBuf is attributed to the benchmark metrics. This is almost never what you intended. You should instead convert the argument to a PathBuf in a generic setup function like this:
extern crate iai_callgrind; mod my_lib { pub fn count_lines_in_file_fast(_path: std::path::PathBuf) -> u64 { 1 } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; use std::path::PathBuf; fn convert_to_pathbuf<T>(path: T) -> PathBuf where T: Into<PathBuf> { path.into() } #[library_benchmark] #[bench::first(args = ("path/to/file"), setup = convert_to_pathbuf)] fn not_generic_anymore(path: PathBuf) -> u64 { black_box(my_lib::count_lines_in_file_fast(path)) } library_benchmark_group!(name = my_group; benchmarks = not_generic_anymore); fn main() { main!(library_benchmark_groups = my_group); }
That way you can still enjoy the convenience of using string literals instead of PathBuf in your #[bench] (or #[benches]) arguments and still have clean benchmark metrics.
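Building on the example above, a hedged sketch of how the same setup function scales to many paths with the #[benches] attribute (the paths are made up for illustration):

```rust
extern crate iai_callgrind;
mod my_lib { pub fn count_lines_in_file_fast(_path: std::path::PathBuf) -> u64 { 1 } }
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;
use std::path::PathBuf;

fn convert_to_pathbuf<T>(path: T) -> PathBuf where T: Into<PathBuf> {
    path.into()
}

// Each string literal is converted by `convert_to_pathbuf` before the benchmark runs
#[library_benchmark]
#[benches::many_files(
    args = ["path/to/small", "path/to/medium", "path/to/big"],
    setup = convert_to_pathbuf
)]
fn count_lines(path: PathBuf) -> u64 {
    black_box(my_lib::count_lines_in_file_fast(path))
}

library_benchmark_group!(name = my_group; benchmarks = count_lines);
fn main() { main!(library_benchmark_groups = my_group); }
```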
Comparing benchmark functions
Comparing benchmark functions is supported via the optional library_benchmark_group! argument compare_by_id (the default value for compare_by_id is false). Only benches with the same id are compared, which allows you to single out cases which don't need to be compared. In the following example, the case_3 and multiple benches are compared with each other in addition to the usual comparison with the previous run:
extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } } use iai_callgrind::{library_benchmark, library_benchmark_group, main}; use std::hint::black_box; #[library_benchmark] #[bench::case_3(vec![1, 2, 3])] #[benches::multiple(args = [vec![1, 2], vec![1, 2, 3, 4]])] fn bench_bubble_sort_best_case(input: Vec<i32>) -> Vec<i32> { black_box(my_lib::bubble_sort(input)) } #[library_benchmark] #[bench::case_3(vec![3, 2, 1])] #[benches::multiple(args = [vec![2, 1], vec![4, 3, 2, 1]])] fn bench_bubble_sort_worst_case(input: Vec<i32>) -> Vec<i32> { black_box(my_lib::bubble_sort(input)) } library_benchmark_group!( name = bench_bubble_sort; compare_by_id = true; benchmarks = bench_bubble_sort_best_case, bench_bubble_sort_worst_case ); fn main() { main!(library_benchmark_groups = bench_bubble_sort); }
Note if compare_by_id is true, all benchmark functions are compared with each other, so you are not limited to two benchmark functions per comparison group.
Here's the benchmark output of the above example to see what is happening:
my_benchmark::bubble_sort_group::bubble_sort_best_case case_2:vec! [1, 2]
Instructions: 63|N/A (*********)
L1 Hits: 86|N/A (*********)
L2 Hits: 1|N/A (*********)
RAM Hits: 4|N/A (*********)
Total read+write: 91|N/A (*********)
Estimated Cycles: 231|N/A (*********)
my_benchmark::bubble_sort_group::bubble_sort_best_case multiple_0:vec! [1, 2, 3]
Instructions: 94|N/A (*********)
L1 Hits: 123|N/A (*********)
L2 Hits: 1|N/A (*********)
RAM Hits: 4|N/A (*********)
Total read+write: 128|N/A (*********)
Estimated Cycles: 268|N/A (*********)
my_benchmark::bubble_sort_group::bubble_sort_best_case multiple_1:vec! [1, 2, 3, 4]
Instructions: 136|N/A (*********)
L1 Hits: 174|N/A (*********)
L2 Hits: 1|N/A (*********)
RAM Hits: 4|N/A (*********)
Total read+write: 179|N/A (*********)
Estimated Cycles: 319|N/A (*********)
my_benchmark::bubble_sort_group::bubble_sort_worst_case case_2:vec! [2, 1]
Instructions: 66|N/A (*********)
L1 Hits: 91|N/A (*********)
L2 Hits: 1|N/A (*********)
RAM Hits: 4|N/A (*********)
Total read+write: 96|N/A (*********)
Estimated Cycles: 236|N/A (*********)
Comparison with bubble_sort_best_case case_2:vec! [1, 2]
Instructions: 63|66 (-4.54545%) [-1.04762x]
L1 Hits: 86|91 (-5.49451%) [-1.05814x]
L2 Hits: 1|1 (No change)
RAM Hits: 4|4 (No change)
Total read+write: 91|96 (-5.20833%) [-1.05495x]
Estimated Cycles: 231|236 (-2.11864%) [-1.02165x]
my_benchmark::bubble_sort_group::bubble_sort_worst_case multiple_0:vec! [3, 2, 1]
Instructions: 103|N/A (*********)
L1 Hits: 138|N/A (*********)
L2 Hits: 1|N/A (*********)
RAM Hits: 4|N/A (*********)
Total read+write: 143|N/A (*********)
Estimated Cycles: 283|N/A (*********)
Comparison with bubble_sort_best_case multiple_0:vec! [1, 2, 3]
Instructions: 94|103 (-8.73786%) [-1.09574x]
L1 Hits: 123|138 (-10.8696%) [-1.12195x]
L2 Hits: 1|1 (No change)
RAM Hits: 4|4 (No change)
Total read+write: 128|143 (-10.4895%) [-1.11719x]
Estimated Cycles: 268|283 (-5.30035%) [-1.05597x]
my_benchmark::bubble_sort_group::bubble_sort_worst_case multiple_1:vec! [4, 3, 2, 1]
Instructions: 154|N/A (*********)
L1 Hits: 204|N/A (*********)
L2 Hits: 1|N/A (*********)
RAM Hits: 4|N/A (*********)
Total read+write: 209|N/A (*********)
Estimated Cycles: 349|N/A (*********)
Comparison with bubble_sort_best_case multiple_1:vec! [1, 2, 3, 4]
Instructions: 136|154 (-11.6883%) [-1.13235x]
L1 Hits: 174|204 (-14.7059%) [-1.17241x]
L2 Hits: 1|1 (No change)
RAM Hits: 4|4 (No change)
Total read+write: 179|209 (-14.3541%) [-1.16760x]
Estimated Cycles: 319|349 (-8.59599%) [-1.09404x]
The procedure of the comparison algorithm:
- Run all benches in the first benchmark function
- Run the first bench in the second benchmark function and if there is a bench in the first benchmark function with the same id compare them
- Run the second bench in the second benchmark function ...
- ...
- Run the first bench in the third benchmark function and if there is a bench in the first benchmark function with the same id compare them. If there is a bench with the same id in the second benchmark function compare them.
- Run the second bench in the third benchmark function ...
- and so on ... until all benches are compared with each other
Neither the order nor the number of benches within the benchmark functions matters, so it is not strictly necessary to mirror the bench ids of the first benchmark function in the second, third, etc. benchmark function.
Configuration
Library benchmarks can be configured with the LibraryBenchmarkConfig and with Command-line arguments and Environment variables.
The LibraryBenchmarkConfig can be specified at different levels and sets the configuration values for the same and lower levels. The values of the LibraryBenchmarkConfig at higher levels can be overridden at a lower level. Note that some values are additive rather than substitutive. Please see the docs of the respective functions in LibraryBenchmarkConfig for more details.
The different levels where a LibraryBenchmarkConfig can be specified:
- At top-level with the main! macro
extern crate iai_callgrind; use iai_callgrind::{library_benchmark, library_benchmark_group}; use iai_callgrind::{main, LibraryBenchmarkConfig}; #[library_benchmark] fn bench() {} library_benchmark_group!(name = my_group; benchmarks = bench); fn main() { main!( config = LibraryBenchmarkConfig::default(); library_benchmark_groups = my_group ); }
- At group-level in the library_benchmark_group! macro
extern crate iai_callgrind; use iai_callgrind::library_benchmark; use iai_callgrind::{main, LibraryBenchmarkConfig, library_benchmark_group}; #[library_benchmark] fn bench() {} library_benchmark_group!( name = my_group; config = LibraryBenchmarkConfig::default(); benchmarks = bench ); fn main() { main!(library_benchmark_groups = my_group); }
- At #[library_benchmark] level
extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } } use iai_callgrind::{ main, LibraryBenchmarkConfig, library_benchmark_group, library_benchmark }; use std::hint::black_box; #[library_benchmark(config = LibraryBenchmarkConfig::default())] fn bench() { /* ... */ } library_benchmark_group!( name = my_group; config = LibraryBenchmarkConfig::default(); benchmarks = bench ); fn main() { main!(library_benchmark_groups = my_group); }
- and at #[bench], #[benches] level
extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } } use iai_callgrind::{ main, LibraryBenchmarkConfig, library_benchmark_group, library_benchmark }; use std::hint::black_box; #[library_benchmark] #[bench::some_id(args = (1, 2), config = LibraryBenchmarkConfig::default())] #[benches::multiple( args = [(3, 4), (5, 6)], config = LibraryBenchmarkConfig::default() )] fn bench(a: u8, b: u8) { /* ... */ _ = (a, b); } library_benchmark_group!( name = my_group; config = LibraryBenchmarkConfig::default(); benchmarks = bench ); fn main() { main!(library_benchmark_groups = my_group); }
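As a hedged illustration of how the levels interact, the following sketch passes callgrind arguments at two levels. Assuming callgrind_args is one of the additive options (please check the api documentation of LibraryBenchmarkConfig), both arguments end up on the callgrind command line for this bench:

```rust
extern crate iai_callgrind;
use iai_callgrind::{library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig};

#[library_benchmark]
// Bench-level config: additionally switch off the cache simulation for this bench only
#[bench::some_id(args = (10), config = LibraryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]))]
fn bench(a: u64) -> u64 {
    std::hint::black_box(a)
}

library_benchmark_group!(name = my_group; benchmarks = bench);

fn main() {
    // Top-level config: applied to all groups and benchmarks
    main!(
        config = LibraryBenchmarkConfig::default().callgrind_args(["--branch-sim=yes"]);
        library_benchmark_groups = my_group
    );
}
```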
Custom entry points
The EntryPoint can be set to EntryPoint::None, which disables the entry point, EntryPoint::Default, which uses the benchmark function as entry point, or EntryPoint::Custom, which will be discussed in more detail in this chapter.
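For orientation, a small sketch of the three variants in a config (the custom pattern is a made-up placeholder):

```rust
extern crate iai_callgrind;
use iai_callgrind::{EntryPoint, LibraryBenchmarkConfig};

fn main() {
    // Default: toggle collection at the benchmark function (what Iai-Callgrind does anyway)
    let _default = LibraryBenchmarkConfig::default().entry_point(EntryPoint::Default);
    // None: the default toggle at the benchmark function is removed
    let _none = LibraryBenchmarkConfig::default().entry_point(EntryPoint::None);
    // Custom: toggle collection at a function matched by this (made-up) pattern
    let _custom = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::Custom("*::my_lib::private_function".to_owned()));
}
```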
To understand custom entry points, let's take a small detour into how Callgrind and Iai-Callgrind work under the hood.
Iai-Callgrind under the hood
Callgrind collects metrics and associates them with a function. This happens based on the compiled code, not the source code, so it is possible to hook into any function, not only public functions. Callgrind can be configured to switch instrumentation on and off based on a function name with --toggle-collect. By default, Iai-Callgrind sets this toggle (which we call EntryPoint) to the benchmark function. Setting the toggle implies --collect-atstart=no. So, all events before the benchmark function (in the setup) and after it (in the teardown) are not collected. Somewhat simplified, but conveying the basic idea, here is a commented example:
// <-- collect-at-start=no extern crate iai_callgrind; mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } } use iai_callgrind::{main,library_benchmark_group, library_benchmark}; use std::hint::black_box; #[library_benchmark] fn bench() -> Vec<i32> { // <-- DEFAULT ENTRY POINT starts collecting events black_box(my_lib::bubble_sort(vec![3, 2, 1])) } // <-- stop collecting events library_benchmark_group!( name = my_group; benchmarks = bench); fn main() { main!(library_benchmark_groups = my_group); }
Pitfall: Inlined functions
The fact that Callgrind acts on the compiled code harbors a pitfall. With compile-time optimizations switched on (which is usually the case when compiling benchmarks), the compiler inlines functions if it sees an advantage in doing so. Iai-Callgrind takes care that this doesn't happen with the benchmark function, so Callgrind can find and hook into the benchmark function. But in your production code you actually don't want to stop the compiler from doing its job just to be able to benchmark a function. So, be cautious with benchmarking private functions and only choose functions which are known not to be inlined.
Hook into private functions
The basic idea is to choose a public function in your library acting as access point to the actual function you want to benchmark. As outlined before, this only works reliably for functions which are not inlined by the compiler.
extern crate iai_callgrind; use iai_callgrind::{ main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig, EntryPoint }; use std::hint::black_box; mod my_lib { #[inline(never)] fn bubble_sort(input: Vec<i32>) -> Vec<i32> { // The algorithm input } pub fn access_point(input: Vec<i32>) -> Vec<i32> { println!("Doing something before the function call"); bubble_sort(input) } } #[library_benchmark( config = LibraryBenchmarkConfig::default() .entry_point(EntryPoint::Custom("*::my_lib::bubble_sort".to_owned())) )] #[bench::small(vec![3, 2, 1])] #[bench::bigger(vec![5, 4, 3, 2, 1])] fn bench_private(array: Vec<i32>) -> Vec<i32> { black_box(my_lib::access_point(array)) } library_benchmark_group!(name = my_group; benchmarks = bench_private); fn main() { main!(library_benchmark_groups = my_group); }
Note the #[inline(never)] we use in this example to make sure the bubble_sort function is not getting inlined.
We use a wildcard *::my_lib::bubble_sort for EntryPoint::Custom for demonstration purposes. You might want to tighten this pattern. If you don't know what the pattern looks like, use EntryPoint::None first, then run the benchmark. Now, investigate the callgrind output file. This output file is pretty low-level, but all you need to do is search for the entries which start with fn=.... In the example above this entry might look like fn=algorithms::my_lib::bubble_sort if my_lib were part of the top-level algorithms module. Or, using grep:
grep '^fn=.*::bubble_sort$' target/iai/the_package/benchmark_file_name/my_group/bench_private.bigger/callgrind.bench_private.bigger.out
Having found the pattern, you can eventually use EntryPoint::Custom.
Multi-threaded and multi-process applications
The default is to run Iai-Callgrind benchmarks with --separate-threads=yes and --trace-children=yes switched on. This enables Iai-Callgrind to trace threads and subprocesses, respectively. Note that --separate-threads=yes is not strictly necessary to be able to trace threads. But if threads are separated, Iai-Callgrind can collect and display the metrics for each thread. Due to the way callgrind applies data collection options like --toggle-collect, --collect-atstart, ..., further configuration is needed in library benchmarks.
To actually see the collected metrics in the terminal output for all threads and/or subprocesses, you can switch on OutputFormat::show_intermediate:
extern crate iai_callgrind; mod my_lib { pub fn find_primes_multi_thread(_: u64) -> Vec<u64> { vec![]} } use iai_callgrind::{ main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig, OutputFormat }; use std::hint::black_box; #[library_benchmark] fn bench_threads() -> Vec<u64> { black_box(my_lib::find_primes_multi_thread(2)) } library_benchmark_group!(name = my_group; benchmarks = bench_threads); fn main() { main!( config = LibraryBenchmarkConfig::default() .output_format(OutputFormat::default() .show_intermediate(true) ); library_benchmark_groups = my_group ); }
The best method for benchmarking threads and subprocesses depends heavily on your code. So, rather than suggesting a single "best" method for benchmarking threads and subprocesses, this chapter will run through various possible approaches and try to highlight the pros and cons of each.
Multi-threaded applications
Callgrind treats each thread and process as a separate unit and applies data collection options to each unit. In library benchmarks, the entry point (or the default toggle) for callgrind is by default set to the benchmark function with the help of the --toggle-collect option. Setting --toggle-collect also automatically sets --collect-atstart=no. If not further customized for a benchmarked multi-threaded function, these options cause the metrics for the spawned threads to be zero. This happens since each thread is a separate unit with --collect-atstart=no and the default toggle applied to it. The default toggle is set to the benchmark function and does not hook into any function in the thread, so the metrics are zero.
There are multiple ways to customize the default behaviour and actually measure the threads. For the following examples, we're using the benchmark and library code below to show the different customization options, assuming this code lives in a benchmark file benches/lib_bench_threads.rs
extern crate iai_callgrind; use iai_callgrind::{ main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig, OutputFormat }; use std::hint::black_box; /// Suppose this is your library pub mod my_lib { /// Return true if `num` is a prime number pub fn is_prime(num: u64) -> bool { if num <= 1 { return false; } for i in 2..=(num as f64).sqrt() as u64 { if num % i == 0 { return false; } } true } /// Find and return all prime numbers in the inclusive range `low` to `high` pub fn find_primes(low: u64, high: u64) -> Vec<u64> { (low..=high).filter(|n| is_prime(*n)).collect() } /// Return the prime numbers in the range `0..(num_threads * 10000)` pub fn find_primes_multi_thread(num_threads: usize) -> Vec<u64> { let mut handles = vec![]; let mut low = 0; for _ in 0..num_threads { let handle = std::thread::spawn(move || find_primes(low, low + 10000)); handles.push(handle); low += 10000; } let mut primes = vec![]; for handle in handles { let result = handle.join(); primes.extend(result.unwrap()) } primes } } #[library_benchmark] #[bench::two_threads(2)] fn bench_threads(num_threads: usize) -> Vec<u64> { black_box(my_lib::find_primes_multi_thread(num_threads)) } library_benchmark_group!(name = my_group; benchmarks = bench_threads); fn main() { main!( config = LibraryBenchmarkConfig::default() .output_format(OutputFormat::default() .show_intermediate(true) ); library_benchmark_groups = my_group ); }
Running this benchmark with cargo bench will present you with the following terminal output:
lib_bench_threads::my_group::bench_threads two_threads:2
## pid: 2097219 thread: 1 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 27305|N/A (*********)
L1 Hits: 66353|N/A (*********)
L2 Hits: 341|N/A (*********)
RAM Hits: 539|N/A (*********)
Total read+write: 67233|N/A (*********)
Estimated Cycles: 86923|N/A (*********)
## pid: 2097219 thread: 2 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 0|N/A (*********)
L1 Hits: 0|N/A (*********)
L2 Hits: 0|N/A (*********)
RAM Hits: 0|N/A (*********)
Total read+write: 0|N/A (*********)
Estimated Cycles: 0|N/A (*********)
## pid: 2097219 thread: 3 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 0|N/A (*********)
L1 Hits: 0|N/A (*********)
L2 Hits: 0|N/A (*********)
RAM Hits: 0|N/A (*********)
Total read+write: 0|N/A (*********)
Estimated Cycles: 0|N/A (*********)
## Total
Instructions: 27305|N/A (*********)
L1 Hits: 66353|N/A (*********)
L2 Hits: 341|N/A (*********)
RAM Hits: 539|N/A (*********)
Total read+write: 67233|N/A (*********)
Estimated Cycles: 86923|N/A (*********)
As you can see, the counts for the threads 2 and 3 (our spawned threads) are all zero.
Measuring threads using toggles
At first glance, setting a toggle to the function in the thread seems to be the easiest way and can be done like so:
extern crate iai_callgrind; mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }} use iai_callgrind::{ main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig, EntryPoint }; use std::hint::black_box; #[library_benchmark( config = LibraryBenchmarkConfig::default() .callgrind_args(["--toggle-collect=lib_bench_threads::my_lib::find_primes"]) )] #[bench::two_threads(2)] fn bench_threads(num_threads: usize) -> Vec<u64> { black_box(my_lib::find_primes_multi_thread(num_threads)) } library_benchmark_group!(name = my_group; benchmarks = bench_threads); fn main() { main!(library_benchmark_groups = my_group); }
This approach may or may not work, depending on whether the compiler inlines the target function of the --toggle-collect argument or not. This is the same problem as with custom entry points. As can be seen below, the compiler has chosen to inline find_primes and the metrics for the threads are still zero:
lib_bench_threads::my_group::bench_threads two_threads:2
## pid: 2620776 thread: 1 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 27372|N/A (*********)
L1 Hits: 66431|N/A (*********)
L2 Hits: 343|N/A (*********)
RAM Hits: 538|N/A (*********)
Total read+write: 67312|N/A (*********)
Estimated Cycles: 86976|N/A (*********)
## pid: 2620776 thread: 2 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 0|N/A (*********)
L1 Hits: 0|N/A (*********)
L2 Hits: 0|N/A (*********)
RAM Hits: 0|N/A (*********)
Total read+write: 0|N/A (*********)
Estimated Cycles: 0|N/A (*********)
## pid: 2620776 thread: 3 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 0|N/A (*********)
L1 Hits: 0|N/A (*********)
L2 Hits: 0|N/A (*********)
RAM Hits: 0|N/A (*********)
Total read+write: 0|N/A (*********)
Estimated Cycles: 0|N/A (*********)
## Total
Instructions: 27372|N/A (*********)
L1 Hits: 66431|N/A (*********)
L2 Hits: 343|N/A (*********)
RAM Hits: 538|N/A (*********)
Total read+write: 67312|N/A (*********)
Estimated Cycles: 86976|N/A (*********)
Just to show what would happen if the compiler does not inline the find_primes function, we temporarily annotate it with #[inline(never)]:
#![allow(unused)] fn main() { /// Find and return all prime numbers in the inclusive range `low` to `high` fn is_prime(_: u64) -> bool { true } #[inline(never)] pub fn find_primes(low: u64, high: u64) -> Vec<u64> { (low..=high).filter(|n| is_prime(*n)).collect() } }
Now, running the benchmark does show the desired metrics:
lib_bench_threads::my_group::bench_threads two_threads:2
## pid: 2661917 thread: 1 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 27372|N/A (*********)
L1 Hits: 66431|N/A (*********)
L2 Hits: 343|N/A (*********)
RAM Hits: 538|N/A (*********)
Total read+write: 67312|N/A (*********)
Estimated Cycles: 86976|N/A (*********)
## pid: 2661917 thread: 2 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 2460503|N/A (*********)
L1 Hits: 2534938|N/A (*********)
L2 Hits: 12|N/A (*********)
RAM Hits: 186|N/A (*********)
Total read+write: 2535136|N/A (*********)
Estimated Cycles: 2541508|N/A (*********)
## pid: 2661917 thread: 3 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 3650410|N/A (*********)
L1 Hits: 3724286|N/A (*********)
L2 Hits: 4|N/A (*********)
RAM Hits: 130|N/A (*********)
Total read+write: 3724420|N/A (*********)
Estimated Cycles: 3728856|N/A (*********)
## Total
Instructions: 6138285|N/A (*********)
L1 Hits: 6325655|N/A (*********)
L2 Hits: 359|N/A (*********)
RAM Hits: 854|N/A (*********)
Total read+write: 6326868|N/A (*********)
Estimated Cycles: 6357340|N/A (*********)
But, annotating functions with #[inline(never)]
in production code is usually
not an option and preventing the compiler from doing its job is not the
preferred way to make a benchmark work. The truth is, there is no way to make
the --toggle-collect
argument work for all cases and it heavily depends on the
choices of the compiler depending on your code.
Another way to get the thread metrics is to set --collect-atstart=yes
and turn
off the EntryPoint
:
extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] } }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig, EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
        .callgrind_args(["--collect-atstart=yes"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}

library_benchmark_group!(name = my_group; benchmarks = bench_threads);

fn main() {
    main!(library_benchmark_groups = my_group);
}
But, the metrics of the main thread will include all the setup (and teardown)
code from the benchmark executable (so the instructions of the main thread go up
from 27372
to 404425
):
lib_bench_threads::my_group::bench_threads two_threads:2
## pid: 2697019 thread: 1 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 404425|N/A (*********)
L1 Hits: 570186|N/A (*********)
L2 Hits: 1307|N/A (*********)
RAM Hits: 4856|N/A (*********)
Total read+write: 576349|N/A (*********)
Estimated Cycles: 746681|N/A (*********)
## pid: 2697019 thread: 2 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 2466864|N/A (*********)
L1 Hits: 2543314|N/A (*********)
L2 Hits: 81|N/A (*********)
RAM Hits: 409|N/A (*********)
Total read+write: 2543804|N/A (*********)
Estimated Cycles: 2558034|N/A (*********)
## pid: 2697019 thread: 3 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 3656729|N/A (*********)
L1 Hits: 3732802|N/A (*********)
L2 Hits: 31|N/A (*********)
RAM Hits: 201|N/A (*********)
Total read+write: 3733034|N/A (*********)
Estimated Cycles: 3739992|N/A (*********)
## Total
Instructions: 6528018|N/A (*********)
L1 Hits: 6846302|N/A (*********)
L2 Hits: 1419|N/A (*********)
RAM Hits: 5466|N/A (*********)
Total read+write: 6853187|N/A (*********)
Estimated Cycles: 7044707|N/A (*********)
Additionally, expect the metrics to change considerably whenever the benchmark itself is changed. However, if the metrics of the main thread are insignificant compared to the total, this might be an acceptable (last) choice.
There is another, more reliable way, shown in the next section.
Measuring threads using client requests
Perhaps the most reliable and flexible way to measure threads is to use client requests. The downside is that you have to put some benchmark code into your production code. But, if you followed the installation instructions for client requests, this additional code is only compiled in benchmarks, not into your final production-ready library.
Using the callgrind client request, we adjust the threads in the
find_primes_multi_thread
function like so:
#![allow(unused)]
fn main() {
    fn find_primes(_a: u64, _b: u64) -> Vec<u64> { vec![] }
    extern crate iai_callgrind;
    use iai_callgrind::client_requests::callgrind;

    /// Return the prime numbers in the range `0..(num_threads * 10000)`
    pub fn find_primes_multi_thread(num_threads: usize) -> Vec<u64> {
        let mut handles = vec![];
        let mut low = 0;
        for _ in 0..num_threads {
            let handle = std::thread::spawn(move || {
                callgrind::toggle_collect();
                let result = find_primes(low, low + 10000);
                callgrind::toggle_collect();
                result
            });
            handles.push(handle);
            low += 10000;
        }

        let mut primes = vec![];
        for handle in handles {
            let result = handle.join();
            primes.extend(result.unwrap())
        }
        primes
    }
}
and running the same benchmark now will show the collected metrics of the threads:
lib_bench_threads::my_group::bench_threads two_threads:2
## pid: 2149242 thread: 1 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 27305|N/A (*********)
L1 Hits: 66352|N/A (*********)
L2 Hits: 344|N/A (*********)
RAM Hits: 537|N/A (*********)
Total read+write: 67233|N/A (*********)
Estimated Cycles: 86867|N/A (*********)
## pid: 2149242 thread: 2 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 2460501|N/A (*********)
L1 Hits: 2534935|N/A (*********)
L2 Hits: 13|N/A (*********)
RAM Hits: 185|N/A (*********)
Total read+write: 2535133|N/A (*********)
Estimated Cycles: 2541475|N/A (*********)
## pid: 2149242 thread: 3 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 3650408|N/A (*********)
L1 Hits: 3724285|N/A (*********)
L2 Hits: 1|N/A (*********)
RAM Hits: 131|N/A (*********)
Total read+write: 3724417|N/A (*********)
Estimated Cycles: 3728875|N/A (*********)
## Total
Instructions: 6138214|N/A (*********)
L1 Hits: 6325572|N/A (*********)
L2 Hits: 358|N/A (*********)
RAM Hits: 853|N/A (*********)
Total read+write: 6326783|N/A (*********)
Estimated Cycles: 6357217|N/A (*********)
Using the client request toggles is very flexible since you can put the
iai_callgrind::client_requests::callgrind::toggle_collect
instructions
anywhere in the threads. In this example, we just have a single function in the
thread, but if your threads consist of more than just a single function, you can
easily exclude uninteresting parts from the final measurements.
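For example, if a thread does some preparatory work before the interesting computation, the toggles can be placed around the computation only. The following is a minimal sketch under that assumption; prepare_data and compute are hypothetical helpers and only illustrate where the toggles go:

extern crate iai_callgrind;
use iai_callgrind::client_requests::callgrind;

// Hypothetical helpers, stand-ins for your real thread code
fn prepare_data() -> Vec<u64> { (0..1000).collect() }
fn compute(data: &[u64]) -> u64 { data.iter().sum() }

pub fn compute_in_thread() -> u64 {
    std::thread::spawn(|| {
        // Not measured: the preparatory work of this thread
        let data = prepare_data();

        // Only the interesting part is wrapped in the toggles
        callgrind::toggle_collect();
        let result = compute(&data);
        callgrind::toggle_collect();

        result
    })
    .join()
    .unwrap()
}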
If you want to prevent the code of the main thread from being measured, you can use the following:
extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] } }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig, EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
        .callgrind_args(["--collect-atstart=no"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}

library_benchmark_group!(name = my_group; benchmarks = bench_threads);

fn main() {
    main!(library_benchmark_groups = my_group);
}
Setting the EntryPoint::None
disables the default toggle but also
--collect-atstart=no
, which is why we have to set the option manually.
Altogether, running the benchmark will show:
lib_bench_threads::my_group::bench_threads two_threads:2
## pid: 2251257 thread: 1 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 0|N/A (*********)
L1 Hits: 0|N/A (*********)
L2 Hits: 0|N/A (*********)
RAM Hits: 0|N/A (*********)
Total read+write: 0|N/A (*********)
Estimated Cycles: 0|N/A (*********)
## pid: 2251257 thread: 2 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 2460501|N/A (*********)
L1 Hits: 2534935|N/A (*********)
L2 Hits: 11|N/A (*********)
RAM Hits: 187|N/A (*********)
Total read+write: 2535133|N/A (*********)
Estimated Cycles: 2541535|N/A (*********)
## pid: 2251257 thread: 3 part: 1 |N/A
Command: target/release/deps/lib_bench_threads-b85159a94ccb3851
Instructions: 3650408|N/A (*********)
L1 Hits: 3724282|N/A (*********)
L2 Hits: 4|N/A (*********)
RAM Hits: 131|N/A (*********)
Total read+write: 3724417|N/A (*********)
Estimated Cycles: 3728887|N/A (*********)
## Total
Instructions: 6110909|N/A (*********)
L1 Hits: 6259217|N/A (*********)
L2 Hits: 15|N/A (*********)
RAM Hits: 318|N/A (*********)
Total read+write: 6259550|N/A (*********)
Estimated Cycles: 6270422|N/A (*********)
Multi-process applications
Measuring multi-process applications is in principle not that different from measuring multi-threaded applications, since subprocesses, just like threads, are separate units. As with threads, the data collection options are applied to subprocesses separately from the main process.
Note there are multiple valgrind command-line arguments that can disable the collection of metrics for uninteresting subprocesses, for example subprocesses that are spawned by your library function but are not part of your library/binary crate.
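For example, valgrind's --trace-children-skip option takes a comma-separated list of glob patterns; matching subprocesses are not traced at all. The following is a minimal sketch of passing such an argument (the pattern */sh is only an illustrative assumption, e.g. for a shell spawned by the benchmarked function):

#![allow(unused)]
fn main() {
    extern crate iai_callgrind;
    use iai_callgrind::LibraryBenchmarkConfig;

    // Do not trace (and thus not measure) child processes whose executable
    // path matches the given pattern
    LibraryBenchmarkConfig::default()
        .callgrind_args(["--trace-children-skip=*/sh"]);
}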
For the following examples suppose the code below is the cat
binary and part
of a crate (so we can use
env!("CARGO_BIN_EXE_cat")
):
use std::fs::File;
use std::io::{copy, stdout, BufReader, BufWriter, Write};

fn main() {
    let mut args_iter = std::env::args().skip(1);
    let file_arg = args_iter.next().expect("File argument should be present");
    let file = File::open(file_arg).expect("Opening file should succeed");

    let stdout = stdout().lock();
    let mut writer = BufWriter::new(stdout);
    copy(&mut BufReader::new(file), &mut writer)
        .expect("Printing file to stdout should succeed");
    writer.flush().expect("Flushing writer should succeed");
}
The above binary is a very simple version of cat
taking a single file
argument. The file content is read and dumped to the stdout
. The following is
the benchmark and library code to show the different options assuming this code
is stored in a benchmark file benches/lib_bench_subprocess.rs:
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;

use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, OutputFormat,
};

/// Suppose this is your library
pub mod my_lib {
    use std::io;
    use std::path::Path;
    use std::process::ExitStatus;

    /// A function executing the crate's binary `cat`
    pub fn cat(file: &Path) -> io::Result<ExitStatus> {
        std::process::Command::new(env!("CARGO_BIN_EXE_cat"))
            .arg(file)
            .status()
    }
}

/// Create a file `/tmp/foo.txt` with some content
fn create_file() -> PathBuf {
    let path = PathBuf::from("/tmp/foo.txt");
    std::fs::write(&path, "some content").unwrap();
    path
}

#[library_benchmark]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}

library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);

fn main() {
    main!(
        config = LibraryBenchmarkConfig::default()
            .output_format(OutputFormat::default()
                .show_intermediate(true)
            );
        library_benchmark_groups = my_group
    );
}
Running the above benchmark with cargo bench
results in the following terminal
output:
lib_bench_subprocess::my_group::bench_subprocess some:create_file()
## pid: 3141785 thread: 1 part: 1 |N/A
Command: target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
Instructions: 4467|N/A (*********)
L1 Hits: 6102|N/A (*********)
L2 Hits: 17|N/A (*********)
RAM Hits: 186|N/A (*********)
Total read+write: 6305|N/A (*********)
Estimated Cycles: 12697|N/A (*********)
## pid: 3141786 thread: 1 part: 1 |N/A
Command: target/release/cat /tmp/foo.txt
Instructions: 0|N/A (*********)
L1 Hits: 0|N/A (*********)
L2 Hits: 0|N/A (*********)
RAM Hits: 0|N/A (*********)
Total read+write: 0|N/A (*********)
Estimated Cycles: 0|N/A (*********)
## Total
Instructions: 4467|N/A (*********)
L1 Hits: 6102|N/A (*********)
L2 Hits: 17|N/A (*********)
RAM Hits: 186|N/A (*********)
Total read+write: 6305|N/A (*********)
Estimated Cycles: 12697|N/A (*********)
As expected, the cat
subprocess is not measured and the metrics are zero for
the same reasons as the initial measurement of threads.
Measuring subprocesses using toggles
The great advantage over measuring threads is that each process has a main
function that is not inlined by the compiler and can serve as a reliable hook
for the --toggle-collect
argument, so the following adaptation of the above
benchmark will just work:
extern crate iai_callgrind;
mod my_lib {
    use std::{io, path::Path, process::ExitStatus};
    pub fn cat(_: &Path) -> io::Result<ExitStatus> { std::process::Command::new("some").status() }
}
fn create_file() -> PathBuf { PathBuf::from("some") }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;

use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, OutputFormat,
};

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--toggle-collect=cat::main"])
)]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}

library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);

fn main() {
    main!(library_benchmark_groups = my_group);
}
producing the desired output:
lib_bench_subprocess::my_group::bench_subprocess some:create_file()
## pid: 3324117 thread: 1 part: 1 |N/A
Command: target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
Instructions: 4475|N/A (*********)
L1 Hits: 6112|N/A (*********)
L2 Hits: 14|N/A (*********)
RAM Hits: 187|N/A (*********)
Total read+write: 6313|N/A (*********)
Estimated Cycles: 12727|N/A (*********)
## pid: 3324119 thread: 1 part: 1 |N/A
Command: target/release/cat /tmp/foo.txt
Instructions: 4019|N/A (*********)
L1 Hits: 5575|N/A (*********)
L2 Hits: 12|N/A (*********)
RAM Hits: 167|N/A (*********)
Total read+write: 5754|N/A (*********)
Estimated Cycles: 11480|N/A (*********)
## Total
Instructions: 8494|N/A (*********)
L1 Hits: 11687|N/A (*********)
L2 Hits: 26|N/A (*********)
RAM Hits: 354|N/A (*********)
Total read+write: 12067|N/A (*********)
Estimated Cycles: 24207|N/A (*********)
Measuring subprocesses using client requests
Naturally, client requests can also be used to measure subprocesses. The
callgrind client requests are added to the code of the cat
binary:
extern crate iai_callgrind;
use std::fs::File;
use std::io::{copy, stdout, BufReader, BufWriter, Write};

use iai_callgrind::client_requests::callgrind;

fn main() {
    let mut args_iter = std::env::args().skip(1);
    let file_arg = args_iter.next().expect("File argument should be present");

    callgrind::toggle_collect();
    let file = File::open(file_arg).expect("Opening file should succeed");

    let stdout = stdout().lock();
    let mut writer = BufWriter::new(stdout);
    copy(&mut BufReader::new(file), &mut writer)
        .expect("Printing file to stdout should succeed");
    writer.flush().expect("Flushing writer should succeed");
    callgrind::toggle_collect();
}
For the purpose of this example we decided that measuring the parsing of the command-line arguments is not interesting for us and excluded it from the collected metrics. The benchmark itself is reverted to its original state without the toggle:
extern crate iai_callgrind;
mod my_lib {
    use std::{io, path::Path, process::ExitStatus};
    pub fn cat(_: &Path) -> io::Result<ExitStatus> { std::process::Command::new("some").status() }
}
fn create_file() -> PathBuf { PathBuf::from("some") }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;

use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, OutputFormat,
};

#[library_benchmark]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}

library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);

fn main() {
    main!(library_benchmark_groups = my_group);
}
Now, running the benchmark shows
lib_bench_subprocess::my_group::bench_subprocess some:create_file()
## pid: 3421822 thread: 1 part: 1 |N/A
Command: target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
Instructions: 4467|N/A (*********)
L1 Hits: 6102|N/A (*********)
L2 Hits: 17|N/A (*********)
RAM Hits: 186|N/A (*********)
Total read+write: 6305|N/A (*********)
Estimated Cycles: 12697|N/A (*********)
## pid: 3421823 thread: 1 part: 1 |N/A
Command: target/release/cat /tmp/foo.txt
Instructions: 2429|N/A (*********)
L1 Hits: 3406|N/A (*********)
L2 Hits: 8|N/A (*********)
RAM Hits: 138|N/A (*********)
Total read+write: 3552|N/A (*********)
Estimated Cycles: 8276|N/A (*********)
## Total
Instructions: 6896|N/A (*********)
L1 Hits: 9508|N/A (*********)
L2 Hits: 25|N/A (*********)
RAM Hits: 324|N/A (*********)
Total read+write: 9857|N/A (*********)
Estimated Cycles: 20973|N/A (*********)
As expected, the metrics for the cat
binary are a little bit lower since we
skipped measuring the parsing of the command-line arguments.
Even more Examples
I'm referring here to the github repository. We test the library benchmarks functionality of Iai-Callgrind with system tests in the private benchmark-tests package.
Each system test there can serve you as an example, but for a fully documented and commented one see here.
Binary Benchmarks
You want to start benchmarking your crate's binary? Best start with the Quickstart section.
Setting up binary benchmarks is very similar to library benchmarks, and it's a good idea to have a look at the library benchmark section of this guide, too.
You may then come back to the binary benchmarks section and go through the differences.
If you need more examples see here.
Important default behaviour
As in library benchmarks, the environment variables are cleared before running a binary benchmark. Have a look at the Configuration section if you want to change this behavior. Iai-Callgrind's defaults sometimes deviate from the valgrind defaults:
| Iai-Callgrind | Valgrind (v3.23) |
|---|---|
| --trace-children=yes | --trace-children=no |
| --fair-sched=try | --fair-sched=no |
| --separate-threads=yes | --separate-threads=no |
| --cache-sim=yes | --cache-sim=no |
As shown in the table above, the benchmarks run with cache simulation switched on. This adds run time to each benchmark. If you don't need the cache metrics and the estimation of cycles, you can easily switch cache simulation off, for example with:
#![allow(unused)]
fn main() {
    extern crate iai_callgrind;
    use iai_callgrind::BinaryBenchmarkConfig;

    BinaryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
}
To switch off cache simulation for all benchmarks in the same file:
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig};

#[binary_benchmark]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(
        config = BinaryBenchmarkConfig::default().callgrind_args(["--cache-sim=no"]);
        binary_benchmark_groups = my_group
    );
}
Quickstart
Suppose the crate's binary is called my-foo
and this binary takes a file path
as positional argument. This first example shows the basic usage of the
high-level api with the #[binary_benchmark]
attribute:
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};

#[binary_benchmark]
#[bench::some_id("foo.txt")]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(
    name = my_group;
    benchmarks = bench_binary
);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
If you want to try out this example with your crate's binary, put the above code
into a file in $WORKSPACE_ROOT/benches/binary_benchmark.rs
. Next, replace
my-foo
in env!("CARGO_BIN_EXE_my-foo")
with the name of a binary of your
crate.
Note the env!
macro is a rust
builtin macro and CARGO_BIN_EXE_<name>
is documented
here.
You should always use env!("CARGO_BIN_EXE_<name>")
to determine the path to
the binary of your crate. Do not use relative paths like target/release/my-foo
since this might break your benchmarks in many ways. The environment variable
does exactly the right thing and the usage is short and simple.
Lastly, adjust the argument of the Command
and add the following to your
Cargo.toml
:
[[bench]]
name = "binary_benchmark"
harness = false
Running
cargo bench
presents you with something like the following:
binary_benchmark::my_group::bench_binary some_id:("foo.txt") -> target/release/my-foo foo.txt
Instructions: 342129|N/A (*********)
L1 Hits: 457370|N/A (*********)
L2 Hits: 734|N/A (*********)
RAM Hits: 4096|N/A (*********)
Total read+write: 462200|N/A (*********)
Estimated Cycles: 604400|N/A (*********)
As opposed to library benchmarks, binary benchmarks have access to a low-level api. Here is pretty much the same as the above high-level usage, but written with the low-level api:
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{BinaryBenchmark, Bench, binary_benchmark_group, main};

binary_benchmark_group!(
    name = my_group;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group.binary_benchmark(BinaryBenchmark::new("bench_binary")
            .bench(Bench::new("some_id")
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("foo.txt")
                    .build()
                )
            )
        )
    }
);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
If in doubt, use the high-level api. You can still migrate to the low-level api very easily if you really need to. The other way around is more involved.
Differences to library benchmarks
In this section we're going through the differences to library benchmarks. This assumes that you already know how to set up library benchmarks, and it is recommended to learn the very basics about library benchmarks, starting with Quickstart, Anatomy of a library benchmark and The macros in more detail. Then come back to this section.
Name changes
Coming from library benchmarks, names containing library change to the same name with library replaced by binary: the #[library_benchmark] attribute becomes #[binary_benchmark], library_benchmark_group! becomes binary_benchmark_group!, the config arguments take a BinaryBenchmarkConfig instead of a LibraryBenchmarkConfig, and so on.
A quick reference of available macros in binary benchmarks:
- #[binary_benchmark] and its inner attributes #[bench] and #[benches]: The exact counterpart to the #[library_benchmark] attribute macro.
- binary_benchmark_group!: Just the name of the macro has changed.
- binary_benchmark_attribute!: An additional macro if you intend to migrate from the high-level to the low-level api.
- main!: The same macro as in library benchmarks but the name of the library_benchmark_groups parameter changed to binary_benchmark_groups.
To see all macros in action have a look at the example below.
The return value of the benchmark function
Perhaps the most important difference is that the #[binary_benchmark] annotated function always needs to return an iai_callgrind::Command. Note this function builds the command which is going to be benchmarked but doesn't execute it yet. So, the code in this function does not contribute to the event counts of the actual benchmark.
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};
use std::path::PathBuf;

#[binary_benchmark]
#[bench::foo("foo.txt")]
#[bench::bar("bar.json")]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    // We can put any code in this function which is needed to configure and
    // build the `Command`.
    let path = PathBuf::from(path);

    // Here, if the `path` ends with `.txt` we want to see
    // the `Stdout` output of the `Command` in the benchmark output. In all other
    // cases, the `Stdout` of the `Command` is redirected to a `File` with the
    // same name as the input `path` but with the extension `out`.
    let stdout = if path.extension().unwrap() == "txt" {
        iai_callgrind::Stdio::Inherit
    } else {
        iai_callgrind::Stdio::File(path.with_extension("out"))
    };

    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .stdout(stdout)
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
setup
and teardown
Since we can put any code building the Command
in the function itself, the
setup
and teardown
of #[binary_benchmark]
, #[bench]
and #[benches]
work differently.
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};

fn create_file() {
    std::fs::write("foo.txt", "some content").unwrap();
}

#[binary_benchmark]
#[bench::foo(args = ("foo.txt"), setup = create_file())]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
setup
, which is here the expression create_file()
, is not evaluated right
away and the return value of setup
is not used as input for the function
!
Instead, the expression in setup
is evaluated and executed just before
the benchmarked Command
is executed. Similarly, teardown
is executed
after the Command
is executed.
In the example above, setup always creates the same file and is pretty static. It's possible to use the same arguments for setup (and teardown) and the benchmark function by specifying the path to a function (a function pointer), as you're used to from library benchmarks:
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main};

fn create_file(path: &str) {
    std::fs::write(path, "some content").unwrap();
}

fn delete_file(path: &str) {
    std::fs::remove_file(path).unwrap();
}

#[binary_benchmark]
// Note the missing parentheses for `setup` of the function `create_file` which
// tells Iai-Callgrind to pass the `args` to the `setup` function AND the
// function `bench_binary`
#[bench::foo(args = ("foo.txt"), setup = create_file)]
// Same for `teardown`
#[bench::bar(args = ("bar.txt"), setup = create_file, teardown = delete_file)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
The Command's stdin and simulating piped input
The behaviour of Stdin
of the Command
can be changed, almost the same way as
the Stdin
of a std::process::Command
with the only difference that we use
the enums iai_callgrind::Stdin
and iai_callgrind::Stdio
. These enums provide
the variants Inherit
(the equivalent of std::process::Stdio::inherit
),
Pipe
(the equivalent of std::process::Stdio::piped
) and so on. There's also
File
which takes a PathBuf
to the file which is used as Stdin
for the
Command
. This corresponds to a redirection in the shell as in my-foo < path/to/file
.
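For example, redirecting a file into the Command's Stdin could look like the following sketch. It assumes that the crate's binary my-foo reads from its stdin, that a file benches/input.txt exists, and that the File variant described above is used:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use std::path::PathBuf;
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main, Stdin};

#[binary_benchmark]
fn bench_binary() -> iai_callgrind::Command {
    // The equivalent of the shell redirection `my-foo < benches/input.txt`
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .stdin(Stdin::File(PathBuf::from("benches/input.txt")))
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(binary_benchmark_groups = my_group);
}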
Moreover, iai_callgrind::Stdin
provides the Stdin::Setup
variant specific to
Iai-Callgrind:
Applications may change their behaviour if the input or the Stdin
of the
Command
is coming from a pipe as in echo "some content" | my-foo
. To be able
to benchmark such cases, it is possible to use the output of setup
to Stdout
or Stderr
as Stdin
for the Command
.
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark, binary_benchmark_group, main, Stdin, Pipe};

fn setup_pipe() {
    println!(
        "The output to `Stdout` here will be the input or `Stdin` of the `Command`"
    );
}

#[binary_benchmark]
#[bench::foo(setup = setup_pipe())]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .stdin(Stdin::Setup(Pipe::Stdout))
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
Usually, setup
then the Command
and then teardown
are executed
sequentially, each waiting for the previous process to exit successfully (See
also Configure the exit code of the Command). If
the Command::stdin
changes to Stdin::Setup
, setup
and the Command
are
executed in parallel and Iai-Callgrind waits first for the Command
to exit,
then setup
. After the successful exit of setup
, teardown
is executed.
Since setup
and Command
are run in parallel if Stdin::Setup
is used, it is
sometimes necessary to delay the execution of the Command
. Please see the
delay
chapter for more details.
Configuration
The configuration of binary benchmarks works the same way as in library
benchmarks with the name changing from LibraryBenchmarkConfig
to
BinaryBenchmarkConfig
. Please see
there for the basics. However, binary
benchmarks have some additional configuration possibilities:
Delay the Command
Delaying the execution of the Command
with Command::delay
might be necessary
if the setup
is executed in parallel either with Command::setup_parallel
or
Command::stdin
set to Stdin::Setup
.
For example, if you have a server which needs to be started in the setup to be
able to benchmark a client (in our example a crate's binary simply named
client
):
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use std::net::{SocketAddr, TcpListener};
use std::time::Duration;
use std::thread;

use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, Delay, DelayKind
};

const ADDRESS: &str = "127.0.0.1:31000";

fn setup_tcp_server() {
    println!("Waiting to start server...");
    thread::sleep(Duration::from_millis(300));

    println!("Starting server...");
    let listener = TcpListener::bind(
        ADDRESS.parse::<SocketAddr>().unwrap()
    ).unwrap();

    thread::sleep(Duration::from_secs(1));

    drop(listener);
    println!("Stopped server...");
}

#[binary_benchmark(setup = setup_tcp_server())]
fn bench_client() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_client"))
        .setup_parallel(true)
        .delay(
            Delay::new(DelayKind::TcpConnect(
                ADDRESS.parse::<SocketAddr>().unwrap(),
            ))
            .timeout(Duration::from_millis(500)),
        )
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_client);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
The server is started in the parallel setup function setup_tcp_server
since
Command::setup_parallel
is set to true. The delay of the Command
is
configured with Delay
in Command::delay
to wait for the tcp connection to be
available. We also applied a timeout of 500
milliseconds with
Delay::timeout
, so if something goes wrong in the server and the tcp
connection cannot be established, the benchmark exits with an error after 500
milliseconds instead of hanging forever. After the successful delay, the actual
client is executed and benchmarked. After the client exits, Iai-Callgrind waits for the setup to exit successfully. Then, if present, the teardown function is executed.
Please see the library documentation for all possible DelayKind
s and more
details on the Delay
.
Sandbox
The
Sandbox
is a temporary directory which is created before the execution of the setup
and deleted after the teardown
. setup
, the Command
and teardown
are
executed inside this temporary directory. This merely describes the order of execution; the setup and teardown don't have to be present.
Why use a Sandbox?
A Sandbox
can help mitigate differences in benchmark results on different
machines. As long as $TMP_DIR
is unset or set to /tmp
, the temporary
directory has a constant length on unix machines (except android
which uses /data/local/tmp
). The directory itself is created with a constant
length but random name like /tmp/.a23sr8fk
.
It is not implausible that an executable has different event counts just because
the directory it is executed in has a different length. For example, if a member
of your project has set up the project in /home/bob/workspace/our-project
running the benchmarks in this directory, and the ci runs the benchmarks in
/runner/our-project
, the event counts might differ. If possible, the
benchmarks should be run in a constant environment. Clearing the environment variables, for example, is another such measure.
Other good reasons for using a Sandbox
are convenience, e.g. if you create
files during the setup
and Command
run and do not want to delete all files
manually. Or, maybe more importantly, if the Command
is destructive and
deletes files, it is usually safer to run such a Command
in a temporary
directory where it cannot cause damage to your or other file systems.
The Sandbox
is deleted after the benchmark, regardless of whether the
benchmark run was successful or not. The latter is not guaranteed if you only
rely on teardown
, as teardown
is only executed if the Command
returns
without error.
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, Sandbox
};

fn create_file(path: &str) {
    std::fs::write(path, "some content").unwrap();
}

#[binary_benchmark]
#[bench::foo(
    args = ("foo.txt"),
    config = BinaryBenchmarkConfig::default().sandbox(Sandbox::new(true)),
    setup = create_file
)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
In this example, as part of the setup
, the create_file
function with the
argument foo.txt
is executed in the Sandbox
before the Command
is
executed. The Command
is executed in the same Sandbox
and therefore the file
foo.txt
with the content some content
exists thanks to the setup
. After
the execution of the Command
, the Sandbox
is completely removed, deleting
all files created during setup
, the Command
execution (and teardown
if it
had been present in this example).
Since setup
is run in the sandbox, you can't copy fixtures from your project's
workspace into the sandbox that easily anymore. The Sandbox
can be configured
to copy fixtures
into the temporary directory with Sandbox::fixtures
:
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, Sandbox
};

#[binary_benchmark]
#[bench::foo(
    args = ("foo.txt"),
    config = BinaryBenchmarkConfig::default()
        .sandbox(Sandbox::new(true)
            .fixtures(["benches/foo.txt"])),
)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
The above will copy the fixture file foo.txt
in the benches
directory into
the sandbox root as foo.txt
. Relative paths in Sandbox::fixtures
are
interpreted relative to the workspace root. In a multi-crate workspace this is
the directory with the top-level Cargo.toml
file. Paths in Sandbox::fixtures
are not limited to files, they can be directories, too.
If you have more complex demands, you can access the workspace root via the
environment variable _WORKSPACE_ROOT
in setup
and teardown
. Suppose, there
is a fixture located in /home/the_project/foo_crate/benches/fixtures/foo.txt
with the_project
being the workspace root and foo_crate
a workspace member
with the my-foo
executable. If the command is expected to create a file
bar.json
, which needs further inspection after the benchmarks have run, let's
copy it into a temporary directory tmp
(which may or may not exist) in
foo_crate
:
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, Sandbox
};
use std::path::PathBuf;

fn copy_fixture(path: &str) {
    let workspace_root = PathBuf::from(std::env::var_os("_WORKSPACE_ROOT").unwrap());
    std::fs::copy(
        workspace_root.join("foo_crate").join("benches").join("fixtures").join(path),
        path
    );
}

// This function will fail if `bar.json` does not exist, which is fine as this
// file is expected to be created by `my-foo`. So, if this file does not exist,
// an error will occur and the benchmark will fail. Although benchmarks are not
// expected to test the correctness of the application, the `teardown` can be
// used to check postconditions for a successful command run.
fn copy_back(path: &str) {
    let workspace_root = PathBuf::from(std::env::var_os("_WORKSPACE_ROOT").unwrap());
    let dest_dir = workspace_root.join("foo_crate").join("tmp");
    if !dest_dir.exists() {
        std::fs::create_dir(&dest_dir).unwrap();
    }
    std::fs::copy(path, dest_dir.join(path));
}

#[binary_benchmark]
#[bench::foo(
    args = ("foo.txt"),
    config = BinaryBenchmarkConfig::default().sandbox(Sandbox::new(true)),
    setup = copy_fixture,
    teardown = copy_back("bar.json")
)]
fn bench_binary(path: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
        .arg(path)
        .build()
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
Configure the exit code of the Command
Usually, if a Command
exits with a non-zero exit code, the whole benchmark run
fails and stops. If the exit code of the benchmarked Command is expected to differ from 0, the expected exit code can be set in
BinaryBenchmarkConfig::exit_with
or Command::exit_with
:
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_group, main, BinaryBenchmarkConfig, ExitWith
};

#[binary_benchmark]
// Here, we set the expected exit code of `my-foo` to 2
#[bench::exit_with_2(
    config = BinaryBenchmarkConfig::default().exit_with(ExitWith::Code(2))
)]
// Here, we don't know the exact exit code but know it is different from 0 (=success)
#[bench::exit_with_failure(
    config = BinaryBenchmarkConfig::default().exit_with(ExitWith::Failure)
)]
fn bench_binary() -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
}

binary_benchmark_group!(name = my_group; benchmarks = bench_binary);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
Low-level api
I'm not going into full detail of the low-level api here since it is fully documented in the api Documentation.
The basic structure
The entry point of the low-level api is the binary_benchmark_group! macro:
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_attribute, binary_benchmark_group, main,
    BinaryBenchmark, Bench
};

binary_benchmark_group!(
    name = my_group;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group.binary_benchmark(BinaryBenchmark::new("bench_binary")
            .bench(Bench::new("some_id")
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("foo.txt")
                    .build()
                )
            )
        )
    }
);

fn main() {
    main!(binary_benchmark_groups = my_group);
}
The low-level api mirrors the high-level api, "structifying" the macros.
The binary_benchmark_group! macro now corresponds to a struct, the BinaryBenchmarkGroup. It cannot be instantiated directly. Instead, it is passed as an argument to the expression
of the benchmarks
parameter in a binary_benchmark_group
. You can choose any
name instead of group
, we just have used group
throughout the examples.
There's the shorter benchmarks = |group| /* ... */
instead of benchmarks = |group: &mut BinaryBenchmarkGroup| /* ... */
. We use the more verbose variant
in the examples because it is more informative for benchmarking starters.
Furthermore, the #[binary_benchmark] macro correlates with iai_callgrind::BinaryBenchmark and #[bench] with iai_callgrind::Bench. The parameters of the macros are now functions of the respective structs. The return value of the benchmark function, the iai_callgrind::Command, now also corresponds to a function, iai_callgrind::Bench::command.
Note there is no iai_callgrind::Benches struct since specifying multiple commands with iai_callgrind::Bench::command behaves exactly the same way as the #[benches] attribute. So, the file parameter of #[benches] is a part of iai_callgrind::Bench and can be used with the iai_callgrind::Bench::file function.
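As a minimal sketch of the paragraph above (assuming that Bench::command can simply be called multiple times), specifying two commands on a single Bench, comparable to a #[benches] attribute with two arguments, could look like this:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{binary_benchmark_group, main, BinaryBenchmark, Bench};

binary_benchmark_group!(
    name = my_group;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group.binary_benchmark(BinaryBenchmark::new("bench_binary")
            .bench(Bench::new("multiple")
                // Each command is benchmarked separately, just like the
                // multiple arguments of a `#[benches]` attribute
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("foo.txt")
                    .build())
                .command(iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-foo"))
                    .arg("bar.txt")
                    .build())
            )
        )
    }
);

fn main() {
    main!(binary_benchmark_groups = my_group);
}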
Intermixing high-level and low-level api
It is recommended to start with the high-level api using the
#[binary_benchmark]
attribute, since you can fall back to the low-level api in
a few steps with the binary_benchmark_attribute!
macro as shown below. The
other way around is much more involved.
extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use iai_callgrind::{
    binary_benchmark, binary_benchmark_attribute, binary_benchmark_group, main,
    BinaryBenchmark, Bench
};

#[binary_benchmark]
#[bench::some_id("foo")]
fn attribute_benchmark(arg: &str) -> iai_callgrind::Command {
    iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-binary"))
        .arg(arg)
        .build()
}

binary_benchmark_group!(
    name = low_level;
    benchmarks = |group: &mut BinaryBenchmarkGroup| {
        group
            .binary_benchmark(binary_benchmark_attribute!(attribute_benchmark))
            .binary_benchmark(
                BinaryBenchmark::new("low_level_benchmark")
                    .bench(
                        Bench::new("some_id").command(
                            iai_callgrind::Command::new(env!("CARGO_BIN_EXE_my-binary"))
                                .arg("bar")
                                .build()
                        )
                    )
            )
    }
);

fn main() {
    main!(binary_benchmark_groups = low_level);
}
As shown above, there's no need to transcribe the function attribute_benchmark
with the #[binary_benchmark]
attribute into the low-level api structures. Just
keep it as it is and add it to the group with group.binary_benchmark(binary_benchmark_attribute!(attribute_benchmark))
.
That's it! You can continue hacking on your benchmarks in the low-level api.
More examples needed?
As in library benchmarks, I'm referring here to the github repository. The binary benchmarks functionality of Iai-Callgrind is tested with system tests in the private benchmark-tests package.
Each system test there can serve you as an example, but for a fully documented and commented one see here.
Performance Regressions
With Iai-Callgrind you can define limits for each event kind over which a performance regression is assumed. By default, Iai-Callgrind does not perform regression checks; you have to opt in with a RegressionConfig at benchmark level in a LibraryBenchmarkConfig or BinaryBenchmarkConfig, or at a global level with command-line arguments or environment variables.
Define a performance regression
A performance regression check consists of an EventKind and a percentage limit. If the percentage is negative, then a regression is assumed if the value falls below this limit. The default EventKind is EventKind::Ir with a limit of +10%.
For example, in a library benchmark, define a limit of +5% for the total instructions executed (the Ir event kind) in all benchmarks of this file:
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    RegressionConfig, EventKind
};
use std::hint::black_box;

#[library_benchmark]
fn bench_library() -> Vec<i32> {
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
    main!(
        config = LibraryBenchmarkConfig::default()
            .regression(
                RegressionConfig::default()
                    .limits([(EventKind::Ir, 5.0)])
            );
        library_benchmark_groups = my_group
    );
}
Now, if the comparison of the Ir
events of the current bench_library
benchmark run with the previous run results in an increase of over 5%, the
benchmark fails. Please, also have a look at the api docs
for further configuration options.
Running the benchmark from above the first time results in the following output:
my_benchmark::my_group::bench_library
Instructions: 215|N/A (*********)
L1 Hits: 288|N/A (*********)
L2 Hits: 0|N/A (*********)
RAM Hits: 7|N/A (*********)
Total read+write: 295|N/A (*********)
Estimated Cycles: 533|N/A (*********)
Let's assume there's a change in my_lib::bubble_sort which has increased the instruction counts. Running the benchmark again then results in output similar to this:
my_benchmark::my_group::bench_library
Instructions: 281|215 (+30.6977%) [+1.30698x]
L1 Hits: 374|288 (+29.8611%) [+1.29861x]
L2 Hits: 0|0 (No change)
RAM Hits: 8|7 (+14.2857%) [+1.14286x]
Total read+write: 382|295 (+29.4915%) [+1.29492x]
Estimated Cycles: 654|533 (+22.7017%) [+1.22702x]
Performance has regressed: Instructions (281 > 215) regressed by +30.6977% (>+5.00000)
iai_callgrind_runner: Error: Performance has regressed.
error: bench failed, to rerun pass `-p the-crate --bench my_benchmark`
Caused by:
process didn't exit successfully: `/path/to/your/project/target/release/deps/my_benchmark-a9b36fec444944bd --bench` (exit status: 1)
error: Recipe `bench-test` failed on line 175 with exit code 1
Which event to choose to measure performance regressions?
If in doubt, the definitive answer is Ir (instructions executed). If Ir event counts decrease noticeably, the function (binary) runs faster. The inverse statement is also true: if the Ir counts increase noticeably, there's a slowdown of the function (binary).
These statements are not so easy to transfer to Estimated Cycles
and the other
event counts. But, depending on the scenario and the function (binary) under
test, it can be reasonable to define more regression checks.
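For example, a sketch of a configuration checking both Ir and EstimatedCycles at the same time (the limits of +5% and +10% are arbitrarily chosen):

#![allow(unused)]
fn main() {
    extern crate iai_callgrind;
    use iai_callgrind::{EventKind, LibraryBenchmarkConfig, RegressionConfig};

    // Fail if the executed instructions increase by more than 5% or the
    // estimated cycles by more than 10%
    LibraryBenchmarkConfig::default()
        .regression(
            RegressionConfig::default()
                .limits([(EventKind::Ir, 5.0), (EventKind::EstimatedCycles, 10.0)])
        );
}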
Who actually uses instructions to measure performance?
The ones known to the author of this humble guide are
- SQLite: They use mainly cpu instructions to measure performance improvements (and regressions).
- Also in benchmarks of the rustc compiler, instruction counts play an important role, but they also use cache metrics and cycles.
If you know of others, please feel free to add them to this list.
Other Valgrind Tools
In addition to the default benchmarks, you can use the Iai-Callgrind framework
to run other Valgrind profiling Tool
s like DHAT
, Massif
and the
experimental BBV
but also Memcheck
, Helgrind
and DRD
if you need to
check memory and thread safety of benchmarked code. See also the Valgrind User
Manual for more details and
command line arguments. The additional tools can be specified in a
LibraryBenchmarkConfig
or BinaryBenchmarkConfig
. For example to run DHAT
for all library benchmarks in addition to Callgrind
:
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, Tool,
    ValgrindTool
};
use std::hint::black_box;

#[library_benchmark]
fn bench_library() -> Vec<i32> {
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
    main!(
        config = LibraryBenchmarkConfig::default()
            .tool(Tool::new(ValgrindTool::DHAT));
        library_benchmark_groups = my_group
    );
}
All tools which produce an ERROR SUMMARY
(Memcheck, DRD, Helgrind)
have
--error-exitcode=201
set, so if there are any errors, the benchmark run fails with 201
. You can
overwrite this default with
#![allow(unused)]
fn main() {
    extern crate iai_callgrind;
    use iai_callgrind::{Tool, ValgrindTool};

    Tool::new(ValgrindTool::Memcheck).args(["--error-exitcode=0"]);
}
which would restore the default of 0
from valgrind.
Valgrind Client Requests
Iai-Callgrind ships with its own interface to Valgrind's Client Request Mechanism. Iai-Callgrind's client requests have zero overhead (relative to the "C" implementation of Valgrind) on many targets which are also natively supported by valgrind. In short, Iai-Callgrind provides a complete and performant implementation of Valgrind Client Requests.
Installation
Client requests are deactivated by default but can be activated with the
client_requests
feature.
[dev-dependencies]
iai-callgrind = { version = "0.14.0", features = ["client_requests"] }
If you need the client requests in your production code, you don't want them to
do anything when not running under valgrind with Iai-Callgrind benchmarks. You
can achieve that by adding Iai-Callgrind with the client_requests_defs
feature
to your runtime dependencies and with the client_requests
feature to your
dev-dependencies
like so:
[dependencies]
iai-callgrind = { version = "0.14.0", default-features = false, features = [
"client_requests_defs"
] }
[dev-dependencies]
iai-callgrind = { version = "0.14.0", features = ["client_requests"] }
With just the client_requests_defs
feature activated, the client requests
compile down to nothing and don't add any overhead to your production code. It
simply provides the "definitions", method signatures and macros without body.
Only with the client_requests feature activated will they actually be executed. Note that the client requests do not depend on any other part of
Iai-Callgrind, so you could even use the client requests without the rest of
Iai-Callgrind.
When building Iai-Callgrind with client requests, the valgrind header files must
exist in your standard include path (most of the time /usr/include
). This is
usually the case if you've installed valgrind with your distribution's package
manager. If not, you can point the IAI_CALLGRIND_VALGRIND_INCLUDE
or
IAI_CALLGRIND_<triple>_VALGRIND_INCLUDE
environment variables to the include
path. So, if the headers can be found in /home/foo/repo/valgrind/{valgrind.h, callgrind.h, ...}
, the correct include path would be
IAI_CALLGRIND_VALGRIND_INCLUDE=/home/foo/repo
(not /home/foo/repo/valgrind
)
Usage
Use them in your code for example like so:
extern crate iai_callgrind;
use iai_callgrind::client_requests;

fn main() {
    // Start callgrind event counting if not already started earlier
    client_requests::callgrind::start_instrumentation();

    // do something important

    // Switch event counting off
    client_requests::callgrind::stop_instrumentation();
}
Library Benchmarks
In library benchmarks you might need to
use EntryPoint::None
in order to make the client requests work
as expected:
extern crate iai_callgrind;
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

pub mod my_lib {
    #[inline(never)]
    fn bubble_sort(input: Vec<i32>) -> Vec<i32> {
        // The algorithm
        input
    }

    pub fn pre_bubble_sort(input: Vec<i32>) -> Vec<i32> {
        println!("Doing something before the function call");

        iai_callgrind::client_requests::callgrind::start_instrumentation();
        let result = bubble_sort(input);
        iai_callgrind::client_requests::callgrind::stop_instrumentation();

        result
    }
}

#[library_benchmark]
#[bench::small(vec![3, 2, 1])]
#[bench::bigger(vec![5, 4, 3, 2, 1])]
fn bench_function(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::pre_bubble_sort(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_function);

fn main() {
    main!(library_benchmark_groups = my_group);
}
The default EntryPoint
sets the --toggle-collect
to the benchmark function (here bench_function
) and
--collect-atstart=no
. So, Callgrind
starts collecting the events when
entering the benchmark function, not the moment start_instrumentation
is
called. This behaviour can be remedied with EntryPoint::None
:
extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    client_requests, EntryPoint
};
use std::hint::black_box;

pub mod my_lib {
    #[inline(never)]
    fn bubble_sort(input: Vec<i32>) -> Vec<i32> {
        // The algorithm
        input
    }

    pub fn pre_bubble_sort(input: Vec<i32>) -> Vec<i32> {
        println!("Doing something before the function call");

        iai_callgrind::client_requests::callgrind::start_instrumentation();
        let result = bubble_sort(input);
        iai_callgrind::client_requests::callgrind::stop_instrumentation();

        result
    }
}

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--collect-atstart=no"])
        .entry_point(EntryPoint::None)
)]
#[bench::small(vec![3, 2, 1])]
#[bench::bigger(vec![5, 4, 3, 2, 1])]
fn bench_function(array: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::pre_bubble_sort(array))
}

library_benchmark_group!(name = my_group; benchmarks = bench_function);

fn main() {
    main!(library_benchmark_groups = my_group);
}
As the standard toggle is now switched off and the option --collect-atstart=no is no longer set automatically, you must specify --collect-atstart=no manually in LibraryBenchmarkConfig::callgrind_args.
Please see the
docs
for
more details!
Callgrind Flamegraphs
Flamegraphs are opt-in and can be created if you pass a FlamegraphConfig
to
the BinaryBenchmarkConfig
or LibraryBenchmarkConfig
. Callgrind flamegraphs
are meant as a complement to valgrind's visualization tools
callgrind_annotate
and kcachegrind
.
For example, create all kinds of flamegraphs for all benchmarks in a library benchmark file:
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    FlamegraphConfig
};
use std::hint::black_box;

#[library_benchmark]
fn bench_library() -> Vec<i32> {
    black_box(my_lib::bubble_sort(vec![3, 2, 1]))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
    main!(
        config = LibraryBenchmarkConfig::default()
            .flamegraph(FlamegraphConfig::default());
        library_benchmark_groups = my_group
    );
}
The produced flamegraph *.svg
files are located next to the respective
callgrind output file in the target/iai
directory.
Regular Flamegraphs
Regular callgrind flamegraphs show the inclusive costs for functions and a
single EventKind
(default is EventKind::Ir
), similar to
callgrind_annotate
. Suppose the example from above is stored in a benchmark
iai_callgrind_benchmark
:
If you open this image in a new tab, you can play around with the svg.
Differential Flamegraphs
Differential flamegraphs facilitate a deeper understanding of code sections which cause a bottleneck or a performance regression etc.
We simulated a small change in bubble_sort
and in the differential flamegraph
you can spot fairly easily where the increase of Instructions
is happening.
(Experimental) Create flamegraphs for multi-threaded/multi-process benchmarks
Note the following only affects flamegraphs of multi-threaded/multi-process benchmarks and benchmarks which produce multiple parts with a total over all sub-metrics.
Currently, Iai-Callgrind creates the flamegraphs only for the total over all threads/parts and subprocesses. This leads to complications since the call graph cannot be fully recovered just by examining each thread/subprocess separately. So, the total metrics in the flamegraphs might not be the same as the total metrics shown in the terminal output. If in doubt, the terminal output shows the correct metrics.
Basic usage
It's possible to pass arguments to Iai-Callgrind separated by --
(cargo bench -- ARGS
). If you're running into the error Unrecognized Option
, see
Troubleshooting.
For a complete rundown of possible arguments, execute cargo bench --bench <benchmark> -- --help
. Almost all command-line arguments have a corresponding
environment variable. The environment variables which don't have a corresponding
command-line argument are:
- IAI_CALLGRIND_COLOR: Control the colored output of Iai-Callgrind (Default is auto)
- IAI_CALLGRIND_LOG: Define the log level (Default is WARN)
The command-line arguments
High-precision and consistent benchmarking framework/harness for Rust
Boolish command line arguments take also one of `y`, `yes`, `t`, `true`, `on`,
`1`
instead of `true` and one of `n`, `no`, `f`, `false`, `off`, and `0` instead of
`false`
Usage: cargo bench ... [BENCHNAME] -- [OPTIONS]
Arguments:
[BENCHNAME]
If specified, only run benches containing this string in their names
Note that a benchmark name might differ from the benchmark file name.
[env: IAI_CALLGRIND_FILTER=]
Options:
--callgrind-args <CALLGRIND_ARGS>
The raw arguments to pass through to Callgrind
This is a space separated list of command-line-arguments specified as
if they were
passed directly to valgrind.
Examples:
* --callgrind-args=--dump-instr=yes
* --callgrind-args='--dump-instr=yes --collect-systime=yes'
[env: IAI_CALLGRIND_CALLGRIND_ARGS=]
--save-summary[=<SAVE_SUMMARY>]
Save a machine-readable summary of each benchmark run in json format
next to the usual benchmark output
[env: IAI_CALLGRIND_SAVE_SUMMARY=]
Possible values:
- json: The format in a space optimal json representation
without newlines
- pretty-json: The format in pretty printed json
--allow-aslr[=<ALLOW_ASLR>]
Allow ASLR (Address Space Layout Randomization)
If possible, ASLR is disabled on platforms that support it (linux,
freebsd) because ASLR could noise up the callgrind cache simulation results a
bit. Setting this option to true runs all benchmarks with ASLR enabled.
See also
<https://docs.kernel.org/admin-guide/sysctl/kernel.html?
highlight=randomize_va_space#randomize-va-space>
[env: IAI_CALLGRIND_ALLOW_ASLR=]
[possible values: true, false]
--regression <REGRESSION>
Set performance regression limits for specific `EventKinds`
This is a `,` separated list of EventKind=limit (key=value) pairs with
the limit being a positive or negative percentage. If positive, a performance
regression check for this `EventKind` fails if the limit is exceeded. If
negative, the regression check fails if the value comes below the limit. The
`EventKind` is matched case-insensitive. For a list of valid `EventKinds` see
the docs:
<https://docs.rs/iai-callgrind/latest/iai_callgrind/enum.EventKind.html>
Examples: --regression='ir=0.0' or --regression='ir=0,
EstimatedCycles=10'
[env: IAI_CALLGRIND_REGRESSION=]
--regression-fail-fast[=<REGRESSION_FAIL_FAST>]
If true, the first failed performance regression check fails the
whole benchmark run
This option requires `--regression=...` or
`IAI_CALLGRIND_REGRESSION=...` to be present.
[env: IAI_CALLGRIND_REGRESSION_FAIL_FAST=]
[possible values: true, false]
--save-baseline[=<SAVE_BASELINE>]
Compare against this baseline if present and then overwrite it
[env: IAI_CALLGRIND_SAVE_BASELINE=]
--baseline[=<BASELINE>]
Compare against this baseline if present but do not overwrite it
[env: IAI_CALLGRIND_BASELINE=]
--load-baseline[=<LOAD_BASELINE>]
Load this baseline as the new data set instead of creating a new one
[env: IAI_CALLGRIND_LOAD_BASELINE=]
--output-format <OUTPUT_FORMAT>
The terminal output format in default human-readable format or in
machine-readable json format
# The JSON Output Format
The json terminal output schema is the same as the schema with the
`--save-summary` argument when saving to a `summary.json` file. All other
output than the json output goes to stderr and only the summary output goes to
stdout. When not printing pretty json, each line is a dictionary summarizing a
single benchmark. You can combine all lines (benchmarks) into an array for
example with `jq`
`cargo bench -- --output-format=json | jq -s`
which transforms `{...}\n{...}` into `[{...},{...}]`
[env: IAI_CALLGRIND_OUTPUT_FORMAT=]
[default: default]
[possible values: default, json, pretty-json]
--separate-targets[=<SEPARATE_TARGETS>]
Separate iai-callgrind benchmark output files by target
The default output path for files created by iai-callgrind and
valgrind during the benchmark is
`target/iai/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID`.
This can be problematic if you're running the benchmarks not only for
a single target because you end up comparing the benchmark runs with the wrong
targets. Setting this option changes the default output path to
`target/iai/$TARGET/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/
$BENCH_FUNCTION.$BENCH_ID`
Although not as comfortable and strict, you could achieve a
separation by target also with baselines and a combination of
`--save-baseline=$TARGET` and `--baseline=$TARGET` if you prefer having all
files of a single $BENCH in the same directory.
[env: IAI_CALLGRIND_SEPARATE_TARGETS=]
[default: false]
[possible values: true, false]
--home <HOME>
Specify the home directory of iai-callgrind benchmark output files
All output files are per default stored under the
`$PROJECT_ROOT/target/iai` directory. This option lets you customize this
home directory, and it will be created if it doesn't exist.
[env: IAI_CALLGRIND_HOME=]
--nocapture[=<NOCAPTURE>]
Don't capture terminal output of benchmarks
Possible values are one of [true, false, stdout, stderr].
This option is currently restricted to the `callgrind` run of
benchmarks. The output of additional tool runs like DHAT, Memcheck, ... is
still captured, to prevent showing the same output of benchmarks multiple
times. Use `IAI_CALLGRIND_LOG=info` to also show captured and logged output.
If no value is given, the default missing value is `true` and doesn't
capture stdout and stderr. Besides `true` or `false` you can specify the
special values `stdout` or `stderr`. If `--nocapture=stdout` is given, the
output to `stdout` won't be captured and the output to `stderr` will be
discarded. Likewise, if `--nocapture=stderr` is specified, the output to
`stderr` won't be captured and the output to `stdout` will be discarded.
[env: IAI_CALLGRIND_NOCAPTURE=]
[default: false]
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
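For example, to combine several of these arguments (my_benchmark is a placeholder for your benchmark name), the following run saves a machine-readable summary next to the usual benchmark output, sets a regression limit of 5% for the instruction count and aborts at the first failed regression check:
cargo bench --bench my_benchmark -- --save-summary=json --regression='ir=5.0' --regression-fail-fast=true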
Comparing with baselines
Usually, Iai-Callgrind compares two consecutive benchmark runs with each other. However, it's sometimes desirable to compare the current benchmark run against a static reference instead. For example, if you're working on the implementation of a feature for a longer time, you may wish to compare against a baseline from another branch or against the commit from which you started off hacking on your new feature, to make sure you haven't introduced performance regressions. Iai-Callgrind offers such custom baselines. If you are familiar with criterion.rs, the following command-line arguments should also be very familiar to you:
- --save-baseline=NAME (env: IAI_CALLGRIND_SAVE_BASELINE): Compare against the NAME baseline if present and then overwrite it.
- --baseline=NAME (env: IAI_CALLGRIND_BASELINE): Compare against the NAME baseline without overwriting it.
- --load-baseline=NAME (env: IAI_CALLGRIND_LOAD_BASELINE): Load the NAME baseline as the new data set instead of creating a new one. This option also requires --baseline=NAME to be present.
If NAME
is not present, NAME
defaults to default
.
For example, to create a static reference from the main branch and compare against it:
git checkout main
cargo bench --bench <benchmark> -- --save-baseline=main
git checkout feature
# ... HACK ... HACK
cargo bench --bench <benchmark> -- --baseline=main
Sticking to the above execution sequence,
cargo bench --bench my_benchmark -- --save-baseline=main
prints something like the following with an additional Baselines line in the output.
my_benchmark::my_group::bench_library
Baselines: main|main
Instructions: 280|N/A (*********)
L1 Hits: 374|N/A (*********)
L2 Hits: 1|N/A (*********)
RAM Hits: 6|N/A (*********)
Total read+write: 381|N/A (*********)
Estimated Cycles: 589|N/A (*********)
After you've made some changes to your code, running
cargo bench --bench my_benchmark -- --baseline=main
prints something like the following:
my_benchmark::my_group::bench_library
Baselines: |main
Instructions: 214|280 (-23.5714%) [-1.30841x]
L1 Hits: 287|374 (-23.2620%) [-1.30314x]
L2 Hits: 1|1 (No change)
RAM Hits: 6|6 (No change)
Total read+write: 294|381 (-22.8346%) [-1.29592x]
Estimated Cycles: 502|589 (-14.7708%) [-1.17331x]
Controlling the output of Iai-Callgrind
This section describes command-line options and environment variables which influence the terminal, file and logging output of Iai-Callgrind.
Customize the output directory
All output files of Iai-Callgrind are usually stored using the following scheme:
$WORKSPACE_ROOT/target/iai/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID
This directory structure can partly be changed with the following options.
Callgrind Home
Per default, all benchmark output files are stored under the
$WORKSPACE_ROOT/target/iai
directory tree. This home directory can be changed
with the IAI_CALLGRIND_HOME
environment variable or the command-line argument
--home
. The command-line argument overwrites the value of the environment
variable. For example, to store all files under the /tmp/iai-callgrind
directory you can use IAI_CALLGRIND_HOME=/tmp/iai-callgrind
or cargo bench -- --home=/tmp/iai-callgrind
.
Separate targets
If you're running the benchmarks on different targets, it's necessary to separate the output files of the benchmark runs per target, or else you could end up comparing benchmark runs of the wrong target, leading to strange results. You can achieve this with different baselines per target, but it's much less painful to separate the output files by target with the --separate-targets command-line argument or by setting the environment variable IAI_CALLGRIND_SEPARATE_TARGETS=yes. The output directory structure
changes from
target/iai/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID
to
target/iai/$TARGET_TRIPLE/$PACKAGE_NAME/$BENCHMARK_FILE/$GROUP/$BENCH_FUNCTION.$BENCH_ID
For example, assuming the library benchmark file name is bench_file
in the
package my_package
extern crate iai_callgrind;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use iai_callgrind::{main, library_benchmark_group, library_benchmark};
use std::hint::black_box;

#[library_benchmark]
#[bench::short(vec![4, 3, 2, 1])]
fn bench_bubble_sort(values: Vec<i32>) -> Vec<i32> {
    black_box(my_lib::bubble_sort(values))
}

library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort);

fn main() {
    main!(library_benchmark_groups = my_group);
}
Without --separate-targets
:
target/iai/my_package/bench_file/my_group/bench_bubble_sort.short
and with --separate-targets
assuming you're running the benchmark on the
x86_64-unknown-linux-gnu
target:
target/iai/x86_64-unknown-linux-gnu/my_package/bench_file/my_group/bench_bubble_sort.short
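For example, assuming the x86_64-unknown-linux-gnu toolchain is installed, you could run the benchmarks for this target with separated output files like this:
cargo bench --target x86_64-unknown-linux-gnu -- --separate-targets
or, equivalently, with the environment variable:
IAI_CALLGRIND_SEPARATE_TARGETS=yes cargo bench --target x86_64-unknown-linux-gnu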
Machine-readable output
With --output-format=default|json|pretty-json
(env:
IAI_CALLGRIND_OUTPUT_FORMAT
) you can change the terminal output format to the
machine-readable json format. The json schema fully describing the json output
is stored in
summary.v2.schema.json.
Each line of json output (if not pretty-json
) is a summary of a single
benchmark, and you may want to combine all benchmarks in an array. You can do so
for example with jq
cargo bench -- --output-format=json | jq -s
which transforms {...}\n{...}
into [{...},{...}]
.
Instead of, or in addition to changing the terminal output, it's possible to
save a summary file for each benchmark with --save-summary=json|pretty-json
(env: IAI_CALLGRIND_SAVE_SUMMARY
). The summary.json
files are stored next to
the usual benchmark output files in the target/iai
directory.
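For example, to store a pretty printed summary.json file for each benchmark in addition to the usual terminal output (my_benchmark is a placeholder for your benchmark name):
cargo bench --bench my_benchmark -- --save-summary=pretty-json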
Showing terminal output of benchmarks
Per default, all terminal output of the benchmark function, setup
and
teardown
is captured and therefore not shown during a benchmark run.
Using the log level
The most basic possibility to show any captured output is to use IAI_CALLGRIND_LOG=info. However, this also includes a lot of other output.
Tell Iai-Callgrind to not capture the output
A nicer possibility is to tell Iai-Callgrind not to capture output with the --nocapture (env: IAI_CALLGRIND_NOCAPTURE) option. This is currently restricted to the callgrind run to prevent showing the same output multiple times. So, any terminal output of other tool runs is still captured.
The --nocapture
flag takes the special values stdout
and stderr
in
addition to true
and false
:
--nocapture=true|false|stdout|stderr
In the --nocapture=stdout
case, terminal output to stdout
is not captured
and shown during the benchmark run but output to stderr
is discarded.
Likewise, --nocapture=stderr
shows terminal output to stderr
but discards
output to stdout
.
Let's take as an example a library benchmark benches/my_benchmark.rs:
extern crate iai_callgrind;
use iai_callgrind::{library_benchmark, library_benchmark_group, main};
use std::hint::black_box;

fn print_to_stderr(value: u64) {
    eprintln!("Error output during teardown: {value}");
}

fn add_10_and_print(value: u64) -> u64 {
    let value = value + 10;
    println!("Output to stdout: {value}");
    value
}

#[library_benchmark]
#[bench::some_id(args = (10), teardown = print_to_stderr)]
fn bench_library(value: u64) -> u64 {
    black_box(add_10_and_print(value))
}

library_benchmark_group!(name = my_group; benchmarks = bench_library);

fn main() {
    main!(library_benchmark_groups = my_group);
}
If the above benchmark is run with cargo bench --bench my_benchmark -- --nocapture
, the output of Iai-Callgrind will look like this:
my_benchmark::my_group::bench_library some_id:10
Output to stdout: 20
Error output during teardown: 20
- end of stdout/stderr
Instructions: 851|N/A (*********)
L1 Hits: 1193|N/A (*********)
L2 Hits: 5|N/A (*********)
RAM Hits: 66|N/A (*********)
Total read+write: 1264|N/A (*********)
Estimated Cycles: 3528|N/A (*********)
Everything between the headline and the - end of stdout/stderr
line is output
from your benchmark. The - end of stdout/stderr
line changes depending on the
options you have given. For example in the --nocapture=stdout
case this line
indicates your chosen option with - end of stdout
.
Note that independently of the value of the --nocapture
option, all logging
output of a valgrind tool itself is stored in files in the output directory of
the benchmark. Since Iai-Callgrind needs the logging output of valgrind tools
stored in files, there is no option to disable the creation of these log files.
But, if anything goes sideways you might be glad to have the log files around.
Changing the color output
The terminal output is colored per default but follows the value for the
IAI_CALLGRIND_COLOR
environment variable. If IAI_CALLGRIND_COLOR
is not set,
CARGO_TERM_COLOR
is also tried. Accepted values are:
always
, never
, auto
(default).
So, disabling colors can be achieved by setting IAI_CALLGRIND_COLOR=never or CARGO_TERM_COLOR=never.
Changing the logging output
Iai-Callgrind uses env_logger and the
default logging level WARN
. To set the logging level to something different,
set the environment variable IAI_CALLGRIND_LOG
for example to
IAI_CALLGRIND_LOG=DEBUG
. Accepted values are:
error
, warn
(default), info
, debug
, trace
.
The logging output is colored per default but follows the Color settings.
See also the documentation of env_logger
.
I'm getting the error Sentinel ... not found
You've most likely disabled creating debug symbols in your cargo bench
profile. This can originate from an option you've added to the release
profile
since the bench
profile inherits the release
profile. For example, if you've
added strip = true
to your release
profile which is perfectly fine, you need
to disable this option in your bench
profile to be able to run Iai-Callgrind
benchmarks.
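For example, if your release profile contains strip = true, a minimal sketch of a bench profile in Cargo.toml which restores the debug symbols needed by Iai-Callgrind could look like this:
[profile.bench]
# Override the `strip = true` inherited from the release profile
strip = false
debug = true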
See also the Debug Symbols section in Installation/Prerequisites.
Running cargo bench results in an "Unrecognized Option" error
For
cargo bench -- --some-valid-arg
to work you can either specify the
benchmark with --bench BENCHMARK
, for example
cargo bench --bench my_iai_benchmark -- --callgrind-args="--collect-bus=yes"
or add the following to your Cargo.toml
:
[lib]
bench = false
and if you have binaries
[[bin]]
name = "my-binary"
path = "src/bin/my-binary.rs"
bench = false
Setting bench = false
disables the creation of the implicit default libtest
harness which is added even if you haven't used #[bench]
functions in your
library or binary. Naturally, the default harness doesn't know of the
Iai-Callgrind arguments and aborts execution printing the Unrecognized Option
error.
If you cannot or don't want to add bench = false
to your Cargo.toml
, you can
alternatively use environment variables. For every command-line argument there exists a corresponding environment variable.
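For example, instead of passing the callgrind arguments on the command line, you can set the corresponding environment variable:
IAI_CALLGRIND_CALLGRIND_ARGS="--collect-bus=yes" cargo bench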
Comparison of Iai-Callgrind with Criterion-rs
This is a comparison with Criterion-rs but some of the points in Pros and Cons also apply to other wall-clock time based benchmarking frameworks.
Iai-Callgrind Pros:
- Iai-Callgrind can give answers that are repeatable to 7 or more significant digits. In comparison, actual (wall-clock) run times are scarcely repeatable beyond one significant digit. This makes it possible to implement and measure "microoptimizations". Typical microoptimizations reduce the number of CPU cycles by 0.1% or 0.05% or even less. Such improvements are impossible to measure with real-world timings. But hundreds or thousands of microoptimizations add up, resulting in measurable real-world performance gains.1
- Iai-Callgrind can work reliably in noisy environments, especially in CI environments from providers like GitHub Actions or Travis-CI, where Criterion-rs cannot.
- The benchmark api of Iai-Callgrind is simple, intuitive and allows for a much more concise and clearer structure of benchmarks.
- Iai-Callgrind can benchmark functions in binary crates.
- Iai-Callgrind can benchmark private functions.
- Although Callgrind adds runtime overhead, running each benchmark exactly once is still usually much faster than Criterion-rs' statistical measurements.
- Criterion-rs creates plots and graphs about the averages, medians etc. which add considerable time to each benchmark run. Iai-Callgrind doesn't need any of these plots, since it can collect all its metrics in a single run.
- Iai-Callgrind generates profile output from the benchmark without further effort.
- With Iai-Callgrind you have native access to all the possibilities of all Valgrind tools, including Valgrind Client Requests.
Iai-Callgrind/Criterion-rs Mixed:
- Although it is usually not significant, due to the high precision of the Iai-Callgrind measurements, changes in the benchmarks themselves, like adding a benchmark case, can have an effect on the other benchmarks. Iai-Callgrind can only try to reduce these effects to a minimum but never completely eliminate them. Criterion-rs does not have this problem because it cannot detect such small changes.
Iai-Callgrind Cons:
- Iai-Callgrind's measurements merely correlate with wall-clock time. Wall-clock time is an obvious choice in many cases because it corresponds to what users perceive and Criterion-rs measures it directly.
- Iai-Callgrind can only be used on platforms supported by Valgrind. Notably, this does not include Windows.
- Iai-Callgrind needs additional binaries, valgrind and the iai-callgrind-runner. The version of the runner needs to be in sync with the iai-callgrind library. Criterion-rs is only a library and the installation is usually simpler.
Especially due to the first point in the cons, I think it is still necessary to run wall-clock time benchmarks and use Criterion-rs in conjunction with Iai-Callgrind. But in CI and for performance regression checks, you shouldn't use Criterion-rs or other wall-clock time based benchmarks at all.
Comparison of Iai-Callgrind with Iai
This is a comparison with Iai, from which Iai-Callgrind was forked over a year ago.
Iai-Callgrind Pros:
- Iai-Callgrind is actively maintained.
- The benchmark api of Iai-Callgrind is simple, intuitive and allows for a much more concise and clearer structure of benchmarks.
- More stable metrics because the benchmark function is virtually encapsulated by Callgrind, which separates the benchmarked code from the surrounding code.
- Iai-Callgrind excludes setup code from the metrics natively.
- The Callgrind output files are much more focused on the benchmark function and the function under test than the Cachegrind output files that Iai produces. The calibration run of Iai only sanitized the visible summary output but not the metrics in the output files themselves. So, the output of cg_annotate was still cluttered with the initialization code, setup functions and their metrics.
- Changes to the library of Iai-Callgrind almost never have an influence on the benchmark metrics, since the actual runner (iai-callgrind-runner) and thus 99% of the code needed to run the benchmarks is isolated from the benchmarks in an independent binary, in contrast to the library of Iai which is compiled together with the benchmarks.
- Iai-Callgrind has functionality in place that provides a more constant environment, like the Sandbox and the clearing of environment variables.
- Supports running other Valgrind tools, like DHAT, Massif etc.
- Comparison of benchmark functions.
- Iai-Callgrind can be configured to check for performance regressions.
- A complete implementation of Valgrind Client Requests is available in Iai-Callgrind itself.
- Comparison of benchmarks to baselines instead of only to .old files.
- Iai-Callgrind natively supports benchmarking binaries.
- Iai-Callgrind can print machine-readable output in .json format.
I don't see any downside in using Iai-Callgrind instead of Iai.