Multi-threaded and multi-process applications

By default, Iai-Callgrind runs benchmarks with --separate-threads=yes and --trace-children=yes switched on. This enables Iai-Callgrind to trace threads and subprocesses, respectively. Note that --separate-threads=yes is not strictly necessary to trace threads, but if threads are separated, Iai-Callgrind can collect and display the metrics for each thread individually. Due to the way callgrind applies data collection options like --toggle-collect, --collect-atstart, ..., further configuration is needed in library benchmarks.

To actually see the collected metrics in the terminal output for all threads and/or subprocesses, you can switch on OutputFormat::show_intermediate:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: u64) -> Vec<u64> { vec![]} }
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    OutputFormat
};
use std::hint::black_box;

#[library_benchmark]
fn bench_threads() -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(2))
}

library_benchmark_group!(name = my_group; benchmarks = bench_threads);
main!(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
            .show_intermediate(true)
        );
    library_benchmark_groups = my_group
);

The best method for benchmarking threads and subprocesses depends heavily on your code. So, rather than suggesting a single "best" method for benchmarking threads and subprocesses, this chapter will run through various possible approaches and try to highlight the pros and cons of each.

Multi-threaded applications

Callgrind treats each thread and each process as a separate unit and applies the data collection options to each unit. In library benchmarks, the entry point (or default toggle) for callgrind is by default set to the benchmark function with the help of the --toggle-collect option. Setting --toggle-collect also automatically sets --collect-atstart=no. If not customized further for a benchmarked multi-threaded function, these options cause the metrics for the spawned threads to be zero: each thread is a separate unit to which --collect-atstart=no and the default toggle are applied, and since the default toggle is the benchmark function, it never hooks into any function running in the thread, so the metrics stay zero.

There are multiple ways to customize the default behaviour and actually measure the threads. The following examples use the benchmark and library code below to show the different customization options, assuming this code lives in a benchmark file benches/lib_bench_threads.rs:

extern crate iai_callgrind;
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    OutputFormat
};
use std::hint::black_box;

/// Suppose this is your library
pub mod my_lib {
    /// Return true if `num` is a prime number
    pub fn is_prime(num: u64) -> bool {
        if num <= 1 {
            return false;
        }

        for i in 2..=(num as f64).sqrt() as u64 {
            if num % i == 0 {
                return false;
            }
        }

        true
    }

    /// Find and return all prime numbers in the inclusive range `low` to `high`
    pub fn find_primes(low: u64, high: u64) -> Vec<u64> {
        (low..=high).filter(|n| is_prime(*n)).collect()
    }

    /// Return the prime numbers in the range `0..(num_threads * 10000)`
    pub fn find_primes_multi_thread(num_threads: usize) -> Vec<u64> {
        let mut handles = vec![];
        let mut low = 0;
        for _ in 0..num_threads {
            let handle = std::thread::spawn(move || find_primes(low, low + 10000));
            handles.push(handle);

            low += 10000;
        }

        let mut primes = vec![];
        for handle in handles {
            let result = handle.join();
            primes.extend(result.unwrap())
        }

        primes
    }
}

#[library_benchmark]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}

library_benchmark_group!(name = my_group; benchmarks = bench_threads);
main!(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
            .show_intermediate(true)
        );
    library_benchmark_groups = my_group
);

Running this benchmark with cargo bench will present you with the following terminal output:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2097219 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27305|N/A                  (*********)
  L1 Hits:                            66353|N/A                  (*********)
  L2 Hits:                              341|N/A                  (*********)
  RAM Hits:                             539|N/A                  (*********)
  Total read+write:                   67233|N/A                  (*********)
  Estimated Cycles:                   86923|N/A                  (*********)
  ## pid: 2097219 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## pid: 2097219 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## Total
  Instructions:                       27305|N/A                  (*********)
  L1 Hits:                            66353|N/A                  (*********)
  L2 Hits:                              341|N/A                  (*********)
  RAM Hits:                             539|N/A                  (*********)
  Total read+write:                   67233|N/A                  (*********)
  Estimated Cycles:                   86923|N/A                  (*********)

As you can see, the counts for threads 2 and 3 (our spawned threads) are all zero.

Measuring threads using toggles

At first glance, setting a toggle to the function running in the thread seems to be the easiest way and can be done like so:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }}
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--toggle-collect=lib_bench_threads::my_lib::find_primes"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}
library_benchmark_group!(name = my_group; benchmarks = bench_threads);
main!(library_benchmark_groups = my_group);

This approach may or may not work, depending on whether the compiler inlines the target function of the --toggle-collect argument or not. This is the same problem as with custom entry points. As can be seen below, the compiler has chosen to inline find_primes and the metrics for the threads are still zero:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2620776 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27372|N/A                  (*********)
  L1 Hits:                            66431|N/A                  (*********)
  L2 Hits:                              343|N/A                  (*********)
  RAM Hits:                             538|N/A                  (*********)
  Total read+write:                   67312|N/A                  (*********)
  Estimated Cycles:                   86976|N/A                  (*********)
  ## pid: 2620776 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## pid: 2620776 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## Total
  Instructions:                       27372|N/A                  (*********)
  L1 Hits:                            66431|N/A                  (*********)
  L2 Hits:                              343|N/A                  (*********)
  RAM Hits:                             538|N/A                  (*********)
  Total read+write:                   67312|N/A                  (*********)
  Estimated Cycles:                   86976|N/A                  (*********)

Just to show what would happen if the compiler does not inline the find_primes function, we temporarily annotate it with #[inline(never)]:

fn is_prime(_: u64) -> bool { true }

/// Find and return all prime numbers in the inclusive range `low` to `high`
#[inline(never)]
pub fn find_primes(low: u64, high: u64) -> Vec<u64> {
    (low..=high).filter(|n| is_prime(*n)).collect()
}

Now, running the benchmark does show the desired metrics:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2661917 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27372|N/A                  (*********)
  L1 Hits:                            66431|N/A                  (*********)
  L2 Hits:                              343|N/A                  (*********)
  RAM Hits:                             538|N/A                  (*********)
  Total read+write:                   67312|N/A                  (*********)
  Estimated Cycles:                   86976|N/A                  (*********)
  ## pid: 2661917 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2460503|N/A                  (*********)
  L1 Hits:                          2534938|N/A                  (*********)
  L2 Hits:                               12|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                 2535136|N/A                  (*********)
  Estimated Cycles:                 2541508|N/A                  (*********)
  ## pid: 2661917 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3650410|N/A                  (*********)
  L1 Hits:                          3724286|N/A                  (*********)
  L2 Hits:                                4|N/A                  (*********)
  RAM Hits:                             130|N/A                  (*********)
  Total read+write:                 3724420|N/A                  (*********)
  Estimated Cycles:                 3728856|N/A                  (*********)
  ## Total
  Instructions:                     6138285|N/A                  (*********)
  L1 Hits:                          6325655|N/A                  (*********)
  L2 Hits:                              359|N/A                  (*********)
  RAM Hits:                             854|N/A                  (*********)
  Total read+write:                 6326868|N/A                  (*********)
  Estimated Cycles:                 6357340|N/A                  (*********)

But annotating functions with #[inline(never)] in production code is usually not an option, and preventing the compiler from doing its job is not the preferred way to make a benchmark work. The truth is, there is no way to make the --toggle-collect argument work in all cases, since its effectiveness depends heavily on the inlining choices the compiler makes for your code.

Another way to get the thread metrics is to set --collect-atstart=yes and turn off the EntryPoint:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }}
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
        .callgrind_args(["--collect-atstart=yes"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}
library_benchmark_group!(name = my_group; benchmarks = bench_threads);
main!(library_benchmark_groups = my_group);

But the metrics of the main thread will now include all the setup (and teardown) code of the benchmark executable, so the instructions of the main thread go up from 27372 to 404425:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2697019 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                      404425|N/A                  (*********)
  L1 Hits:                           570186|N/A                  (*********)
  L2 Hits:                             1307|N/A                  (*********)
  RAM Hits:                            4856|N/A                  (*********)
  Total read+write:                  576349|N/A                  (*********)
  Estimated Cycles:                  746681|N/A                  (*********)
  ## pid: 2697019 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2466864|N/A                  (*********)
  L1 Hits:                          2543314|N/A                  (*********)
  L2 Hits:                               81|N/A                  (*********)
  RAM Hits:                             409|N/A                  (*********)
  Total read+write:                 2543804|N/A                  (*********)
  Estimated Cycles:                 2558034|N/A                  (*********)
  ## pid: 2697019 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3656729|N/A                  (*********)
  L1 Hits:                          3732802|N/A                  (*********)
  L2 Hits:                               31|N/A                  (*********)
  RAM Hits:                             201|N/A                  (*********)
  Total read+write:                 3733034|N/A                  (*********)
  Estimated Cycles:                 3739992|N/A                  (*********)
  ## Total
  Instructions:                     6528018|N/A                  (*********)
  L1 Hits:                          6846302|N/A                  (*********)
  L2 Hits:                             1419|N/A                  (*********)
  RAM Hits:                            5466|N/A                  (*********)
  Total read+write:                 6853187|N/A                  (*********)
  Estimated Cycles:                 7044707|N/A                  (*********)

Additionally, expect a lot of metric changes if the benchmark itself is changed. However, if the metrics of the main thread are insignificant compared to the total, this might be an applicable (last) choice.

There is a more reliable way, shown in the next section.

Measuring threads using client requests

The perhaps most reliable and flexible way to measure threads is to use client requests. The downside is that you have to put some benchmark code into your production code. But if you followed the installation instructions for client requests, this additional code is only compiled in benchmarks, not into your final production-ready library.

Using the callgrind client request, we adjust the threads in the find_primes_multi_thread function like so:

fn find_primes(_a: u64, _b: u64) -> Vec<u64> { vec![] }
extern crate iai_callgrind;
use iai_callgrind::client_requests::callgrind;

/// Return the prime numbers in the range `0..(num_threads * 10000)`
pub fn find_primes_multi_thread(num_threads: usize) -> Vec<u64> {
    let mut handles = vec![];
    let mut low = 0;
    for _ in 0..num_threads {
        let handle = std::thread::spawn(move || {
            callgrind::toggle_collect();
            let result = find_primes(low, low + 10000);
            callgrind::toggle_collect();
            result
        });
        handles.push(handle);

        low += 10000;
    }

    let mut primes = vec![];
    for handle in handles {
        let result = handle.join();
        primes.extend(result.unwrap())
    }

    primes
}

and running the same benchmark now will show the collected metrics of the threads:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2149242 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                       27305|N/A                  (*********)
  L1 Hits:                            66352|N/A                  (*********)
  L2 Hits:                              344|N/A                  (*********)
  RAM Hits:                             537|N/A                  (*********)
  Total read+write:                   67233|N/A                  (*********)
  Estimated Cycles:                   86867|N/A                  (*********)
  ## pid: 2149242 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2460501|N/A                  (*********)
  L1 Hits:                          2534935|N/A                  (*********)
  L2 Hits:                               13|N/A                  (*********)
  RAM Hits:                             185|N/A                  (*********)
  Total read+write:                 2535133|N/A                  (*********)
  Estimated Cycles:                 2541475|N/A                  (*********)
  ## pid: 2149242 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3650408|N/A                  (*********)
  L1 Hits:                          3724285|N/A                  (*********)
  L2 Hits:                                1|N/A                  (*********)
  RAM Hits:                             131|N/A                  (*********)
  Total read+write:                 3724417|N/A                  (*********)
  Estimated Cycles:                 3728875|N/A                  (*********)
  ## Total
  Instructions:                     6138214|N/A                  (*********)
  L1 Hits:                          6325572|N/A                  (*********)
  L2 Hits:                              358|N/A                  (*********)
  RAM Hits:                             853|N/A                  (*********)
  Total read+write:                 6326783|N/A                  (*********)
  Estimated Cycles:                 6357217|N/A                  (*********)

Using the client request toggles is very flexible since you can put the iai_callgrind::client_requests::callgrind::toggle_collect instructions anywhere in the threads. In this example, we just have a single function in the thread, but if your threads consist of more than just a single function, you can easily exclude uninteresting parts from the final measurements.
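For instance, if a thread also performed some per-thread setup that you don't want to measure, you could wrap only the interesting call in a toggle pair. A hedged sketch, following the client-request API used above; prepare_data and the body of find_primes are hypothetical stand-ins for your own code:

```rust
extern crate iai_callgrind;
use iai_callgrind::client_requests::callgrind;

// Hypothetical per-thread setup and workload functions
fn prepare_data() -> (u64, u64) { (0, 10000) }
fn find_primes(_low: u64, _high: u64) -> Vec<u64> { vec![] }

pub fn spawn_worker() -> std::thread::JoinHandle<Vec<u64>> {
    std::thread::spawn(|| {
        // Not measured: the setup runs outside the toggle pair
        let (low, high) = prepare_data();

        // Measured: only the work between the two toggles is collected
        callgrind::toggle_collect();
        let result = find_primes(low, high);
        callgrind::toggle_collect();

        result
    })
}
```

The same pattern scales to any number of toggle pairs per thread, so a single thread can contribute several measured regions while everything in between stays out of the metrics.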

If you want to prevent the code of the main thread from being measured, you can use the following:

extern crate iai_callgrind;
mod my_lib { pub fn find_primes_multi_thread(_: usize) -> Vec<u64> { vec![] }}
use iai_callgrind::{
    main, library_benchmark_group, library_benchmark, LibraryBenchmarkConfig,
    EntryPoint
};
use std::hint::black_box;

#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .entry_point(EntryPoint::None)
        .callgrind_args(["--collect-atstart=no"])
)]
#[bench::two_threads(2)]
fn bench_threads(num_threads: usize) -> Vec<u64> {
    black_box(my_lib::find_primes_multi_thread(num_threads))
}
library_benchmark_group!(name = my_group; benchmarks = bench_threads);
main!(library_benchmark_groups = my_group);

Setting EntryPoint::None disables the default toggle, but with it also the automatically applied --collect-atstart=no, which is why we have to set that option manually. Altogether, running the benchmark will show:

lib_bench_threads::my_group::bench_threads two_threads:2
  ## pid: 2251257 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## pid: 2251257 thread: 2 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     2460501|N/A                  (*********)
  L1 Hits:                          2534935|N/A                  (*********)
  L2 Hits:                               11|N/A                  (*********)
  RAM Hits:                             187|N/A                  (*********)
  Total read+write:                 2535133|N/A                  (*********)
  Estimated Cycles:                 2541535|N/A                  (*********)
  ## pid: 2251257 thread: 3 part: 1        |N/A
  Command:             target/release/deps/lib_bench_threads-b85159a94ccb3851
  Instructions:                     3650408|N/A                  (*********)
  L1 Hits:                          3724282|N/A                  (*********)
  L2 Hits:                                4|N/A                  (*********)
  RAM Hits:                             131|N/A                  (*********)
  Total read+write:                 3724417|N/A                  (*********)
  Estimated Cycles:                 3728887|N/A                  (*********)
  ## Total
  Instructions:                     6110909|N/A                  (*********)
  L1 Hits:                          6259217|N/A                  (*********)
  L2 Hits:                               15|N/A                  (*********)
  RAM Hits:                             318|N/A                  (*********)
  Total read+write:                 6259550|N/A                  (*********)
  Estimated Cycles:                 6270422|N/A                  (*********)

Multi-process applications

Measuring multi-process applications is in principle not that different from measuring multi-threaded applications, since subprocesses, just like threads, are separate units. As for threads, the data collection options are applied to each subprocess separately from the main process.
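Concretely, the pattern mirrors thread::spawn and join: the parent spawns the child with std::process::Command and waits for its exit status, and under --trace-children=yes the child is profiled as its own unit. A minimal stdlib sketch, using /bin/sh only as a stand-in for your crate's own binary:

```rust
use std::process::Command;

fn main() {
    // Spawn a subprocess and wait for it, analogous to joining a thread.
    // Under callgrind with --trace-children=yes, this child process would
    // be profiled as a separate unit with its own metrics.
    let status = Command::new("/bin/sh")
        .arg("-c")
        .arg("exit 0")
        .status()
        .expect("Spawning the subprocess should succeed");
    assert!(status.success());
}
```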

Note that there are multiple valgrind command-line arguments that can disable the collection of metrics for uninteresting subprocesses, for example subprocesses that are spawned by your library function but are not part of your library/binary crate.
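One such core valgrind option is --trace-children-skip, which takes a comma-separated list of glob patterns matched against the name of the program being executed; matching children are not traced. A hedged sketch of passing it through callgrind_args; the helper binary name and my_lib::run are hypothetical, not from this chapter:

```rust
extern crate iai_callgrind;
use iai_callgrind::{library_benchmark, LibraryBenchmarkConfig};
use std::hint::black_box;

// Hypothetical scenario: `my_lib::run` also spawns an external helper binary
// whose metrics we don't care about. valgrind's --trace-children-skip
// excludes children whose executable path matches one of the glob patterns.
#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--trace-children-skip=*/external-helper"])
)]
fn bench_with_helper() -> u64 {
    black_box(my_lib::run())
}
```

This keeps --trace-children=yes in effect for your own subprocesses while filtering out the noise from third-party programs. See the valgrind manual for the exact pattern semantics.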

For the following examples, suppose the code below is the cat binary of your crate (so we can use env!("CARGO_BIN_EXE_cat")):

use std::fs::File;
use std::io::{copy, stdout, BufReader, BufWriter, Write};

fn main() {
    let mut args_iter = std::env::args().skip(1);
    let file_arg = args_iter.next().expect("File argument should be present");

    let file = File::open(file_arg).expect("Opening file should succeed");
    let stdout = stdout().lock();

    let mut writer = BufWriter::new(stdout);
    copy(&mut BufReader::new(file), &mut writer)
        .expect("Printing file to stdout should succeed");

    writer.flush().expect("Flushing writer should succeed");
}

The above binary is a very simple version of cat taking a single file argument. The file content is read and dumped to stdout. The following is the benchmark and library code to show the different options, assuming this code is stored in a benchmark file benches/lib_bench_subprocess.rs:

extern crate iai_callgrind;
macro_rules! env { ($m:tt) => {{ "/some/path" }} }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;

use iai_callgrind::{
    library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
    OutputFormat,
};

/// Suppose this is your library
pub mod my_lib {
    use std::io;
    use std::path::Path;
    use std::process::ExitStatus;

    /// A function executing the crate's binary `cat`
    pub fn cat(file: &Path) -> io::Result<ExitStatus> {
        std::process::Command::new(env!("CARGO_BIN_EXE_cat"))
            .arg(file)
            .status()
    }
}

/// Create a file `/tmp/foo.txt` with some content
fn create_file() -> PathBuf {
    let path = PathBuf::from("/tmp/foo.txt");
    std::fs::write(&path, "some content").unwrap();
    path
}

#[library_benchmark]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}

library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);
main!(
    config = LibraryBenchmarkConfig::default()
        .output_format(OutputFormat::default()
            .show_intermediate(true)
        );
    library_benchmark_groups = my_group
);

Running the above benchmark with cargo bench results in the following terminal output:

lib_bench_subprocess::my_group::bench_subprocess some:create_file()
  ## pid: 3141785 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
  Instructions:                        4467|N/A                  (*********)
  L1 Hits:                             6102|N/A                  (*********)
  L2 Hits:                               17|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                    6305|N/A                  (*********)
  Estimated Cycles:                   12697|N/A                  (*********)
  ## pid: 3141786 thread: 1 part: 1        |N/A
  Command:             target/release/cat /tmp/foo.txt
  Instructions:                           0|N/A                  (*********)
  L1 Hits:                                0|N/A                  (*********)
  L2 Hits:                                0|N/A                  (*********)
  RAM Hits:                               0|N/A                  (*********)
  Total read+write:                       0|N/A                  (*********)
  Estimated Cycles:                       0|N/A                  (*********)
  ## Total
  Instructions:                        4467|N/A                  (*********)
  L1 Hits:                             6102|N/A                  (*********)
  L2 Hits:                               17|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                    6305|N/A                  (*********)
  Estimated Cycles:                   12697|N/A                  (*********)

As expected, the cat subprocess is not measured and its metrics are zero, for the same reasons as in the initial thread measurements.

Measuring subprocesses using toggles

The great advantage over measuring threads is that each process has a main function, which is not inlined by the compiler and can serve as a reliable hook for the --toggle-collect argument. So, the following adaption of the above benchmark will just work:

extern crate iai_callgrind;
mod my_lib {
use std::{io, path::Path, process::ExitStatus};
pub fn cat(_: &Path) -> io::Result<ExitStatus> {
   std::process::Command::new("some").status()
}}
fn create_file() -> PathBuf { PathBuf::from("some") }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;
use iai_callgrind::{
   library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
   OutputFormat,
};
#[library_benchmark(
    config = LibraryBenchmarkConfig::default()
        .callgrind_args(["--toggle-collect=cat::main"])
)]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}
library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);
main!(library_benchmark_groups = my_group);

producing the desired output

lib_bench_subprocess::my_group::bench_subprocess some:create_file()
  ## pid: 3324117 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
  Instructions:                        4475|N/A                  (*********)
  L1 Hits:                             6112|N/A                  (*********)
  L2 Hits:                               14|N/A                  (*********)
  RAM Hits:                             187|N/A                  (*********)
  Total read+write:                    6313|N/A                  (*********)
  Estimated Cycles:                   12727|N/A                  (*********)
  ## pid: 3324119 thread: 1 part: 1        |N/A
  Command:             target/release/cat /tmp/foo.txt
  Instructions:                        4019|N/A                  (*********)
  L1 Hits:                             5575|N/A                  (*********)
  L2 Hits:                               12|N/A                  (*********)
  RAM Hits:                             167|N/A                  (*********)
  Total read+write:                    5754|N/A                  (*********)
  Estimated Cycles:                   11480|N/A                  (*********)
  ## Total
  Instructions:                        8494|N/A                  (*********)
  L1 Hits:                            11687|N/A                  (*********)
  L2 Hits:                               26|N/A                  (*********)
  RAM Hits:                             354|N/A                  (*********)
  Total read+write:                   12067|N/A                  (*********)
  Estimated Cycles:                   24207|N/A                  (*********)

Measuring subprocesses using client requests

Naturally, client requests can also be used to measure subprocesses. The callgrind client requests are added to the code of the cat binary:

extern crate iai_callgrind;
use std::fs::File;
use std::io::{copy, stdout, BufReader, BufWriter, Write};
use iai_callgrind::client_requests::callgrind;

fn main() {
    let mut args_iter = std::env::args().skip(1);
    let file_arg = args_iter.next().expect("File argument should be present");

    callgrind::toggle_collect();
    let file = File::open(file_arg).expect("Opening file should succeed");
    let stdout = stdout().lock();

    let mut writer = BufWriter::new(stdout);
    copy(&mut BufReader::new(file), &mut writer)
        .expect("Printing file to stdout should succeed");

    writer.flush().expect("Flushing writer should succeed");
    callgrind::toggle_collect();
}

For the purpose of this example, we decided that measuring the parsing of the command-line arguments is not interesting for us and excluded it from the collected metrics. The benchmark itself is reverted to its original state without the toggle:

extern crate iai_callgrind;
mod my_lib {
use std::{io, path::Path, process::ExitStatus};
pub fn cat(_: &Path) -> io::Result<ExitStatus> {
   std::process::Command::new("some").status()
}}
fn create_file() -> PathBuf { PathBuf::from("some") }
use std::hint::black_box;
use std::io;
use std::path::PathBuf;
use std::process::ExitStatus;
use iai_callgrind::{
   library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig,
   OutputFormat,
};
#[library_benchmark]
#[bench::some(setup = create_file)]
fn bench_subprocess(path: PathBuf) -> io::Result<ExitStatus> {
    black_box(my_lib::cat(&path))
}
library_benchmark_group!(name = my_group; benchmarks = bench_subprocess);
main!(library_benchmark_groups = my_group);

Now, running the benchmark shows

lib_bench_subprocess::my_group::bench_subprocess some:create_file()
  ## pid: 3421822 thread: 1 part: 1        |N/A
  Command:             target/release/deps/lib_bench_subprocess-a1b2e1eac5125819
  Instructions:                        4467|N/A                  (*********)
  L1 Hits:                             6102|N/A                  (*********)
  L2 Hits:                               17|N/A                  (*********)
  RAM Hits:                             186|N/A                  (*********)
  Total read+write:                    6305|N/A                  (*********)
  Estimated Cycles:                   12697|N/A                  (*********)
  ## pid: 3421823 thread: 1 part: 1        |N/A
  Command:             target/release/cat /tmp/foo.txt
  Instructions:                        2429|N/A                  (*********)
  L1 Hits:                             3406|N/A                  (*********)
  L2 Hits:                                8|N/A                  (*********)
  RAM Hits:                             138|N/A                  (*********)
  Total read+write:                    3552|N/A                  (*********)
  Estimated Cycles:                    8276|N/A                  (*********)
  ## Total
  Instructions:                        6896|N/A                  (*********)
  L1 Hits:                             9508|N/A                  (*********)
  L2 Hits:                               25|N/A                  (*********)
  RAM Hits:                             324|N/A                  (*********)
  Total read+write:                    9857|N/A                  (*********)
  Estimated Cycles:                   20973|N/A                  (*********)

As expected, the metrics for the cat binary are a little bit lower since we skipped measuring the parsing of the command-line arguments.