Hyperfine CLI Benchmarking Tool Guide

Stop Guessing, Start Measuring

We’ve all done it: you write a new script, or you swap a Python loop for a list comprehension, and you run time my_script.py. The terminal spits out real 0.5s. You run it again. real 0.6s. You run it a third time. real 0.4s.

Which one is right? In a modern operating system, a single run is almost entirely meaningless. Background tasks (like your browser or a system update), CPU thermal throttling, and disk caching can all skew your results by 20% or more. To truly understand performance, you need a tool that respects the scientific method. You need Hyperfine.

1. The Statistical Fallacy of the time Command

The standard /usr/bin/time (or the shell's builtin equivalent) is a simple wrapper. It starts a wall-clock timer, forks a child process, waits for it to exit, and records the elapsed time. While useful for a quick check, it is fundamentally flawed for serious benchmarking, for several reasons:
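
To see just how noisy single measurements are, run the same command a few times back to back (my_script.py here is a stand-in for whatever you actually want to time):

# Three consecutive runs of the same script; expect the wall-clock times to drift
for i in 1 2 3; do time python3 my_script.py; done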

Cold vs. Warm Caching

Operating systems are incredibly aggressive about caching. The first time you run a command that reads a 500MB file, the data has to come off the physical disk over the SATA/NVMe bus. The second time, that data is likely already sitting in the kernel’s page cache (RAM). time treats these two vastly different scenarios as identical data points.
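
If it is specifically the cold-cache numbers you care about, one common trick (Linux-only, requires root) is to drop the page cache before every timed run using Hyperfine's --prepare hook, covered later in this guide; big_file.bin is a placeholder for your own input:

# Flush the Linux page cache before each run so every read is a cold read
hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
  'cat big_file.bin > /dev/null'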

System Jitter (Noise)

Your CPU isn’t just running your benchmark. It’s handling network interrupts, updating your clock, and maybe even context-switching to a rogue Electron app in the background. A single measurement captures all this “noise” without any way to isolate the “signal.”

Lack of Statistical Context

If time says 0.45s and another run says 0.48s, is that a meaningful difference or just statistical variance? Without a distribution of runs and a measure of standard deviation, you are essentially reading tea leaves.

2. Enter Hyperfine: Benchmarking with Rigor

Hyperfine, a modern benchmarking tool written in Rust by David Peter (sharkdp), addresses these issues by bringing statistical rigor to the command line. It doesn’t just run your code; it performs a controlled experiment.
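
Hyperfine ships as a single binary and is packaged widely; depending on your setup, one of the following should work (availability varies by distro release):

cargo install hyperfine      # build from crates.io (needs a Rust toolchain)
brew install hyperfine       # macOS or Linux with Homebrew
sudo apt install hyperfine   # recent Debian/Ubuntu releases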

Key Features that Elevate Your Benchmarking (combined in the example after this list):

  • Warmup Runs: Prime the caches before recording data.
  • Multiple Runs: Performs at least 10 runs by default to build a reliable distribution.
  • Statistical Analysis: Provides Mean, Standard Deviation, and Min/Max.
  • Relative Comparison: Compare multiple commands with a single line.
  • Parameterized Scans: Test how performance scales with different inputs.
  • Outlier Detection: Warns you if results are inconsistent due to system load.
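
As promised above, here is a sketch that combines several of these features in a single invocation; tool_a, tool_b, and data.txt are placeholders for your own commands and input:

# Prime the caches, take at least 20 measurements per command, label the results,
# and export a Markdown table for your docs
hyperfine --warmup 3 --min-runs 20 \
  -n 'tool A' './tool_a --input data.txt' \
  -n 'tool B' './tool_b --input data.txt' \
  --export-markdown results.md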

3. Mastering the Hyperfine Workflow

Let’s move beyond the basics and look at how a professional engineer uses Hyperfine.

The Basic Comparison

The most common use case is comparing two tools that do the same thing. For example, comparing find with the modern Rust-based fd:

hyperfine 'find . -name "*.md"' 'fd -e md'
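
One caveat: fd skips hidden and .gitignore'd files by default, while find does not, so the two commands above are not doing exactly the same amount of work. For an apples-to-apples comparison, disable that filtering:

# Make fd traverse everything, just like find
hyperfine 'find . -name "*.md"' 'fd --hidden --no-ignore -e md'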

Eliminating Cache Bias with Warmups

To ensure we aren’t measuring the speed of the disk on the first run and RAM on the subsequent ones, use the --warmup flag. This tells Hyperfine to run the command a specific number of times without recording the data.

hyperfine --warmup 3 'grep -r "TODO" .'

Controlling the Sample Size

If a command is very slow (e.g., a full project build), you might want fewer runs. If it’s very fast, you might want more to smooth out the jitter.

# Force exactly 5 runs for a slow build
hyperfine --runs 5 'make -j8'

# Use at least 100 runs for a micro-benchmark
hyperfine --min-runs 100 'echo "hello" | sed "s/h/j/"'

Parameterized Benchmarking (-P)

This is Hyperfine’s “killer feature.” It allows you to run a command while varying a specific parameter. This is invaluable for testing scalability.

# Test how 'xz' compression speed scales with levels 1 through 6
hyperfine --parameter-scan level 1 6 'xz -{level} -c my_large_file.tar > /dev/null'
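
Numeric scans are not the only option. The related --parameter-list (-L) flag substitutes an arbitrary comma-separated set of values, which is handy when the "parameter" is a tool or a flag rather than a number; the compressor comparison below is illustrative:

# Compare three compressors on the same archive
hyperfine --parameter-list compressor gzip,bzip2,xz \
  '{compressor} -c my_large_file.tar > /dev/null'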

4. Deep Dive: Understanding the Statistics

When Hyperfine finishes, it presents a summary. Understanding these numbers is the difference between a coder and a performance engineer.

| Metric | Definition | Importance |
|---|---|---|
| Mean (μ) | The arithmetic average of all runs. | Your “average” expected performance. |
| StdDev (σ) | The amount of variation from the mean. | CRITICAL. A high StdDev (e.g., > 10% of the mean) indicates that your environment is noisy or the command’s performance is unstable. |
| Min / Max | The absolute extremes recorded. | Helps identify worst-case scenarios. |
| Relative Speed | The ratio of each command’s mean time to the fastest command’s mean time. | Tells you “Tool A is 2.5x faster than Tool B.” |
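
If you want to go beyond the summary table, export the raw measurements. The snippet below assumes the JSON schema used by current Hyperfine releases (a results array containing mean, stddev, and the per-run times) and uses jq for inspection:

# Save every individual measurement alongside the summary statistics
hyperfine --warmup 3 --export-json results.json 'fd -e md'

# Peek at the mean, standard deviation, and raw run times of the first command
jq '.results[0] | {mean, stddev, times}' results.json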

Warning: Outlier Detection

If Hyperfine detects that a few runs were significantly slower than the others, it warns you that statistical outliers were detected and suggests re-running the benchmark on a quiet system (the --warmup and --prepare options can also help).

This usually means another process stole your CPU cycles during the benchmark. In these cases, you should close other apps and re-run.

5. Advanced Usage: Shells and Cleanup

Benchmarking Shell Features

By default, Hyperfine runs each command through an intermediate shell (sh -c on Unix-like systems) and subtracts the measured shell startup overhead. If you want to benchmark built-ins, aliases, or syntax that only another shell understands, specify that shell explicitly:

hyperfine --shell zsh 'for i in {1..1000}; do (true); done'
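
The opposite direction is also available in recent Hyperfine versions: for very fast commands where even the corrected shell launch adds noise, --shell=none (short form -N) executes the command directly, with no shell at all (so pipes, globs, and redirections will not work); ./my_binary is a placeholder:

# Time the binary itself, with no intermediate shell process
hyperfine -N './my_binary --version'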

Cleaning Up After Yourself

If your command creates temporary files that shouldn’t be reused in the next run, use the --prepare flag. This runs a command before every single benchmark run.

hyperfine --prepare 'rm -rf ./build_cache' 'make'
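
For expensive one-time work, recent Hyperfine versions offer companion hooks: --setup runs once before a command's timing runs begin, and --cleanup runs once after they finish. The archive and script names below are placeholders:

# Unpack fixtures once per benchmarked command, then tidy up afterwards
hyperfine --setup 'tar xf fixtures.tar' \
  --cleanup 'rm -rf fixtures' \
  './run_tests.sh'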

6. Real-World Case Study: GCC vs. Clang

Let’s say you’re debating whether to switch your production build from GCC to Clang. A simple time run might show Clang is faster, but Hyperfine gives you the full story:

hyperfine --warmup 2 \
  --prepare 'make clean' \
  'make CC=gcc' \
  'make CC=clang' \
  --export-markdown compiler_battle.md

The exported Markdown will give you a professional table ready for your project’s documentation.

7. Tool Comparison: time vs. hyperfine vs. perf

| Feature | /usr/bin/time | Hyperfine | perf (Linux) |
|---|---|---|---|
| Primary Goal | Single-shot timing | Statistical benchmarking | Low-level profiling |
| Warmups | No | Yes | No |
| Statistical Analysis | No | Yes | Basic (perf stat -r) |
| Relative Comparison | Manual | Automatic | No |
| System Overhead | Minimal | Low | Medium/High |
| Best For | “How long did that take?” | “Which tool is faster?” | “Why is my code slow?” |
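
To make the perf column concrete: perf stat counts hardware events (cycles, instructions, cache misses) for a single program and can repeat the measurement with -r. You may need root privileges or a relaxed perf_event_paranoid setting; ./my_binary is a placeholder:

# Count hardware events over 10 repeated runs and report the variance
perf stat -r 10 ./my_binary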

8. Common Pitfalls and FAQ

Q: Why is my standard deviation so high?

A: Usually, this is due to system interference. Close Chrome, Discord, and Slack. Also check whether your CPU is thermal throttling: on laptops, performance often drops after the first few runs as the chip heats up, even once the fan kicks in.

Q: Can I use this for micro-benchmarking?

A: Hyperfine measures “process-level” performance (including startup time). If you want to measure a single C++ or Rust function, use a library-level benchmarking harness like google/benchmark or Criterion.rs.

Q: How do I handle commands that need input?

A: You can pipe data into the command inside the string:

hyperfine 'cat data.txt | ./my_processor'

Conclusion

Performance is a science, not a feeling. By using Hyperfine, you move away from the “works on my machine” anecdotes and toward cold, hard, reproducible data. Whether you are optimizing a CI/CD pipeline, comparing compression algorithms, or just deciding which CLI tool deserves a place in your dotfiles, let the math do the talking.

Stop guessing. Start measuring.
