How to Benchmark Anything with Time, Hyperfine, and More
Performance is paramount. Whether you’re a developer optimizing a critical algorithm, a system administrator troubleshooting a slow server, or a user curious about the efficiency of two competing command-line tools, understanding how to accurately measure execution time and resource consumption is an invaluable skill.
This post will guide you through the essentials of benchmarking, starting with the simplest tools and progressing to sophisticated statistical approaches and in-depth profiling. Our goal is to equip you with the knowledge to perform reliable, repeatable measurements and derive meaningful insights.
Why Benchmark Anything?
Before we dive into the “how,” let’s quickly address the “why”:
- Identify Bottlenecks: Pinpoint the exact parts of your code or system that are slowing things down.
- Validate Optimizations: Confirm that your changes actually improve performance, rather than just increasing complexity.
- Compare Alternatives: Objectively decide between different algorithms, libraries, or system configurations.
- Prevent Regressions: Integrate performance tests into your continuous integration (CI) pipeline to catch slowdowns before they reach production.
- Understand System Behavior: Gain a deeper understanding of how your software interacts with hardware and the operating system.
Reliable benchmarking isn’t just about getting a number; it’s about making informed decisions.
The Humble `time` Command: Quick and Dirty
Almost every Unix-like system comes with the `time` command, which is the simplest way to measure how long a command takes to execute.
How it Works
Simply prepend `time` to any command you want to measure:
time your_command_here
After `your_command_here` finishes, `time` will print a summary of the resources used. While its exact output can vary depending on your shell (bash, zsh, etc.) and the specific `time` utility (there’s a built-in shell `time` and a separate `/usr/bin/time`), the core metrics are usually:
- `real` (or `elapsed`): The wall-clock time, from when the command started to when it finished. It includes time spent waiting for I/O, other processes, or anything else that contributes to the total elapsed duration. This is typically what a user experiences.
- `user`: The amount of CPU time spent executing in user mode (i.e., running your program’s code, not system calls).
- `sys`: The amount of CPU time spent executing in kernel mode on behalf of your program (i.e., performing system calls like reading files, network operations, etc.).
Example:
Let’s time a simple `sleep` command:
$ time sleep 1
real 0m1.002s
user 0m0.000s
sys 0m0.001s
Here, `real` is just over 1 second, as expected. The `user` and `sys` times are negligible because `sleep` spends most of its time waiting, not actively computing.
Now, a more compute-intensive example:
$ time python -c "sum(range(10**7))"
real 0m0.375s
user 0m0.370s
sys 0m0.004s
Notice how `user` time closely tracks `real` time here, indicating that the Python interpreter was busy calculating in user space.
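The three default numbers are often all you need, but the standalone GNU `time` binary mentioned above (as opposed to the shell built-in) can report quite a bit more. A minimal sketch, assuming GNU time is installed at `/usr/bin/time` on a Linux system:
# -v (GNU time) prints extended statistics such as maximum resident set size,
# page faults, and context switches; on macOS/BSD, /usr/bin/time -l is the rough equivalent.
/usr/bin/time -v python -c "sum(range(10**7))"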
Limitations of `time`
While `time` is great for quick checks, it has significant limitations for serious benchmarking:
- Single Run: It only runs your command once. Performance can vary wildly due to background processes, CPU cache states, disk caching, and other transient system conditions. A single measurement is rarely reliable.
- No Statistics: It provides no average, standard deviation, or other statistical measures to understand the variability of your command’s execution time.
- No Warm-up: Many applications or systems have “cold start” overhead (e.g., JIT compilation, loading data into cache). `time` doesn’t account for this, potentially skewing results.
- No Comparison: You have to manually run multiple commands and compare their outputs, which is tedious and error-prone.
For anything more than a rough estimate, you need something more robust.
Enter `hyperfine`: The Statistical Powerhouse
`hyperfine` is a command-line benchmarking tool that provides robust statistical analysis, making it ideal for comparing different commands or testing the impact of code changes. It’s written in Rust, cross-platform, and offers a beautiful, clear output.
Why `hyperfine` is Superior
`hyperfine` addresses all the limitations of `time` and more:
- Multiple Runs: It runs commands multiple times, collecting a dataset of execution times.
- Statistical Analysis: It calculates minimum, maximum, mean, median, and standard deviation, giving you a much clearer picture of performance.
- Warm-up Runs: It performs initial “warm-up” runs that are discarded, ensuring that the actual measurements aren’t affected by cold caches or initial setup.
- Statistical Comparison: When comparing multiple commands, it uses statistical tests to determine if differences are significant.
- Setup/Cleanup: Allows you to define commands to run before and after each benchmarked command.
- Clear Output: Its tabular and color-coded output is easy to read and interpret.
- Export Formats: Can export results to CSV, JSON, Markdown, and more for further analysis.
Installation
`hyperfine` can be installed via various package managers:
- Rust’s Cargo (recommended for latest version):
cargo install hyperfine
- Homebrew (macOS/Linux):
brew install hyperfine
- Apt (Debian/Ubuntu):
sudo apt install hyperfine
- Pacman (Arch Linux):
sudo pacman -S hyperfine
- Chocolatey (Windows):
choco install hyperfine
For more options, check the hyperfine GitHub repository.
Basic Usage
Benchmarking a single command is straightforward:
hyperfine 'your_command_here'
Example:
$ hyperfine 'sleep 0.1'
Benchmark #1: sleep 0.1
Time (mean ± σ): 100.2 ms ± 0.6 ms [User: 0.0 ms, System: 0.2 ms]
Range (min … max): 99.7 ms … 101.4 ms 10 runs
The output shows the mean execution time, its standard deviation (σ), and the range, along with user and system times. It also tells you how many runs were performed.
Comparing Commands
This is where `hyperfine` truly shines. You can compare two or more commands by listing them:
hyperfine 'command_A' 'command_B' 'command_C'
Example: `grep` vs. `cat | grep`
Let’s compare the efficiency of piping `cat` to `grep` versus just using `grep` directly on a large file. First, create a large dummy file:
head -c 100MB /dev/urandom > large_file.txt
Now, benchmark:
$ hyperfine 'cat large_file.txt | grep -c "xyz"' 'grep -c "xyz" large_file.txt'
Benchmark #1: cat large_file.txt | grep -c "xyz"
Time (mean ± σ): 27.3 ms ± 0.5 ms [User: 2.7 ms, System: 15.6 ms]
Range (min … max): 26.7 ms … 28.1 ms 88 runs
Benchmark #2: grep -c "xyz" large_file.txt
Time (mean ± σ): 13.5 ms ± 0.3 ms [User: 13.0 ms, System: 0.4 ms]
Range (min … max): 13.2 ms … 14.1 ms 186 runs
Summary
'grep -c "xyz" large_file.txt' ran
2.02 ± 0.06 times faster than 'cat large_file.txt | grep -c "xyz"'
The summary clearly shows that running `grep` directly on the file is about twice as fast, as it avoids the overhead of creating a pipe and an extra process (`cat`).
Advanced `hyperfine` Options
- `--warmup <N>`: Perform N warm-up runs that are not timed. Crucial for JIT-compiled languages or disk caching effects.
- `--runs <N>`: Perform exactly N benchmark runs (by default, hyperfine chooses the run count automatically, with a minimum of 10). Increase for higher precision, especially if your command has high variability.
- `--setup <command>`: A command to run once before the timing runs of each benchmarked command. Useful for one-time preparation such as compiling or creating test data.
- `--prepare <command>`: A command to run before every single timing run. Useful for resetting state between runs, e.g., recreating temporary files or restoring a database snapshot.
- `--cleanup <command>`: A command to run once after all timing runs of a benchmarked command have finished. Useful for deleting temporary files.
- `--export-csv <file>`, `--export-json <file>`, `--export-markdown <file>`: Export results for programmatic analysis or reporting (see the example after this list).
- `--ignore-failure`: Continue benchmarking even if a command returns a non-zero exit code.
- `--shell <shell>`: Specify the shell to use (e.g., `bash`, `zsh`).
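Exporting results is particularly handy for tracking performance over time, for example in CI. A small sketch re-using the earlier comparison; results.json and results.md are just example file names:
hyperfine --warmup 3 \
  --export-json results.json \
  --export-markdown results.md \
  'cat large_file.txt | grep -c "xyz"' \
  'grep -c "xyz" large_file.txt'
# results.json contains every individual timing; results.md is a ready-to-paste comparison table.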
Example: Benchmarking file creation and deletion
hyperfine --runs 5 \
  --prepare 'mkdir -p tmp_dir && touch tmp_dir/test_file.txt' \
  --cleanup 'rm -rf tmp_dir' \
  'touch tmp_dir/test_file.txt' \
  'rm tmp_dir/test_file.txt'
# The prepare command runs before every timing run, so the directory and file always
# exist for both benchmarks; the cleanup command removes tmp_dir after each benchmark finishes.
`hyperfine` is an indispensable tool for anyone serious about measuring command-line or script performance.
Beyond Wall Clock: Profiling for Insights
While `time` and `hyperfine` tell you how long something takes, they don’t tell you why. For that, you need profiling tools. Profilers analyze your program’s execution to show you where time is being spent – which functions are called most often, which consume the most CPU cycles, or where memory is being allocated.
Benchmarking answers “Is it fast enough?” or “Is A faster than B?”. Profiling answers “Where is the slowdown?” or “Why is it slow?”.
System-Wide Profiling Tools
These tools operate at the operating system level, giving you insights into process behavior, system calls, and even hardware performance counters.
- `perf` (Linux): `perf` is a powerful performance analysis tool built into the Linux kernel. It can sample CPU activity, count hardware events (e.g., cache misses, branch mispredictions), and trace system calls. It’s often used to find CPU bottlenecks (a `perf stat` sketch follows this list).
  # Record performance data for a command
  sudo perf record -g your_command_here
  # Analyze the recorded data (shows a call graph, hot functions)
  perf report
  `perf` can be intimidating due to its depth, but it’s essential for low-level performance analysis. Learn more about `perf` on the Linux perf wiki.
- `strace` (Linux): `strace` traces system calls made by a process and the signals it receives. It’s invaluable for debugging I/O-bound issues, permissions problems, or understanding how a program interacts with the kernel.
  # Trace all system calls
  strace your_command_here
  # Summarize system call counts and times
  strace -c your_command_here
  If your program spends a lot of `sys` time according to `time`, `strace -c` can tell you which system calls are consuming that time. Check the `strace` man page.
- `ltrace` (Linux): Similar to `strace`, but `ltrace` intercepts and records calls to dynamic library functions (e.g., functions from `libc`). Useful for understanding interactions with common libraries.
  ltrace your_command_here
- `DTrace` (macOS/FreeBSD/Solaris): `DTrace` is a comprehensive dynamic tracing framework. It allows you to create custom scripts (using the D language) to observe almost anything happening on your system in real time, from file system I/O to network activity to specific function calls within processes. It’s extremely powerful but has a steeper learning curve. Explore DTrace further.
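As promised above, counting hardware events with `perf stat` is often the quickest first look before recording full profiles. A minimal sketch; the event names are common ones, but availability depends on your CPU and kernel:
# Count basic hardware/software events for one run of the command.
sudo perf stat -e task-clock,cycles,instructions,cache-misses,branch-misses your_command_here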
Memory and Call Graph Profilers
These tools focus on the execution flow and memory usage within your application.
- `Valgrind` (Linux): `Valgrind` is a suite of debugging and profiling tools. While often used for memory error detection (`memcheck`), its `callgrind` tool is an excellent call-graph profiler (a memory-profiling sketch follows this list).
  # Run your program under callgrind
  valgrind --tool=callgrind your_program_here
  # Visualize the results with KCachegrind (GUI tool)
  kcachegrind callgrind.out.<PID>
  `Valgrind` executes your program in a virtual machine, making it very slow (10x-100x slowdown), but it provides incredibly detailed information about function calls, inclusive/exclusive costs, and cache behavior. Visit the Valgrind website.
- `gprof` (GNU Profiler): For C/C++ programs compiled with `gcc -pg`, `gprof` can generate flat profiles (time spent in each function) and call graphs (who called whom).
  gcc -pg my_program.c -o my_program
  ./my_program
  gprof my_program gmon.out > profile_output.txt
  `gprof` is simpler than `perf` or `Valgrind` but provides useful insights for compiled code. Consult the gprof manual.
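On the memory side, Valgrind’s suite also includes `massif`, a heap profiler, which helps when memory usage rather than CPU time is the concern. A minimal sketch:
# Profile heap allocations over the program’s lifetime.
valgrind --tool=massif your_program_here
# Massif writes massif.out.<PID>; summarize it in the terminal with ms_print.
ms_print massif.out.<PID>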
Language-Specific Profilers
Most modern programming languages come with built-in or widely used third-party profilers:
- Python: The `cProfile` module (or `profile`, its pure-Python counterpart) offers deterministic profiling. Tools like `SnakeViz` or `pyinstrument` can visualize the results (see the example after this list).
- Node.js: Use the `--inspect` flag to enable the V8 inspector, then connect with Chrome DevTools or a dedicated profiler.
- Java: `JVisualVM`, `JProfiler`, and `YourKit` are popular tools.
- Go: The `pprof` package is excellent for CPU, memory, and blocking profiles.
- Ruby: `StackProf` for CPU profiling.
- PHP: `Xdebug` or `Blackfire.io`.
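For instance, Python’s `cProfile` can be run straight from the command line without touching your code; my_script.py is a placeholder:
# Profile a script and sort the report by cumulative time per function.
python -m cProfile -s cumulative my_script.py
# Or save the raw stats for later visualization (e.g., with SnakeViz).
python -m cProfile -o profile.out my_script.py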
Note: When using language-specific profilers, ensure you understand what metric they are optimizing for (e.g., CPU time, wall clock time, garbage collection pauses) and how they handle I/O or system calls, as these might not be directly in the “profiled” code.
Resource Monitoring: Understanding the Footprint
While not strictly benchmarking, monitoring tools provide crucial context during benchmarks. They show you how your process impacts the system’s CPU, memory, disk I/O, and network. A process that’s “fast” but consumes all your RAM might not be a win.
- `top`/`htop`/`glances`: Interactive process viewers that show real-time CPU usage, memory consumption, running processes, and more. `htop` and `glances` are enhanced versions with better UIs and more features than the basic `top`.
- `sar` (System Activity Reporter): Collects, reports, or saves system activity information. Great for historical analysis of CPU utilization, memory paging, disk I/O, network stats, etc.
- `iostat`: Reports CPU utilization and disk I/O statistics (reads/writes per second, block size, queue length). Essential for I/O-bound benchmarks.
- `vmstat`: Reports on virtual memory statistics (processes, memory, paging, block I/O, traps, CPU activity).
- `netstat`/`ss`: Show network connections, routing tables, interface statistics. Useful for network-bound benchmarks.
- `du`/`df`: Check disk usage and free space.
Use these tools while your benchmark is running to see if your “bottleneck” is truly CPU, or if you’re hitting disk limits, memory pressure, or network saturation.
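One simple pattern is to start a monitor in the background (or in a second terminal) just before launching the benchmark. A sketch assuming the sysstat tools are installed; the log file and variable names are just examples:
# Sample extended disk and CPU statistics every second while the benchmark runs.
iostat -x 1 > iostat.log &
MONITOR_PID=$!
hyperfine 'grep -c "xyz" large_file.txt'
kill "$MONITOR_PID"   # stop the monitor once the benchmark finishes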
Specialized Benchmarking and Load Testing
For specific types of applications, general-purpose tools might not be enough.
Web and API Benchmarking
When testing web servers or APIs, you’re often interested in metrics like Requests Per Second (RPS), latency, throughput, and error rates under load.
- `ApacheBench (ab)`: A simple command-line tool for HTTP server benchmarking.
  ab -n 1000 -c 100 http://localhost:8080/index.html
  # -n: total requests, -c: concurrency
- `wrk`: A modern HTTP benchmarking tool that can generate significant load on a single multi-core CPU. It’s often much faster than `ab`.
  wrk -t 4 -c 100 -d 30s http://localhost:8080/
  # -t: threads, -c: connections, -d: duration
- `JMeter` (Apache): A sophisticated, GUI-based tool capable of comprehensive load and performance testing for various protocols (HTTP, FTP, databases, SOAP/REST web services, etc.). It can simulate complex user scenarios.
- `Locust` (Python): A powerful, scriptable, and distributed load testing tool. You write your load tests in Python code, which allows for very flexible scenario definitions.
- `k6`: A modern, open-source load testing tool focused on developer experience, scriptable in JavaScript, and designed for testing APIs and microservices.
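Before reaching for a full load-testing tool, it can be worth sanity-checking single-request latency. One way, assuming `curl` is available and treating the URL as a placeholder, is curl’s built-in timing variables:
# Print connection time and total time for a single request.
curl -o /dev/null -s -w 'connect: %{time_connect}s  total: %{time_total}s\n' http://localhost:8080/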
Database Benchmarking
For databases, specialized benchmarks simulate typical database workloads.
- `sysbench`: A modular and cross-platform benchmark tool for evaluating OS parameters that are important for a heavily loaded system, often used for database benchmarks (OLTP, point selects, etc.).
- `pgbench`: A simple program for running benchmark tests on PostgreSQL (see the example after this list).
- `tpcc-mysql`: A common benchmark for MySQL that simulates an online transaction processing (OLTP) workload.
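As a rough sketch of how such tools are used, `pgbench` first initializes its test tables and then runs a timed workload; the database name mydb and the parameters are illustrative:
# Initialize pgbench’s tables at scale factor 10 (about 1 million rows in pgbench_accounts).
pgbench -i -s 10 mydb
# Run the built-in TPC-B-like workload: 10 clients, 2 worker threads, for 60 seconds.
pgbench -c 10 -j 2 -T 60 mydb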
Best Practices for Reliable Benchmarking
Measuring performance accurately is a subtle art. Here are key principles to follow:
- Control the Environment:
  - Isolate your tests: Run benchmarks on a dedicated machine or a clean virtual machine/container if possible, to minimize interference from background processes, network traffic, or other users.
  - Disable power-saving features: Modern CPUs often throttle their speed to save power. Disable CPU frequency scaling, “turbo boost,” or other dynamic power management features that can introduce variability.
  - Consistent input data: Use the same input files, database states, or network conditions for all comparative tests. Randomness introduces noise.
- Run Multiple Times and Average: As we saw with `hyperfine`, a single measurement is meaningless. Run your benchmark many times and use the average (mean) and standard deviation to understand the typical performance and its variability. The more variable your results, the more runs you need.
- Account for Warm-up: Many systems (JVMs, JIT compilers, disk caches, databases) have initial overhead or “cold start” effects. Use `hyperfine`’s `--warmup` flag or design your scripts to perform some dummy operations before actual measurements begin.
- Measure the Right Thing:
  - Wall-clock time vs. CPU time: Understand the difference (`real` vs. `user`/`sys`). Wall-clock time is often what matters to the user, but CPU time reveals the computational burden.
  - Focus on relevant metrics: Are you optimizing for throughput (items/second), latency (time per request), memory usage, or cold start time? Tailor your metrics to your goal.
  - Beware of Micro-benchmarks: Testing a single line of code in isolation might yield impressive speedups, but if that line is rarely executed in a real-world scenario, the overall system performance won’t improve. Benchmark representative workloads.
- Statistical Significance:
  - Understand standard deviation (σ). A small standard deviation indicates consistent results. A large one suggests high variability or external factors influencing your benchmark.
  - When comparing two results, consider whether the difference is statistically significant. `hyperfine` does this automatically, but if doing manual analysis, look into t-tests or similar.
- Document Everything: Record your:
  - Hardware specifications (CPU, RAM, disk type).
  - Operating system version and patch level.
  - Software versions (language runtime, libraries, compilers).
  - Exact benchmark commands and input parameters.
  - Environment variables or system configurations.
  This allows for reproducibility and comparison over time.
- Visualize Your Results: Graphs (bar charts for comparisons, line graphs for trends) can make performance data much easier to interpret than raw numbers. Tools like Gnuplot, Matplotlib (Python), or even spreadsheets can help.
- Don’t Optimize Prematurely: This is a classic programming adage. Don’t spend time optimizing code that isn’t a bottleneck. Benchmark first to identify the real bottlenecks, then optimize, and then benchmark again to verify the improvement.
- Beware of Caching Effects: File system caches, CPU caches (L1, L2, L3), and even network caches can significantly skew results, especially on repeated runs. For disk I/O benchmarks, you might need to drop caches between runs on Linux (e.g., `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches` – use with caution on production systems!). Note that a plain `sudo echo 3 > /proc/sys/vm/drop_caches` would fail, because the redirection is performed by your unprivileged shell. See the sketch after the note below for wiring this into hyperfine.
Note: Benchmarking I/O-bound tasks can be particularly challenging due to the complex interplay of disk speeds, file system overhead, and operating system caching. Always confirm cache effects are handled appropriately for your specific test.
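As referenced above, cache dropping can be wired directly into `hyperfine` so it happens before every timing run. A minimal sketch, assuming a Linux system where sudo can run tee without prompting:
# Drop the page cache before each timing run so every run starts with a cold cache.
hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
  'grep -c "xyz" large_file.txt'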
Conclusion
Benchmarking is an essential skill in the modern tech landscape. It transforms guesswork into data-driven decision-making, allowing you to build faster, more efficient, and more reliable systems.
Start simple with `time` for quick checks. Graduate to `hyperfine` for rigorous statistical analysis and reliable comparisons. When you need to understand why something is slow, dive into profiling with tools like `perf`, `strace`, `Valgrind`, or language-specific profilers. And always complement your benchmarks with resource monitoring to understand the full system impact.
By adopting these tools and best practices, you’ll be well-equipped to measure, understand, and ultimately improve the performance of anything you put your mind to. Happy benchmarking!