How to Benchmark Anything with Time, Hyperfine, and More

A close-up of a hand with a pen analyzing data on colorful bar and line charts on paper.

Performance is paramount. Whether you’re a developer optimizing a critical algorithm, a system administrator troubleshooting a slow server, or a user curious about the efficiency of two competing command-line tools, understanding how to accurately measure execution time and resource consumption is an invaluable skill.

This post will guide you through the essentials of benchmarking, starting with the simplest tools and progressing to sophisticated statistical approaches and in-depth profiling. Our goal is to equip you with the knowledge to perform reliable, repeatable measurements and derive meaningful insights.

Why Benchmark Anything?

Before we dive into the “how,” let’s quickly address the “why”:

  • Identify Bottlenecks: Pinpoint the exact parts of your code or system that are slowing things down.
  • Validate Optimizations: Confirm that your changes actually improve performance, rather than just increasing complexity.
  • Compare Alternatives: Objectively decide between different algorithms, libraries, or system configurations.
  • Prevent Regressions: Integrate performance tests into your continuous integration (CI) pipeline to catch slowdowns before they reach production.
  • Understand System Behavior: Gain a deeper understanding of how your software interacts with hardware and the operating system.

Reliable benchmarking isn’t just about getting a number; it’s about making informed decisions.

The Humble time Command: Quick and Dirty

Almost every Unix-like system comes with the time command, which is the simplest way to measure how long a command takes to execute.

How it Works

Simply prepend time to any command you want to measure:

time your_command_here

After your_command_here finishes, time will print a summary of the resources used. While its exact output can vary depending on your shell (bash, zsh, etc.) and the specific time utility (there’s a built-in shell time and a separate /usr/bin/time), the core metrics are usually:

  • real (or elapsed): This is the wall-clock time, from when the command started to when it finished. It includes time spent waiting for I/O, other processes, or anything else that contributes to the total elapsed duration. This is typically what a user experiences.
  • user: The amount of CPU time spent executing in user mode (i.e., running your program’s code, not system calls).
  • sys: The amount of CPU time spent executing in kernel mode on behalf of your program (i.e., performing system calls like reading files, network operations, etc.).

Example:

Let’s time a simple sleep command:

$ time sleep 1

real    0m1.002s
user    0m0.000s
sys     0m0.001s

Here, real is just over 1 second, as expected. user and sys times are negligible because sleep spends most of its time waiting, not actively computing.

Now, a more compute-intensive example:

$ time python -c "sum(range(10**7))"

real    0m0.375s
user    0m0.370s
sys     0m0.004s

Notice how user time closely tracks real time here, indicating that the Python interpreter was busy calculating in user space.
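
For a bit more detail than the shell built-in provides, the external GNU time binary (usually /usr/bin/time on Linux) can report extra statistics such as peak memory usage. A quick sketch, assuming GNU time is available:

/usr/bin/time -v python -c "sum(range(10**7))"
# The -v (verbose) output includes "Maximum resident set size", page faults, and
# context switches. On macOS/BSD, /usr/bin/time -l is the rough equivalent.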

Limitations of time

While time is great for quick checks, it has significant limitations for serious benchmarking:

  1. Single Run: It only runs your command once. Performance can vary wildly due to background processes, CPU cache states, disk caching, and other transient system conditions. A single measurement is rarely reliable.
  2. No Statistics: It provides no average, standard deviation, or other statistical measures to understand the variability of your command’s execution time.
  3. No Warm-up: Many applications or systems have “cold start” overhead (e.g., JIT compilation, loading data into cache). time doesn’t account for this, potentially skewing results.
  4. No Comparison: You have to manually run multiple commands and compare their outputs, which is tedious and error-prone.

For anything more than a rough estimate, you need something more robust.

Enter hyperfine: The Statistical Powerhouse

hyperfine is a command-line benchmarking tool that provides robust statistical analysis, making it ideal for comparing different commands or testing the impact of code changes. It’s written in Rust, cross-platform, and offers a beautiful, clear output.

Why hyperfine is Superior

hyperfine addresses all the limitations of time and more:

  • Multiple Runs: It runs commands multiple times, collecting a dataset of execution times.
  • Statistical Analysis: It calculates minimum, maximum, mean, median, and standard deviation, giving you a much clearer picture of performance.
  • Warm-up Runs: With the --warmup flag, it performs initial untimed runs that are discarded, so the real measurements aren’t skewed by cold caches or one-time setup costs.
  • Statistical Comparison: When comparing multiple commands, it reports how many times faster one is than the other along with an uncertainty estimate, so you can judge whether the difference exceeds the measurement noise.
  • Setup/Cleanup: Allows you to define commands to run before and after each benchmarked command.
  • Clear Output: Its tabular and color-coded output is easy to read and interpret.
  • Export Formats: Can export results to CSV, JSON, Markdown, and more for further analysis.

Installation

hyperfine can be installed via various package managers:

  • Rust’s Cargo (recommended for latest version):
    cargo install hyperfine
  • Homebrew (macOS/Linux):
    brew install hyperfine
  • Apt (Debian/Ubuntu):
    sudo apt install hyperfine
  • Pacman (Arch Linux):
    sudo pacman -S hyperfine
  • Chocolatey (Windows):
    choco install hyperfine

For more options, check the hyperfine GitHub repository.

Basic Usage

Benchmarking a single command is straightforward:

hyperfine 'your_command_here'

Example:

$ hyperfine 'sleep 0.1'

Benchmark #1: sleep 0.1
  Time (mean ± σ):      100.2 ms ±   0.6 ms    [User: 0.0 ms, System: 0.2 ms]
  Range (min … max):    99.7 ms … 101.4 ms    10 runs

The output shows the mean execution time, its standard deviation (σ), and the range, along with user and system times. It also tells you how many runs were performed.
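
Two flags you will reach for constantly are --warmup and --runs (both covered in more detail below); the values here are illustrative rather than recommendations:

# Discard 3 untimed warm-up runs, then time exactly 50 measured runs
hyperfine --warmup 3 --runs 50 'sleep 0.1'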

Comparing Commands

This is where hyperfine truly shines. You can compare two or more commands by listing them:

hyperfine 'command_A' 'command_B' 'command_C'

Example: grep vs. cat | grep

Let’s compare the efficiency of piping cat to grep versus just using grep directly on a large file. First, create a large dummy file:

head -c 100MB /dev/urandom > large_file.txt

Now, benchmark:

$ hyperfine 'cat large_file.txt | grep -c "xyz"' 'grep -c "xyz" large_file.txt'

Benchmark #1: cat large_file.txt | grep -c "xyz"
  Time (mean ± σ):      27.3 ms ±   0.5 ms    [User: 2.7 ms, System: 15.6 ms]
  Range (min … max):    26.7 ms … 28.1 ms    88 runs

Benchmark #2: grep -c "xyz" large_file.txt
  Time (mean ± σ):      13.5 ms ±   0.3 ms    [User: 13.0 ms, System: 0.4 ms]
  Range (min … max):    13.2 ms … 14.1 ms    186 runs

Summary
  'grep -c "xyz" large_file.txt' ran
    2.02 ± 0.06 times faster than 'cat large_file.txt | grep -c "xyz"'

The summary clearly shows that grep directly on the file is twice as fast, as it avoids the overhead of creating a pipe and an extra process (cat).

Advanced hyperfine Options

  • --warmup <N>: Perform N warm-up runs that are not timed. Crucial for JIT-compiled languages or disk caching effects.
  • --runs <N>: Perform exactly N benchmark runs (without this flag, hyperfine performs at least 10, and more for fast commands). Increase it for higher precision, especially if your command has high variability.
  • --setup <command>: A command run once before the timing runs of each benchmarked command. Useful for compiling sources or preparing input data.
  • --cleanup <command>: A command run once after the timing runs of each benchmarked command finish. Useful for deleting temporary files.
  • --prepare <command>: A command run before every individual timing run. Useful for resetting state between runs, such as recreating files or clearing caches.
  • --export-csv <file>, --export-json <file>, --export-markdown <file>: Export results for programmatic analysis or reporting.
  • --ignore-failure: Continue benchmarking even if a command returns a non-zero exit code.
  • --shell <shell>: Specify the shell to use (e.g., bash, zsh).

Example: Benchmarking file deletion

hyperfine --runs 5 \
          --setup 'mkdir -p tmp_dir' \
          --prepare 'touch tmp_dir/test_file.txt' \
          --cleanup 'rm -rf tmp_dir' \
          'rm tmp_dir/test_file.txt'

# --setup creates tmp_dir once, --prepare recreates test_file.txt before every timing run
# (so 'rm' always has something to delete), and --cleanup removes tmp_dir at the end
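
The export options pair nicely with comparisons. A minimal sketch reusing large_file.txt from earlier, assuming gzip and xz are installed (the output file names are arbitrary):

hyperfine --warmup 3 \
          --export-json results.json \
          --export-markdown results.md \
          'gzip -k -f large_file.txt' \
          'xz -k -f large_file.txt'

# results.md can be pasted straight into a report or README;
# results.json contains every individual timing for further analysis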

hyperfine is an indispensable tool for anyone serious about measuring command-line or script performance.

Beyond Wall Clock: Profiling for Insights

While time and hyperfine tell you how long something takes, they don’t tell you why. For that, you need profiling tools. Profilers analyze your program’s execution to show you where time is being spent – which functions are called most often, which consume the most CPU cycles, or where memory is being allocated.

Benchmarking answers “Is it fast enough?” or “Is A faster than B?”. Profiling answers “Where is the slowdown?” or “Why is it slow?”.

System-Wide Profiling Tools

These tools operate at the operating system level, giving you insights into process behavior, system calls, and even hardware performance counters.

  1. perf (Linux): perf is a powerful performance analysis tool built into the Linux kernel. It can sample CPU activity, count hardware events (e.g., cache misses, branch mispredictions), and trace system calls. It’s often used to find CPU bottlenecks.

    # Record performance data for a command
    sudo perf record -g your_command_here
    
    # Analyze the recorded data (shows a call graph, hot functions)
    perf report

    perf can be intimidating due to its depth, but it’s essential for low-level performance analysis. Learn more about perf on the Linux perf wiki.

  2. strace (Linux): strace traces system calls made by a process and the signals it receives. It’s invaluable for debugging I/O-bound issues, permissions problems, or understanding how a program interacts with the kernel.

    # Trace all system calls
    strace your_command_here
    
    # Summarize system call counts and times
    strace -c your_command_here

    If your program spends a lot of sys time according to time, strace -c can tell you which system calls are consuming that time. Check the strace man page.

  3. ltrace (Linux): Similar to strace, but ltrace intercepts and records calls to dynamic library functions (e.g., functions from libc). Useful for understanding interactions with common libraries.

    ltrace your_command_here
  4. DTrace (macOS/FreeBSD/Solaris): DTrace is a comprehensive dynamic tracing framework. It allows you to create custom scripts (using the D language) to observe almost anything happening on your system in real-time, from file system I/O to network activity to specific function calls within processes. It’s extremely powerful but has a steeper learning curve. Explore DTrace further.
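
    To give a flavor of what DTrace one-liners look like, the classic example below counts system calls by name across the whole system; it requires root, and recent macOS releases may restrict it via System Integrity Protection:

    # Count system calls system-wide, grouped by syscall name; press Ctrl-C to print the totals
    sudo dtrace -n 'syscall:::entry { @[probefunc] = count(); }'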

Memory and Call Graph Profilers

These tools focus on the execution flow and memory usage within your application.

  1. Valgrind (Linux): Valgrind is a suite of debugging and profiling tools. While often used for memory error detection (memcheck), its callgrind tool is an excellent call-graph profiler.

    # Run your program under callgrind
    valgrind --tool=callgrind your_program_here
    
    # Visualize the results with KCachegrind (GUI tool)
    kcachegrind callgrind.out.<PID>

    Valgrind executes your program in a virtual machine, making it very slow (10x-100x slowdown), but it provides incredibly detailed information about function calls, inclusive/exclusive costs, and cache behavior. Visit the Valgrind website.

  2. gprof (GNU Profiler): For C/C++ programs compiled with gcc -pg, gprof can generate flat profiles (time spent in each function) and call graphs (who called whom).

    gcc -pg my_program.c -o my_program
    ./my_program
    gprof my_program gmon.out > profile_output.txt

    gprof is simpler than perf or Valgrind but provides useful insights for compiled code. Consult the gprof manual.

Language-Specific Profilers

Most modern programming languages come with built-in or widely used third-party profilers:

  • Python: The cProfile module (or profile for pure Python) offers deterministic profiling. Tools like SnakeViz or pyinstrument can visualize the results.
  • Node.js: Use the --inspect flag to enable the V8 inspector, then connect with Chrome DevTools or a dedicated profiler.
  • Java: JVisualVM, JProfiler, and YourKit are popular tools.
  • Go: The pprof package is excellent for CPU, memory, and blocking profiles.
  • Ruby: StackProf for CPU profiling.
  • PHP: Xdebug or Blackfire.io.

Note: When using language-specific profilers, ensure you understand what metric they are optimizing for (e.g., CPU time, wall clock time, garbage collection pauses) and how they handle I/O or system calls, as these might not be directly in the “profiled” code.
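
As a concrete illustration of the Python case, cProfile can be driven entirely from the command line; the script name below is just a placeholder:

# Print a profile of the whole script, sorted by cumulative time per function
python -m cProfile -s cumtime my_script.py

# Or save the raw stats and explore them interactively with SnakeViz (if installed)
python -m cProfile -o profile.out my_script.py
snakeviz profile.out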

Resource Monitoring: Understanding the Footprint

While not strictly benchmarking, monitoring tools provide crucial context during benchmarks. They show you how your process impacts the system’s CPU, memory, disk I/O, and network. A process that’s “fast” but consumes all your RAM might not be a win.

  • top/htop/glances: These are interactive process viewers that show real-time CPU usage, memory consumption, running processes, and more. htop and glances are enhanced versions with better UIs and more features than the basic top.
  • sar (System Activity Reporter): Collects, reports, or saves system activity information. Great for historical analysis of CPU utilization, memory paging, disk I/O, network stats, etc.
  • iostat: Reports CPU utilization and disk I/O statistics (reads/writes per second, block size, queue length). Essential for I/O-bound benchmarks.
  • vmstat: Reports on virtual memory statistics (processes, memory, paging, block I/O, traps, CPU activity).
  • netstat/ss: Show network connections, routing tables, interface statistics. Useful for network-bound benchmarks.
  • du/df: Check disk usage and free space.

Use these tools while your benchmark is running to see if your “bottleneck” is truly CPU, or if you’re hitting disk limits, memory pressure, or network saturation.
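
A simple pattern is to keep one or two of these running in separate terminals while your benchmark executes; the one-second sampling interval below is just a convenient default:

# Terminal 1: extended disk and CPU statistics, refreshed every second
iostat -x 1

# Terminal 2: memory, paging, and run-queue activity, refreshed every second
vmstat 1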

Specialized Benchmarking and Load Testing

For specific types of applications, general-purpose tools might not be enough.

Web and API Benchmarking

When testing web servers or APIs, you’re often interested in metrics like Requests Per Second (RPS), latency, throughput, and error rates under load.

  • ApacheBench (ab): A simple command-line tool for HTTP server benchmarking.

    ab -n 1000 -c 100 http://localhost:8080/index.html
    # -n: total requests, -c: concurrency

    ApacheBench documentation

  • wrk: A modern HTTP benchmarking tool that can generate significant load on a single multi-core CPU. It’s often much faster than ab.

    wrk -t 4 -c 100 -d 30s http://localhost:8080/
    # -t: threads, -c: connections, -d: duration

    wrk GitHub repository

  • JMeter (Apache): A sophisticated, GUI-based tool capable of comprehensive load and performance testing for various protocols (HTTP, FTP, databases, SOAP/REST web services, etc.). It can simulate complex user scenarios.

  • Locust (Python): A powerful, scriptable, and distributed load testing tool. You write your load tests in Python code, which allows for very flexible scenario definitions.

  • k6: A modern, open-source load testing tool focused on developer experience, scriptable in JavaScript, and designed for testing APIs and microservices.

Database Benchmarking

For databases, specialized benchmarks simulate typical database workloads.

  • sysbench: A modular and cross-platform benchmark tool for evaluating OS parameters that are important for a heavily loaded system, often used for database benchmarks (OLTP, point selects, etc.).
  • pgbench: A simple program for running benchmark tests on PostgreSQL.
  • tpcc-mysql: A common benchmark for MySQL that simulates an online transaction processing (OLTP) workload.
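
As a small taste, pgbench (shipped with PostgreSQL) can initialize its built-in TPC-B-like schema and then run a short timed test; the database name and numbers below are illustrative:

# Create and populate the benchmark tables in database "benchdb" at scale factor 10
pgbench -i -s 10 benchdb

# Run for 60 seconds with 10 client connections and 2 worker threads, reporting TPS at the end
pgbench -c 10 -j 2 -T 60 benchdb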

Best Practices for Reliable Benchmarking

Measuring performance accurately is a subtle art. Here are key principles to follow:

  1. Control the Environment:

    • Isolate your tests: Run benchmarks on a dedicated machine or a clean virtual machine/container if possible, to minimize interference from background processes, network traffic, or other users.
    • Disable power-saving features: Modern CPUs often throttle their speed to save power. Disable CPU frequency scaling, “turbo boost,” or other dynamic power management features that can introduce variability.
    • Consistent input data: Use the same input files, database states, or network conditions for all comparative tests. Randomness introduces noise.
  2. Run Multiple Times and Average: As we saw with hyperfine, a single measurement is meaningless. Run your benchmark many times and use the average (mean) and standard deviation to understand the typical performance and its variability. The more variable your results, the more runs you need.

  3. Account for Warm-up: Many systems (JVMs, JIT compilers, disk caches, databases) have initial overhead or “cold start” effects. Use hyperfine’s --warmup flag or design your scripts to perform some dummy operations before actual measurements begin.

  4. Measure the Right Thing:

    • Wall-clock time vs. CPU time: Understand the difference (real vs. user/sys). Wall-clock time is often what matters to the user, but CPU time reveals the computational burden.
    • Focus on relevant metrics: Are you optimizing for throughput (items/second), latency (time per request), memory usage, or cold start time? Tailor your metrics to your goal.
    • Beware of Micro-benchmarks: Testing a single line of code in isolation might yield impressive speedups, but if that line is rarely executed in a real-world scenario, the overall system performance won’t improve. Benchmark representative workloads.
  5. Statistical Significance:

    • Understand standard deviation (σ). A small standard deviation indicates consistent results. A large one suggests high variability or external factors influencing your benchmark.
    • When comparing two results, consider if the difference is statistically significant. hyperfine does this automatically, but if doing manual analysis, look into t-tests or similar.
  6. Document Everything: Record your:

    • Hardware specifications (CPU, RAM, disk type).
    • Operating system version and patch level.
    • Software versions (language runtime, libraries, compilers).
    • Exact benchmark commands and input parameters.
    • Environment variables or system configurations.
    • This allows for reproducibility and comparison over time.
  7. Visualize Your Results: Graphs (bar charts for comparisons, line graphs for trends) can make performance data much easier to interpret than raw numbers. Tools like Gnuplot, Matplotlib (Python), or even spreadsheets can help.

  8. Don’t Optimize Prematurely: This is a classic programming adage. Don’t spend time optimizing code that isn’t a bottleneck. Benchmark first to identify the real bottlenecks, then optimize, and then benchmark again to verify the improvement.

  9. Beware of Caching Effects: File system caches, CPU caches (L1, L2, L3), and even network caches can significantly skew results, especially on repeated runs. For disk I/O benchmarks, you may need to drop the page cache between runs on Linux with sync; echo 3 | sudo tee /proc/sys/vm/drop_caches (note that sudo echo 3 > /proc/sys/vm/drop_caches would not work, because the redirection is performed by your unprivileged shell). Use this with caution and never on production systems; a sketch combining it with hyperfine follows the note below.

Note: Benchmarking I/O-bound tasks can be particularly challenging due to the complex interplay of disk speeds, file system overhead, and operating system caching. Always confirm cache effects are handled appropriately for your specific test.
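
One way to combine several of these points is to let hyperfine’s --prepare hook drop the Linux page cache before every timing run of an I/O-heavy command. This requires root, is Linux-only, and should never be run on a production machine; a rough sketch:

# Each timing run of grep starts from a cold page cache
hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
          'grep -c "xyz" large_file.txt'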

Conclusion

Benchmarking is an essential skill in the modern tech landscape. It transforms guesswork into data-driven decision-making, allowing you to build faster, more efficient, and more reliable systems.

Start simple with time for quick checks. Graduate to hyperfine for rigorous statistical analysis and reliable comparisons. When you need to understand why something is slow, dive into profiling with tools like perf, strace, Valgrind, or language-specific profilers. And always complement your benchmarks with resource monitoring to understand the full system impact.

By adopting these tools and best practices, you’ll be well-equipped to measure, understand, and ultimately improve the performance of anything you put your mind to. Happy benchmarking!
