Reliable Benchmarking:
Requirements and Solutions
Dirk Beyer, Stefan Löwe, and Philipp Wendler
2023-04-22 @ RRRR’23
Philipp Wendler 1 / 21
Benchmarking is Important
Evaluation of new approaches
Evaluation of tools
Competitions
Tool development (testing, optimizations)
Reliable, reproducible, and accurate results needed!
Philipp Wendler 2 / 21
Benchmarking is Hard
Influence of I/O
Networking
Distributed tools
User input
Not relevant
for many tools
(solver, verifiers, . . . )
Easy?
Different hardware
architectures
Heterogeneity of tools
Parallel benchmarks
Relevant!
Philipp Wendler 3 / 21
Benchmarking is Hard
Influence of I/O
Networking
Distributed tools
User input
Not relevant
for many tools
(solver, verifiers, . . . )
Easy?
Different hardware
architectures
Heterogeneity of tools
Parallel benchmarks
Relevant!
Philipp Wendler 3 / 21
Benchmarking is Hard
Influence of I/O
Networking
Distributed tools
User input
Not relevant
for many tools
(solver, verifiers, . . . )
Easy?
Different hardware
architectures
Heterogeneity of tools
Parallel benchmarks
Relevant!
Philipp Wendler 3 / 21
Goals
Reproducibility
Avoid non-deterministic effects and interferences
Provide defined set of resources
Accurate results
For solvers, verification tools, etc.
On Linux
Philipp Wendler 4 / 21
Checklist
1. Measure and Limit Resources Accurately
Time
Memory
2. Terminate Processes Reliably
3. Assign Cores Deliberately
4. Respect Non-Uniform Memory Access
5. Avoid Swapping
6. Isolate Individual Runs
Communication
File system
Philipp Wendler 5 / 21
Measure and Limit Resources Accurately
Wall time and CPU time
Define memory consumption
Size of address space? Too large
Size of heap? Too low
Size of resident set (RSS)?
Measure peak consumption (without sampling)
Always define memory limit for reproducibility
Include sub-processes
Philipp Wendler 6 / 21
Measuring CPU time with time
~$ time verifier
real Xs
user Ys
sys Zs
Verifier
Subprocess 1
Subprocess 2
Subprocess n
CPU time may not be
included in measurement
~$ ulimit -v 1048576 # 1 GiB
~$ verifier
Process may use 1 GiB
Process may use 1 GiB
Process may use 1 GiB
Process may use 1 GiB
What about shared memory?
~$ verifier
~$ kill <PID>
Process might keep running
and occupy resources
Philipp Wendler 7 / 21
Measuring CPU time with time
~$ time verifier
real Xs
user Ys
sys Zs
Verifier
Subprocess 1
Subprocess 2
Subprocess n
CPU time may not be
included in measurement
~$ ulimit -v 1048576 # 1 GiB
~$ verifier
Process may use 1 GiB
Process may use 1 GiB
Process may use 1 GiB
Process may use 1 GiB
What about shared memory?
~$ verifier
~$ kill <PID>
Process might keep running
and occupy resources
Philipp Wendler 7 / 21
Measuring CPU time with time
~$ time verifier
real Xs
user Ys
sys Zs
Verifier
Subprocess 1
Subprocess 2
Subprocess n
CPU time may not be
included in measurement
~$ ulimit -v 1048576 # 1 GiB
~$ verifier
Process may use 1 GiB
Process may use 1 GiB
Process may use 1 GiB
Process may use 1 GiB
What about shared memory?
~$ verifier
~$ kill <PID>
Process might keep running
and occupy resources
Philipp Wendler 7 / 21
Terminate Processes Reliably
~$ time verifier
real Xs
user Ys
sys Zs
Verifier
Subprocess 1
Subprocess 2
Subprocess n
CPU time may not be
included in measurement
~$ ulimit -v 1048576 # 1 GiB
~$ verifier
Process may use 1 GiB
Process may use 1 GiB
Process may use 1 GiB
Process may use 1 GiB
What about shared memory?
~$ verifier
~$ kill <PID>
Process might keep running
and occupy resources
Philipp Wendler 7 / 21
Isolate Individual Runs
Excerpt of start script taken from some verifier in
SV-COMP:
# . . . (tool started here)
killall z3 2> /dev/null
killall minisat 2> /dev/null
killall yices 2> /dev/null
Thanks for thinking of cleanup
But what if there are parallel runs?
Philipp Wendler 8 / 21
Isolate Individual Runs
Excerpt of start script taken from some verifier in
SV-COMP:
# . . . (tool started here)
killall z3 2> /dev/null
killall minisat 2> /dev/null
killall yices 2> /dev/null
Thanks for thinking of cleanup
But what if there are parallel runs?
Philipp Wendler 8 / 21
Isolate Individual Runs
Temp files with constant names like /tmp/mytool.tmp
collide
State stored in places like ~/.mytool
hinders reproducibility
Sometimes even auto-generated
Restrict changes to file system
as far as possible
Philipp Wendler 9 / 21
Cgroups
Linux kernel “control groups”
Reliable tracking of spawned processes
Resource limits and measurements per cgroup
CPU time
Memory
I/O etc.
Only solution on Linux
for race-free handling of multiple processes!
Philipp Wendler 10 / 21
Namespaces
Light-weight virtualization
Only one kernel running, no additional layers
Change how processes see the system
Identifiers like PIDs, paths, etc. can have different
meanings in each namespace
PID 42 can be a different process in each namespace
Directory
/
can be a different directory in each namespace
. . .
Can be used to build application containers
without possibility to escape
Usable without root access
Philipp Wendler 11 / 21
Benchmarking Containers
Encapsulate groups of processes
Limited resources (memory, cores)
Total resource consumption measurable
All other processes hidden
and no communication with them
Disabled network access
Adjusted file-system layout
Private /tmp
Writes redirected to
temporary RAM disk
Philipp Wendler 12 / 21
BenchExec
A Framework for Reliable Benchmarking
and Resource Measurement
Provides benchmarking containers
based on cgroups and namespaces
Allocates hardware resources appropriately
Low system requirements
(modern Linux kernel and cgroups access)
Philipp Wendler 13 / 21
BenchExec
Open source: Apache 2.0 License
Written in Python 3
https://github.com/sosy-lab/benchexec
Used in International Competition on Software Verification
(SV-COMP) and by StarExec
Originally developed
for software verification,
but applicable to arbitrary tools
Philipp Wendler 14 / 21
BenchExec Architecture
runexec
· · ·
runexec
benchexec
Bench.
Def.
Input
Files
XML
Results
table-generator
HTML
Table
TSV
Data
BenchExec
runexec
Benchmarks a single run of a tool (in container)
benchexec
Benchmarks multiple runs
table-generator
Generates TSV and interactive HTML tables
Philipp Wendler 15 / 21
BenchExec: runexec
Benchmarks a single run of a tool
Measures and limits resources using cgroups
Runnable as stand-alone tool and as Python module
Easy integration into other benchmarking frameworks
and infrastructure
Example:
runexec ––timelimit 100s ––memlimit 16GB
––cores 0-7,16-23 ––memoryNodes 0
–– <TOOL_CMD>
Philipp Wendler 16 / 21
BenchExec: runexec
Isolation
Resource Limitation /
Measurement
2 Process
Run
runexec
Isolation
Resource Limitation /
Measurement
2 Process
Run
runexec
CPU Cores
3 3 3 3
Memory
Philipp Wendler 17 / 21
BenchExec: benchexec
Benchmarks multiple runs
(e.g., a set of configurations against a set of files)
Allocates hardware resources
Can check whether tool result is as expected
for given input file and property
Philipp Wendler 18 / 21
BenchExec: table-generator
Aggregates results
Extracts statistic values from tool output
Generates TSV and interactive HTML tables (with plots)
Computes result differences and regression counts
Philipp Wendler 19 / 21
Please Read More
Dirk Beyer, Stefan Löwe, and Philipp Wendler.
Reliable Benchmarking:
Requirements and Solutions. [1]
STTT 2019
More details
Study of hardware influence on benchmarking results
Suggestions how to present results
(result aggregation, rounding, plots, etc.)
Philipp Wendler 20 / 21
Conclusion
Be careful when benchmarking!
Don’t use time, ulimit etc.
Always use cgroups and namespaces!
BenchExec
https://github.com/sosy-lab/benchexec
Philipp Wendler 21 / 21
References I
[1] Beyer, D., Löwe, S., Wendler, P.: Reliable benchmarking: Requirements and
solutions. Int. J. Softw. Tools Technol. Transfer 21(1), 1–29 (2019).
https://doi.org/10.1007/s10009-017-0469-y
Philipp Wendler 22 / 21
Assign Cores Deliberately
Hyper Threading:
Multiple threads sharing execution units
Shared caches
Philipp Wendler 23 / 21
Respect Non-Uniform Memory Access (NUMA)
Memory regions have different performance depending on
current CPU core
Hierarchical NUMA makes things worse
Philipp Wendler 24 / 21
Type lstopo on your machine (Ubuntu: package hwloc)
CPU
memory region
core
Philipp Wendler 25 / 21
Cgroups
Hierarchical tree of sets of processes
/
. . .
/user1
/benchmarks
/benchmarks/run1
5542 (bash)
5544 (firefox)
. . .
. . .
1130 (verifier)
1131 (subprocess1)
. . .
Philipp Wendler 26 / 21
BenchExec Configuration
Tool command line
Expected result
Resource limits
CPU time, wall time
Memory
Container setup
Network access
File-system layout
Where to put result files
Philipp Wendler 27 / 21
Directory Access Modes
Read Write temp Write persistent
existing content content content
hidden
read only
overlay
full access
Philipp Wendler 28 / 21