Reliable Benchmarking: Requirements and Solutions

Reliable Benchmarking:

Requirements and Solutions

Dirk Beyer, Stefan Löwe, and Philipp Wendler

2023-04-22 @ RRRR’23

Philipp Wendler 1 / 21

Benchmarking is Important

▶

Evaluation of new approaches

▶

Evaluation of tools

▶

Competitions

▶

Tool development (testing, optimizations)

Reliable, reproducible, and accurate results needed!

Philipp Wendler 2 / 21

Benchmarking is Hard

▶

Inﬂuence of I/O

▶

Networking

▶

Distributed tools

▶

User input

Not relevant

for many tools

(solver, veriﬁers, . . . )

Easy?

▶

Diﬀerent hardware

architectures

▶

Heterogeneity of tools

▶

Parallel benchmarks

Relevant!

Philipp Wendler 3 / 21

Benchmarking is Hard

▶

Inﬂuence of I/O

▶

Networking

▶

Distributed tools

▶

User input

Not relevant

for many tools

(solver, veriﬁers, . . . )

Easy?

▶

Diﬀerent hardware

architectures

▶

Heterogeneity of tools

▶

Parallel benchmarks

Relevant!

Philipp Wendler 3 / 21

Benchmarking is Hard

▶

Inﬂuence of I/O

▶

Networking

▶

Distributed tools

▶

User input

Not relevant

for many tools

(solver, veriﬁers, . . . )

Easy?

▶

Diﬀerent hardware

architectures

▶

Heterogeneity of tools

▶

Parallel benchmarks

Relevant!

Philipp Wendler 3 / 21

Goals

▶

Reproducibility

▶

Avoid non-deterministic eﬀects and interferences

▶

Provide deﬁned set of resources

▶

Accurate results

▶

For solvers, veriﬁcation tools, etc.

▶

On Linux

Philipp Wendler 4 / 21

Checklist

1. Measure and Limit Resources Accurately

▶

Time

▶

Memory

2. Terminate Processes Reliably

3. Assign Cores Deliberately

4. Respect Non-Uniform Memory Access

5. Avoid Swapping

6. Isolate Individual Runs

▶

Communication

▶

File system

Philipp Wendler 5 / 21

Measure and Limit Resources Accurately

▶

Wall time and CPU time

▶

Deﬁne memory consumption

▶

Size of address space? Too large

▶

Size of heap? Too low

▶

Size of resident set (RSS)?

▶

Measure peak consumption (without sampling)

▶

Always deﬁne memory limit for reproducibility

▶

Include sub-processes

Philipp Wendler 6 / 21

Measuring CPU time with “time”

~$ time verifier

real Xs

user Ys

sys Zs

Veriﬁer

Subprocess 1

Subprocess 2

Subprocess n

CPU time may not be

included in measurement

~$ ulimit -v 1048576 # 1 GiB

~$ verifier

Process may use 1 GiB

What about shared memory?

~$ verifier

~$ kill <PID>

Process might keep running

and occupy resources

Philipp Wendler 7 / 21

Measuring CPU time with “time”

~$ time verifier

real Xs

user Ys

sys Zs

Veriﬁer

Subprocess 1

Subprocess 2

Subprocess n

CPU time may not be

included in measurement

~$ ulimit -v 1048576 # 1 GiB

~$ verifier

Process may use 1 GiB

What about shared memory?

~$ verifier

~$ kill <PID>

Process might keep running

and occupy resources

Philipp Wendler 7 / 21

Measuring CPU time with “time”

~$ time verifier

real Xs

user Ys

sys Zs

Veriﬁer

Subprocess 1

Subprocess 2

Subprocess n

CPU time may not be

included in measurement

~$ ulimit -v 1048576 # 1 GiB

~$ verifier

Process may use 1 GiB

What about shared memory?

~$ verifier

~$ kill <PID>

Process might keep running

and occupy resources

Philipp Wendler 7 / 21

Terminate Processes Reliably

~$ time verifier

real Xs

user Ys

sys Zs

Veriﬁer

Subprocess 1

Subprocess 2

Subprocess n

CPU time may not be

included in measurement

~$ ulimit -v 1048576 # 1 GiB

~$ verifier

Process may use 1 GiB

What about shared memory?

~$ verifier

~$ kill <PID>

Process might keep running

and occupy resources

Philipp Wendler 7 / 21

Isolate Individual Runs

▶

Excerpt of start script taken from some veriﬁer in

SV-COMP:

# . . . (tool started here)

killall z3 2> /dev/null

killall minisat 2> /dev/null

killall yices 2> /dev/null

▶

Thanks for thinking of cleanup

▶

But what if there are parallel runs?

Philipp Wendler 8 / 21

Isolate Individual Runs

▶

Excerpt of start script taken from some veriﬁer in

SV-COMP:

# . . . (tool started here)

killall z3 2> /dev/null

killall minisat 2> /dev/null

killall yices 2> /dev/null

▶

Thanks for thinking of cleanup

▶

But what if there are parallel runs?

Philipp Wendler 8 / 21

Isolate Individual Runs

▶

Temp ﬁles with constant names like /tmp/mytool.tmp

collide

▶

State stored in places like ~/.mytool

hinders reproducibility

▶

Sometimes even auto-generated

▶

Restrict changes to ﬁle system

as far as possible

Philipp Wendler 9 / 21

Cgroups

▶

Linux kernel “control groups”

▶

Reliable tracking of spawned processes

▶

Resource limits and measurements per cgroup

▶

CPU time

▶

Memory

▶

I/O etc.

Only solution on Linux

for race-free handling of multiple processes!

Philipp Wendler 10 / 21

Namespaces

▶

Light-weight virtualization

▶

Only one kernel running, no additional layers

▶

Change how processes see the system

▶

Identiﬁers like PIDs, paths, etc. can have diﬀerent

meanings in each namespace

▶

PID 42 can be a diﬀerent process in each namespace

▶

Directory

can be a diﬀerent directory in each namespace

▶

. . .

▶

Can be used to build application containers

without possibility to escape

▶

Usable without root access

Philipp Wendler 11 / 21

Benchmarking Containers

▶

Encapsulate groups of processes

▶

Limited resources (memory, cores)

▶

Total resource consumption measurable

▶

All other processes hidden

and no communication with them

▶

Disabled network access

▶

Adjusted ﬁle-system layout

▶

Private /tmp

▶

Writes redirected to

temporary RAM disk

Philipp Wendler 12 / 21

BenchExec

▶

A Framework for Reliable Benchmarking

and Resource Measurement

▶

Provides benchmarking containers

based on cgroups and namespaces

▶

Allocates hardware resources appropriately

▶

Low system requirements

(modern Linux kernel and cgroups access)

Philipp Wendler 13 / 21

BenchExec

▶

Open source: Apache 2.0 License

▶

Written in Python 3

▶

https://github.com/sosy-lab/benchexec

▶

Used in International Competition on Software Veriﬁcation

(SV-COMP) and by StarExec

▶

Originally developed

for software veriﬁcation,

but applicable to arbitrary tools

Philipp Wendler 14 / 21

BenchExec Architecture

runexec

· · ·

runexec

benchexec

Bench.

Def.

Input

Files

XML

Results

table-generator

HTML

Table

TSV

Data

BenchExec

runexec

Benchmarks a single run of a tool (in container)

benchexec

Benchmarks multiple runs

table-generator

Generates TSV and interactive HTML tables

Philipp Wendler 15 / 21

BenchExec: runexec

▶

Benchmarks a single run of a tool

▶

Measures and limits resources using cgroups

▶

Runnable as stand-alone tool and as Python module

▶

Easy integration into other benchmarking frameworks

and infrastructure

▶

Example:

runexec ––timelimit 100s ––memlimit 16GB

––cores 0-7,16-23 ––memoryNodes 0

–– <TOOL_CMD>

Philipp Wendler 16 / 21

BenchExec: runexec

Isolation

Resource Limitation /

Measurement

2 Process

Run

runexec

Isolation

Resource Limitation /

Measurement

2 Process

Run

runexec

CPU Cores

3 3 3 3

Memory

Philipp Wendler 17 / 21

BenchExec: benchexec

▶

Benchmarks multiple runs

(e.g., a set of conﬁgurations against a set of ﬁles)

▶

Allocates hardware resources

▶

Can check whether tool result is as expected

for given input ﬁle and property

Philipp Wendler 18 / 21

BenchExec: table-generator

▶

Aggregates results

▶

Extracts statistic values from tool output

▶

Generates TSV and interactive HTML tables (with plots)

▶

Computes result diﬀerences and regression counts

Philipp Wendler 19 / 21

Please Read More

Dirk Beyer, Stefan Löwe, and Philipp Wendler.

Reliable Benchmarking:

Requirements and Solutions. [1]

STTT 2019

▶

More details

▶

Study of hardware inﬂuence on benchmarking results

▶

Suggestions how to present results

(result aggregation, rounding, plots, etc.)

Philipp Wendler 20 / 21

Conclusion

Be careful when benchmarking!

Don’t use time, ulimit etc.

Always use cgroups and namespaces!

BenchExec

https://github.com/sosy-lab/benchexec

Philipp Wendler 21 / 21

References I

[1] Beyer, D., Löwe, S., Wendler, P.: Reliable benchmarking: Requirements and

solutions. Int. J. Softw. Tools Technol. Transfer 21(1), 1–29 (2019).

https://doi.org/10.1007/s10009-017-0469-y

Philipp Wendler 22 / 21

Assign Cores Deliberately

▶

Hyper Threading:

Multiple threads sharing execution units

▶

Shared caches

Philipp Wendler 23 / 21

Respect Non-Uniform Memory Access (NUMA)

▶

Memory regions have diﬀerent performance depending on

current CPU core

▶

Hierarchical NUMA makes things worse

Philipp Wendler 24 / 21

Type lstopo on your machine (Ubuntu: package hwloc)

CPU

memory region

core

Philipp Wendler 25 / 21

Cgroups

▶

Hierarchical tree of sets of processes

. . .

/user1

/benchmarks

/benchmarks/run1

5542 (bash)

5544 (ﬁrefox)

. . .

1130 (veriﬁer)

1131 (subprocess1)

. . .

Philipp Wendler 26 / 21

BenchExec Conﬁguration

▶

Tool command line

▶

Expected result

▶

Resource limits

▶

CPU time, wall time

▶

Memory

▶

Container setup

▶

Network access

▶

File-system layout

▶

Where to put result ﬁles

Philipp Wendler 27 / 21

Directory Access Modes

Read Write temp Write persistent

existing content content content

hidden ✗ ✓ ✗

read only ✓ ✗ ✗

overlay ✓ ✓ ✗

full access ✓ ✗ ✓

Philipp Wendler 28 / 21