Benchmark module

class Benchmark(batch_size, onnx_model=None, onnx_input=None, no_data_transfer=False, enot_backend_runner=<class 'enot_lite.benchmark.backend_runner.EnotBackendRunner'>, torch_model=None, torch_input=None, torch_cpu_runner=<class 'enot_lite.benchmark.backend_runner.TorchCpuRunner'>, torch_cuda_runner=<class 'enot_lite.benchmark.backend_runner.TorchCudaRunner'>, backends=Device.CPU, warmup=50, repeat=50, number=50, inter_op_num_threads=None, intra_op_num_threads=None, openvino_num_threads=None, verbose=True)

An open, extendable tool for benchmarking inference.

It supports ENOT Lite and PyTorch backends out of the box, but can be extended for your own backends.

It measures the inference time of ONNX models on ENOT Lite backends and the native inference time of PyTorch models, and converts the measurements into an FPS (frames per second, the bigger the better) metric.

All benchmark source code is available in the benchmark module.

__init__(batch_size, onnx_model=None, onnx_input=None, no_data_transfer=False, enot_backend_runner=<class 'enot_lite.benchmark.backend_runner.EnotBackendRunner'>, torch_model=None, torch_input=None, torch_cpu_runner=<class 'enot_lite.benchmark.backend_runner.TorchCpuRunner'>, torch_cuda_runner=<class 'enot_lite.benchmark.backend_runner.TorchCudaRunner'>, backends=Device.CPU, warmup=50, repeat=50, number=50, inter_op_num_threads=None, intra_op_num_threads=None, openvino_num_threads=None, verbose=True)
Parameters
  • batch_size (Optional[int]) – Batch size value. It should be equal to the batch sizes of onnx_input and torch_input. Pass None if the model input does not contain a batch dimension (for example, some natural language processing networks); in this case batch_size is treated as 1.

  • onnx_model (str, ModelProto or None) – Path to ONNX model for benchmarking on ENOT Lite backends. Omit this parameter to skip benchmarking of ENOT Lite backends.

  • onnx_input (Optional[Any]) – Input for the ONNX model. If the model has only one input, pass it as a single value, for example onnx_input=np.random.rand(...). If the model has multiple inputs, pass them either as a list (or tuple), or as a mapping (dict) where keys are input names and values are input tensors.

  • no_data_transfer (bool) – Whether to skip the per-run data transfer (from CPU to GPU and back from GPU to CPU). This parameter is ignored for CPU backends.

  • enot_backend_runner (Optional[Type[BackendRunner]]) – BackendRunner subclass that will be used for ENOT Lite backends. Default is EnotBackendRunner.

  • torch_model (Optional[Any]) – PyTorch model for native benchmarking (torch.nn.Module). Omit this parameter to skip benchmarking of PyTorch backends.

  • torch_input (Optional[Any]) – Input for PyTorch model.

  • torch_cpu_runner (Optional[Type[BackendRunner]]) – BackendRunner subclass that will be used for the PyTorch CPU backend. Default is TorchCpuRunner.

  • torch_cuda_runner (Optional[Type[BackendRunner]]) – BackendRunner subclass that will be used for the PyTorch CUDA backend. Default is TorchCudaRunner.

  • backends (Union[Device, List[Union[Tuple, BackendType]]]) – Selects backends for benchmarking: Device.CPU selects all CPU backends, Device.GPU selects all GPU backends. You can also specify backends explicitly by type or by tuple, for example: [BackendType.ORT_CUDA, (BackendType.ORT_TENSORRT, ModelType.YOLO_V5)]. Default is Device.CPU.

  • warmup (int) – Number of warmup iterations (see BackendBenchmark). Default is 50.

  • repeat (int) – Number of repeat iterations (see BackendBenchmark). Default is 50.

  • number (int) – Number of iterations in each repeat iteration (see BackendBenchmark). Default is 50.

  • inter_op_num_threads (Optional[int]) – Number of threads used to parallelize the execution of the graph (across nodes). Default is None (set by the backend automatically). Affects CPU backends only.

  • intra_op_num_threads (Optional[int]) – Number of threads used to parallelize the execution within nodes. Default is None (set by the backend automatically). Affects CPU backends only.

  • openvino_num_threads (Optional[int]) – Length of the async task queue used by the OpenVINO backend. Increasing this parameter can either improve or degrade performance; tune it last when fine-tuning performance. Default is None (set by the backend). Affects CPU backends only.

  • verbose (bool) – Whether to print status while benchmarking. Default is True.

Examples

ResNet-50 benchmarking.

>>> import numpy as np
>>> import torch
>>> from torchvision.models import resnet50
>>> from enot_lite.benchmark import Benchmark
>>> from enot_lite.type import BackendType

Create PyTorch ResNet-50 model.

>>> resnet50 = resnet50()
>>> resnet50.cpu()
>>> resnet50.eval()
>>> torch_input = torch.ones((8, 3, 224, 224)).cpu()

Export it to ONNX.

>>> torch.onnx.export(
...     model=resnet50,
...     args=torch_input,
...     f='resnet50.onnx',
...     opset_version=11,
...     input_names=['input'],
... )

Configure Benchmark.

>>> benchmark = Benchmark(
...     batch_size=8,
...     onnx_model='resnet50.onnx',
...     onnx_input={'input': np.ones((8, 3, 224, 224), dtype=np.float32)},
...     torch_model=resnet50,
...     torch_input=torch_input,
...     backends=[BackendType.ORT_CUDA, BackendType.ORT_TENSORRT_FP16],
... )

Run Benchmark and print results.

>>> benchmark.run()
>>> benchmark.print_results()

print_results()

Prints table with benchmarking results and environment information.

Return type

None

property results: Dict

Benchmarking results.

Returns

Keys are backend names; values are tuples with the following structure: FPS, normalized time per sample (in ms), mean time per batch (in ms), standard deviation (in ms). A value can be None if benchmarking for that backend failed.

Return type

Dict
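
For example, a minimal sketch of consuming the results dictionary, reusing the benchmark object from the example above and assuming the documented tuple layout:

>>> for backend_name, result in benchmark.results.items():
...     if result is None:
...         print(f'{backend_name}: benchmarking failed')
...         continue
...     fps, ms_per_sample, ms_per_batch, std_ms = result
...     print(f'{backend_name}: {fps:.1f} FPS, {ms_per_batch:.2f} ms per batch (std {std_ms:.2f} ms)')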

run()

Starts benchmarking.

Return type

None

Building blocks

class BackendBenchmark(warmup, repeat, number)

Benchmarks inference time of backends.

This class is a building block of Benchmark, it measures inference time for one backend.

All components (inference backend, model, and input data) should be wrapped into an object that implements the BackendRunner interface to work with BackendBenchmark. ENOT Lite and PyTorch backends are already wrapped, but you can extend the benchmark by adding and registering a builder in BackendRunnerFactory for your own backend.

__init__(warmup, repeat, number)

See the benchmark() method to understand how the constructor parameters are used.

Parameters
  • warmup (int) – Number of warmup steps before benchmarking.

  • repeat (int) – Number of repeat steps (see timeit.Timer).

  • number (int) – Number of inference calls in each repeat step (see timeit.Timer).

benchmark(backend_runner)

Benchmarks backend using BackendRunner interface.

There are two main steps:
  • warmup: calls the run() method of BackendRunner warmup times

  • benchmark: calls the run() method number × repeat times and stores the execution time

The benchmark results are the mean time per batch (in ms) and the standard deviation per batch (in ms).

All measurements in the benchmark step are done with the help of a timeit.Timer object.

Parameters

backend_runner (BackendRunner) – Backend, model and input data wrapped in BackendRunner interface.

Returns

Mean time per batch (in ms) and standard deviation per batch (in ms).

Return type

Tuple[float, float]
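
A minimal usage sketch, assuming BackendBenchmark is importable from enot_lite.benchmark and TorchCpuRunner from enot_lite.benchmark.backend_runner (the paths shown in the defaults above):

>>> import torch
>>> from torchvision.models import resnet18
>>> from enot_lite.benchmark import BackendBenchmark
>>> from enot_lite.benchmark.backend_runner import TorchCpuRunner
>>> model = resnet18().eval()
>>> runner = TorchCpuRunner(torch_model=model, torch_input=torch.ones((1, 3, 224, 224)))
>>> mean_ms, std_ms = BackendBenchmark(warmup=5, repeat=3, number=10).benchmark(runner)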

class BackendRunnerFactory

Produces BackendRunner objects.

To extend Benchmark for your own backend, create a builder and register it with the help of register_builder(). A builder is a callable object that wraps your backend, model, and input data into a BackendRunner object. You can see how ENOT Lite and PyTorch backends are wrapped in the enot_lite.benchmark.backend_runner_builder module.

Note that BackendRunnerFactory is a singleton; to get the instance, call the constructor: BackendRunnerFactory().
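
A hedged sketch of registering a builder; the import path of BackendRunnerFactory, the keyword arguments the builder receives, and the BackendType member used here (MY_BACKEND) are illustrative assumptions:

>>> from enot_lite.benchmark import BackendRunnerFactory
>>> from enot_lite.benchmark.backend_runner import TorchCpuRunner
>>> from enot_lite.type import BackendType
>>> def my_builder(**kwargs):
...     # Wrap model and input into a BackendRunner object;
...     # the kwargs are formed and passed by Benchmark (keys assumed here).
...     return TorchCpuRunner(torch_model=kwargs['torch_model'], torch_input=kwargs['torch_input'])
>>> BackendRunnerFactory().register_builder(BackendType.MY_BACKEND, my_builder)  # hypothetical enum member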

__init__()
create(backend_type, **kwargs)

Creates a new BackendRunner object by using the registered builder for backend_type.

Parameters
  • backend_type (BackendType) – The type of the backend that the factory should wrap and produce.

  • **kwargs – Arbitrary keyword arguments that will be passed to the particular builder. These arguments should contain all information needed for successful object construction. Benchmark forms and passes these arguments to BackendRunnerFactory.

Returns

New BackendRunner object produced by the registered builder.

Return type

BackendRunner

register_builder(backend_type, builder)

Registers a new BackendRunner builder for the backend with the given backend_type.

Parameters
  • backend_type (BackendType) – The type of the backend for which new builder will be registered.

  • builder (Callable) – Builder that wraps a backend, model, and input data into a BackendRunner object.

Return type

None

class BackendRunner

Interface that is used by BackendBenchmark. Only one method needs to be implemented: run(), which wraps the inference call.

abstract run()

Wrapped inference call.

Return type

None
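
A minimal sketch of implementing the interface; the import paths of BackendRunner and BackendBenchmark are assumed from the defaults and module layout shown above:

>>> import time
>>> from enot_lite.benchmark import BackendBenchmark
>>> from enot_lite.benchmark.backend_runner import BackendRunner
>>> class SleepRunner(BackendRunner):
...     """Toy runner whose 'inference' is a fixed 1 ms delay."""
...     def run(self) -> None:
...         time.sleep(0.001)  # stands in for a real inference call
>>> mean_ms, std_ms = BackendBenchmark(warmup=5, repeat=3, number=10).benchmark(SleepRunner())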

class EnotBackendRunner(backend_instance, onnx_input)

Common implementation of BackendRunner interface for ENOT Lite backends.

Do not override the run() method; implement backend_run() instead.

__init__(backend_instance, onnx_input)
Parameters
  • backend_instance (backend.Backend) – ENOT Lite backend with embedded model.

  • onnx_input (Dict[str, Any]) – Input for model inference (model is already wrapped in backend_instance).

backend_run(backend, onnx_input)

Common implementation of how to infer an ONNX model.

Parameters
  • backend (backend.Backend) – ENOT Lite backend with embedded model.

  • onnx_input (Dict[str, Any]) – Model input.

Returns

Prediction.

Return type

Any

run()

Wrapped inference call.

Return type

None
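
For example, a hypothetical subclass that extends backend_run() while leaving run() untouched, as recommended above:

>>> import time
>>> from enot_lite.benchmark.backend_runner import EnotBackendRunner
>>> class LoggingEnotRunner(EnotBackendRunner):
...     """Hypothetical runner that prints the duration of every inference call."""
...     def backend_run(self, backend, onnx_input):
...         start = time.perf_counter()
...         result = super().backend_run(backend, onnx_input)
...         print(f'inference took {(time.perf_counter() - start) * 1000:.2f} ms')
...         return result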

class TorchCpuRunner(torch_model, torch_input)

Common implementation of BackendRunner interface for PyTorch on CPU.

Do not override the run() method; implement torch_run() instead.

__init__(torch_model, torch_input)
Parameters
  • torch_model (torch.nn.Module) – PyTorch model.

  • torch_input (torch.Tensor or something suitable for torch_model) – Input for torch_model.

run()

Wrapped inference call.

Return type

None

torch_run(model, inputs)

Common implementation of how to infer a PyTorch model.

Parameters
  • model (torch.nn.Module) – PyTorch model.

  • inputs (Any) – Input for model.

Returns

Prediction.

Return type

Any
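
For instance, a hypothetical subclass for a model that takes several positional inputs; only torch_run() is overridden, as recommended above:

>>> from enot_lite.benchmark.backend_runner import TorchCpuRunner
>>> class MultiInputTorchCpuRunner(TorchCpuRunner):
...     """Hypothetical runner for models called as model(a, b, ...)."""
...     def torch_run(self, model, inputs):
...         # inputs is assumed to be a tuple of tensors here
...         return model(*inputs)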

class TorchCudaRunner(torch_model, torch_input, no_data_transfer)

Common implementation of BackendRunner interface for PyTorch on CUDA.

Do not override the run() method; implement torch_run(), torch_input_to_cuda(), or torch_output_to_cpu() to extend this class.

Why do we explicitly transfer data from CPU to CUDA and from CUDA to CPU?

In a real-world application, data (images, sentences, etc.) resides on the CPU side (in RAM, on a hard drive, or in CPU caches). At the moment you start inference, the input data has to be transferred over the motherboard's bridges to the CUDA device (GPU) to perform computations more efficiently and decrease model inference latency. When the prediction is computed, the output data is transferred back from CUDA to CPU for further processing. Sometimes the data transfer time is comparable to the inference time, so it must be taken into account in benchmarking.

The data transfer described above is done automatically for ENOT Lite backends. For PyTorch on CUDA we explicitly measure the data transfer time from CPU to CUDA and back from CUDA to CPU to obtain consistent results.
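
As an illustration of the per-run work this runner accounts for (a sketch of the idea, not the actual implementation):

>>> import torch
>>> def one_timed_run(model, cpu_input):
...     cuda_input = cpu_input.cuda()   # CPU -> CUDA transfer
...     with torch.no_grad():
...         output = model(cuda_input)  # inference on the GPU
...     return output.cpu()             # CUDA -> CPU transfer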

__init__(torch_model, torch_input, no_data_transfer)
Parameters
  • torch_model (torch.nn.Module) – PyTorch model.

  • torch_input (torch.Tensor or something suitable for torch_model) – Input for torch_model.

  • no_data_transfer (bool) – Whether to skip the per-run data transfer (from CPU to GPU and back from GPU to CPU).

run()

Wrapped inference call.

Return type

None

torch_input_to_cuda(torch_input)

Common implementation of how to transfer PyTorch model input from CPU to CUDA.

Parameters

torch_input (torch.Tensor) – Tensor on CPU device.

Returns

Tensor on CUDA device.

Return type

torch.Tensor

torch_output_to_cpu(torch_output)

Common implementation of how to transfer PyTorch output (prediction) from CUDA to CPU.

Parameters

torch_output (Union[torch.Tensor, Iterable]) – PyTorch output on CUDA device.

Returns

Nothing; the outputs are only transferred to the CPU.

Return type

None

Raises

RuntimeError – If some part of torch_output is not a torch.Tensor or an Iterable. In this case, the user should implement the transfer of this object.

torch_run(model, inputs)

Common implementation of how to infer a PyTorch model.

Parameters
  • model (torch.nn.Module) – PyTorch model.

  • inputs (Any) – Input for model.

Returns

Prediction.

Return type

Any