Benchmark

This page presents benchmark results for a selection of neural networks.

Our benchmark is a simple, open benchmark that measures the inference time of ONNX models on ENOT-Lite backends against native PyTorch inference and converts it to an FPS metric (frames per second; higher is better).

All values in the tables below are given in FPS. For natural language processing networks, FPS is equivalent to QPS (queries per second).
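To illustrate how inference time converts to FPS, the helper below is a minimal sketch (not the benchmark's actual code), assuming the model is run for a fixed number of iterations at a fixed batch size, with warm-up runs excluded from the measurement:

```python
import time

def throughput_fps(run_once, batch_size: int, iterations: int = 100, warmup: int = 10) -> float:
    """Measure inference throughput in frames per second.

    run_once: a zero-argument callable that performs one inference
    on a batch of `batch_size` inputs.
    """
    # Warm-up runs exclude one-time costs (engine build, CUDA context, caches).
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    # Total frames processed divided by total wall time.
    return batch_size * iterations / elapsed
```

For NLP models each "frame" is one query, so the same formula yields QPS.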

Benchmarks:

ResNet-50

input: (batch_size, 3, 224, 224)
batch_size = 1

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 473.4          | 1479.7           | 1849.6        | 340.6     | 209.1      |
| RTX 2080 Ti      | 514.1          | 1142.7           | 1463.7        | 317.4     | 215.7      |
| GTX 1080 Ti      | 439.5          | 438.9            | 882.7         | 282.5     | 231.2      |

batch_size = 16

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 1179.1         | 3454.8           | 6381.4        | 973.8     | 624.3      |
| RTX 2080 Ti      | 1198.6         | 3220.8           | 4713.9        | 842.3     | 770.4      |
| GTX 1080 Ti      | 994.4          | 970.4            | 2620.5        | 935.1     | 595.6      |

MobileNetV2

input: (batch_size, 3, 224, 224)
batch_size = 1

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 1188.4         | 2524.5           | 2934.6        | 779.1     | 254.0      |
| RTX 2080 Ti      | 1134.6         | 1952.4           | 2287.7        | 658.3     | 203.0      |
| GTX 1080 Ti      | 1122.2         | 1123.9           | 1647.9        | 649.3     | 343.3      |

batch_size = 16

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 3567.6         | 7117.5           | 11275.7       | 1855.5    | 1746.2     |
| RTX 2080 Ti      | 3171.4         | 5239.8           | 6434.3        | 2038.3    | 1855.3     |
| GTX 1080 Ti      | 3120.8         | 3113.4           | 6305.3        | 1411.6    | 1344.6     |

MobileNetV2-SSD

input: (batch_size, 3, 224, 224)
batch_size = 1

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 411.1          | 597.4            | 622.7         | 105.7     | 119.3      |
| RTX 2080 Ti      | 369.6          | 451.4            | 445.8         | 107.3     | 79.1       |
| GTX 1080 Ti      | 421.2          | 419.3            | 483.5         | 159.6     | 126.2      |

batch_size = 16

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 1520.1         | 2045.1           | 2419.5        | 211.9     | 230.6      |
| RTX 2080 Ti      | 1111.8         | 1349.5           | 1411.0        | 238.0     | 222.9      |
| GTX 1080 Ti      | 1485.7         | 1482.8           | 2128.6        | 275.3     | 256.7      |

YOLOv5s

input: (batch_size, 3, 640, 640)
batch_size = 1

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 236.5          | 514.9            | 601.3         | 158.8     | 148.5      |
| RTX 2080 Ti      | 255.1          | 412.5            | 441.4         | 172.0     | 84.5       |
| GTX 1080 Ti      | 201.8          | 201.4            | 281.9         | 127.3     | 111.6      |

batch_size = 16

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 316.4          | 637.9            | 777.9         | 196.0     | 120.7      |
| RTX 2080 Ti      | 331.6          | 573.8            | 649.4         | 243.6     | 126.5      |
| GTX 1080 Ti      | 268.9          | 268.1            | 440.4         | 170.0     | 138.4      |

ViT

Vision Transformer (ViT), patch size = 16, input resolution = 224.

input: (batch_size, 3, 224, 224)
batch_size = 1

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 175.2          | 318.1            | 310.6         | 172.9     | 175.7      |
| RTX 2080 Ti      | 159.6          | 373.2            | 374.9         | 175.3     | 132.6      |
| GTX 1080 Ti      | 123.1          | 122.5            | 122.3         | 135.5     | 108.3      |

batch_size = 16

| Device / Backend | TensorRT Float | TensorRT Float16 | TensorRT Int8 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|---------------|-----------|------------|
| RTX 3060 Ti      | 283.6          | 592.8            | 595.3         | 291.3     | 279.7      |
| RTX 2080 Ti      | 127.4          | 435.8            | 410.3         | 153.7     | 123.7      |
| GTX 1080 Ti      | 169.6          | 168.4            | 166.4         | 182.7     | 166.0      |

BERT

input length: 1941 characters

| Device / Backend | TensorRT Float | TensorRT Float16 | ONNX CUDA | Torch CUDA |
|------------------|----------------|------------------|-----------|------------|
| RTX 3060 Ti      | 119.8          | 220.3            | 99.2      | 91.8       |
| RTX 2080 Ti      | 100.0          | 257.0            | 94.7      | 73.4       |
| GTX 1080 Ti      | 43.9           | 39.4             | 21.8      | 25.1       |