Unleashing the Power of ONNX for Speedier SBERT Inference

Swaraj Patil
3 min read · Sep 25, 2023

SBERT, also known as Sentence-BERT, is a widely used approach for obtaining sentence embeddings that retain the contextual information within a sentence. However, generating these embeddings can be slow when dealing with large amounts of data. One common option is to use batch-based encoding, but batching alone does not always bring the inference time down enough. In this Medium blog post, we will explore the ONNX (Open Neural Network Exchange) framework and how it helps reduce the inference time of the model.

P.S. This article does not delve into the internal workings of ONNX. For more in-depth information, please consult the official ONNX documentation.

Let’s begin by installing the required libraries. We can use pip for the installation:

pip install onnx
pip install onnxruntime-gpu
pip install transformers
pip install torch

Once ONNX is installed, we can verify the installation using the snippet below.
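
A minimal check could look like the following; it simply prints the installed versions and the execution providers ONNX Runtime can see:

import onnx
import onnxruntime as ort

print("ONNX version:", onnx.__version__)
print("ONNX Runtime version:", ort.__version__)

# CUDAExecutionProvider should be listed if onnxruntime-gpu is set up correctly
print("Available providers:", ort.get_available_providers())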

In order to obtain sentence embeddings, we will use the IMDB dataset sourced from Kaggle. Specifically, we will focus on the “Overview of Movie” column to generate embeddings using SBERT. The time needed to create the embeddings will be measured for 1,000 sentences from the dataset.
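
As a sketch, loading the sentences with pandas could look like this; the file name and column name are illustrative and should be adjusted to match the actual Kaggle export:

import pandas as pd

# Illustrative path and column name for the Kaggle IMDB dataset
df = pd.read_csv("imdb_movies.csv")
sentences_1000 = df["overview"].dropna().astype(str).tolist()[:1000]
print(len(sentences_1000))   # 1000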

We will run four experiments, comparing Vanilla SBERT with the ONNX-converted model on both CPU and GPU:

  • Inference time for 1000 sentences using Vanilla SBERT (CPU).
  • Inference time for 1000 sentences using ONNX converted SBERT (CPU).
  • Inference time for 1000 sentences using Vanilla SBERT (GPU).
  • Inference time for 1000 sentences using ONNX converted SBERT (GPU).

The Sentence-BERT model we will use here is all-MiniLM-L6-v2.

We can load the Sentence-BERT model either through the Hugging Face transformers library or through the Sentence Transformers library; the output embeddings from both are the same. For our experiments, we will use the Hugging Face library. Remember that with the Hugging Face library, additional post-processing such as pooling and normalization is needed after obtaining the token embeddings. The exact steps are listed on the model page on Hugging Face; perform those steps to get the final sentence embeddings.
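
A minimal sketch of that recipe for all-MiniLM-L6-v2, following the steps described on its model card (mean pooling over the attention mask followed by L2 normalization):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def mean_pooling(token_embeddings, attention_mask):
    # Average token embeddings, ignoring padded positions via the attention mask
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

encoded = tokenizer(["A movie about space exploration."],
                    padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded)

# Pool and L2-normalize to get the final sentence embedding
embedding = mean_pooling(model_output.last_hidden_state, encoded["attention_mask"])
embedding = F.normalize(embedding, p=2, dim=1)
print(embedding.shape)   # torch.Size([1, 384])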

Let's first convert the model to ONNX format.
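
One way to do this is with torch.onnx.export; the sketch below reuses the tokenizer and model loaded above, and the output path, opset version, and input/output names are illustrative choices:

import torch
import onnx

# A dummy input is needed to trace the graph during export
dummy = tokenizer(["dummy sentence"], padding=True, truncation=True, return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "sbert_minilm.onnx",                              # output path (illustrative)
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={                                    # allow variable batch size and sequence length
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)

# Sanity-check the exported graph
onnx.checker.check_model(onnx.load("sbert_minilm.onnx"))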

Now that we have converted the Sentence-BERT model, let’s get the stats for both models.

Vanilla SBERT (CPU)

The inference time for the Vanilla SBERT model on the CPU is measured using the snippet below.
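
A minimal timing sketch, assuming sentences_1000 holds the 1,000 overviews loaded earlier and reusing the tokenizer and model from above; the reported number is the average latency per sentence:

import time
from tqdm import tqdm

model_cpu = model.to("cpu")

latencies = []
for sentence in tqdm(sentences_1000):
    encoded = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
    start = time.time()
    with torch.no_grad():
        _ = model_cpu(**encoded)
    latencies.append(time.time() - start)

print(f"PyTorch cpu Inference time = {sum(latencies) / len(latencies) * 1000:.4f} ms")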

100%|██████████| 1000/1000 [00:36<00:00, 27.62it/s]
PyTorch cpu Inference time = 34.2605 ms

ONNX Converted SBERT (CPU)

The inference time for the ONNX SBERT model on the CPU is measured using the snippet below.
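
A corresponding sketch using ONNX Runtime with the CPU execution provider; the model path matches the export step above:

import numpy as np
import onnxruntime as ort

session_cpu = ort.InferenceSession("sbert_minilm.onnx",
                                   providers=["CPUExecutionProvider"])

latencies = []
for sentence in tqdm(sentences_1000):
    encoded = tokenizer(sentence, padding=True, truncation=True, return_tensors="np")
    ort_inputs = {
        "input_ids": encoded["input_ids"].astype(np.int64),
        "attention_mask": encoded["attention_mask"].astype(np.int64),
    }
    start = time.time()
    _ = session_cpu.run(None, ort_inputs)
    latencies.append(time.time() - start)

print(f"OnnxRuntime cpu Inference time = {sum(latencies) / len(latencies) * 1000:.4f} ms")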

100%|██████████| 1000/1000 [00:16<00:00, 60.80it/s]
OnnxRuntime cpu Inference time = 15.5696 ms

Outputs

outputs_cpu[0][:,:10]    ## Vanilla SBERT CPU Output

array([[-0.06326339, 0.0414625 , -0.04707527, -0.03361899, -0.02562934,
0.03499832, 0.00804075, -0.05042004, 0.00215668, -0.03816812]],
dtype=float32)

ort_outputs_cpu[0][:,:10]   ## ONNX SBERT CPU Output

array([[-0.06326343, 0.04146247, -0.04707528, -0.033619 , -0.02562926,
0.03499835, 0.0080408 , -0.05042008, 0.00215669, -0.03816817]],
dtype=float32)

Vanilla SBERT (GPU)

The inference time for the Vanilla SBERT model on the GPU is measured using the snippet below.
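
The GPU loop mirrors the CPU one, with the model and inputs moved to CUDA; a synchronize call before stopping the timer keeps the measurement honest. A sketch:

model_gpu = model.to("cuda")

latencies = []
for sentence in tqdm(sentences_1000):
    encoded = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt").to("cuda")
    start = time.time()
    with torch.no_grad():
        _ = model_gpu(**encoded)
    torch.cuda.synchronize()   # wait for the GPU to finish before stopping the timer
    latencies.append(time.time() - start)

print(f"PyTorch cuda Inference time = {sum(latencies) / len(latencies) * 1000:.4f} ms")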

100%|██████████| 1000/1000 [00:07<00:00, 135.29it/s]
PyTorch cuda Inference time = 6.737 ms

ONNX Converted SBERT (GPU)

The inference time for the ONNX SBERT model on the GPU is measured using the snippet below.
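
For ONNX Runtime, the only change is the execution provider passed when creating the session. A sketch:

session_gpu = ort.InferenceSession(
    "sbert_minilm.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # CUDA first, CPU as fallback
)

latencies = []
for sentence in tqdm(sentences_1000):
    encoded = tokenizer(sentence, padding=True, truncation=True, return_tensors="np")
    ort_inputs = {
        "input_ids": encoded["input_ids"].astype(np.int64),
        "attention_mask": encoded["attention_mask"].astype(np.int64),
    }
    start = time.time()
    _ = session_gpu.run(None, ort_inputs)
    latencies.append(time.time() - start)

print(f"OnnxRuntime cuda Inference time = {sum(latencies) / len(latencies) * 1000:.4f} ms")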

100%|██████████| 1000/1000 [00:02<00:00, 373.49it/s]
OnnxRuntime cuda Inference time = 1.9466 ms

Outputs

outputs_gpu[0][:,:10]   ## Vanilla SBERT GPU Output

array([[-0.06326333, 0.04146247, -0.0470753 , -0.03361904, -0.02562935,
0.03499833, 0.00804079, -0.05042002, 0.00215669, -0.03816818]],
dtype=float32)

ort_outputs_gpu[0][:,:10]  ## ONNX SBERT GPU Output

array([[-0.06326336, 0.04146249, -0.04707528, -0.03361899, -0.02562931,
0.03499832, 0.0080408 , -0.05042004, 0.00215668, -0.03816817]],
dtype=float32)

Summary Table

Model           Device   Inference time (ms)
Vanilla SBERT   CPU      34.2605
ONNX SBERT      CPU      15.5696
Vanilla SBERT   GPU      6.737
ONNX SBERT      GPU      1.9466

Conclusion

Based on the results obtained, we can see that the ONNX-converted model takes significantly less time to generate the sentence embeddings, with no meaningful difference in the output values; the embeddings match up to small floating-point errors. The experiments were conducted on Google Colab with a T4 GPU. Similar or better speedups can be expected on other hardware as well.

The Jupyter notebook for the complete experiments is added in the GitHub repo below for further reference. In ONNX, we can also have a quantized version of SBERT, in which the weights are stored as int8; one can explore that as well.
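
As a rough sketch, dynamic quantization can be applied with the onnxruntime.quantization module; the output file name below is illustrative:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Convert the FP32 ONNX model into a dynamically quantized int8 model
quantize_dynamic(
    model_input="sbert_minilm.onnx",
    model_output="sbert_minilm_int8.onnx",
    weight_type=QuantType.QInt8,
)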
