Faster transformer NLP pipeline using ONNX

See how ONNX can speed up CPU inference in the Hugging Face transformers NLP pipeline with only a few changes.

First, a quick overview of the terms used here.

What is transformers?

Transformers provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

What is the transformers pipeline?

The transformers pipeline is the simplest way to use a pretrained SOTA model for NLP tasks such as sentiment-analysis, question-answering, zero-shot classification, feature-extraction, and NER, using two lines of code.

What is ONNX?

ONNX stands for Open Neural Network Exchange, an open format for representing machine learning models. ONNX Runtime is a cross-platform inference and training accelerator compatible with many popular ML/DNN frameworks, including PyTorch, TensorFlow/Keras, and scikit-learn.

What are the advantages of ONNX Runtime?

– Improve inference performance for different types of ML models
– Reduce time and cost of training large models
– Train in Python but deploy into a C#/C++/Java app
– Run on different hardware and operating systems
– Support models created in several different frameworks

Now you have an overview of ONNX.

Let’s see how to use ONNX for a faster transformer NLP pipeline.

Install transformers and onnx_transformers in Colab.

!pip install transformers
!pip install git+

Import the pipeline function from onnx_transformers,

from onnx_transformers import pipeline

Now, let’s use it for various NLP tasks.

Sentiment analysis:

Pass onnx=True to the pipeline function. The call is otherwise the same as the original transformers pipeline; only the onnx parameter is added.

nlp = pipeline("sentiment-analysis", onnx=True)

Let’s use nlp to get the sentiment of a text,

nlp("I like this combo of chicken starters!")

This gives the result below,

[{'label': 'POSITIVE', 'score': 0.9932807683944702}]

Now, let’s see the inference speed,

%timeit nlp("I like this combo of chicken starters!")
10 loops, best of 3: 23.7 ms per loop

This is fast for CPU inference 🙂
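The %timeit magic only works inside IPython or a notebook such as Colab. Outside a notebook, a similar measurement can be sketched with the standard timeit module. The helper time_call and the stand-in function fake_nlp below are hypothetical names used only for illustration; in practice you would pass the real nlp pipeline object instead of fake_nlp.

```python
import timeit


def time_call(fn, *args, number=10, repeat=3):
    """Return the best average time per call, in milliseconds.

    Mirrors the "N loops, best of R" idea of %timeit: run `number` calls
    per round, repeat the round `repeat` times, and keep the fastest round.
    """
    # timeit.repeat returns one total time (in seconds) per round.
    totals = timeit.repeat(lambda: fn(*args), number=number, repeat=repeat)
    return min(totals) / number * 1000.0


# Hypothetical stand-in for the pipeline object, so this sketch runs anywhere.
def fake_nlp(text):
    return [{"label": "POSITIVE", "score": 1.0}]


best_ms = time_call(fake_nlp, "I like this combo of chicken starters!")
print(f"10 loops, best of 3: {best_ms:.3f} ms per loop")
```

Taking the minimum over rounds rather than the mean reduces the influence of background noise on the machine, which is the same reasoning behind the "best of 3" figure reported above.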

Question answering:

Question answering is the task of extracting an answer from a context paragraph, given a question about it.

Define the QA pipeline,

nlp_qa = pipeline('question-answering', onnx=True)

Let’s try this QA pipeline,

nlp_qa(context='Google, LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware.Google corporate headquarters located at Mountain View, California, United States.', 
       question='Where is Google based?')

This gives the answer below,

{'answer': 'Mountain View, California,',
 'end': 291,
 'score': 0.4882817566394806,
 'start': 265}
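The start and end values in this result look like character offsets into the context string. A small sketch that checks this by slicing the context with the numbers printed above (the result dict is copied from the output, not produced by running a model here):

```python
# Context string exactly as passed to the QA pipeline above.
context = (
    'Google, LLC is an American multinational technology company that '
    'specializes in Internet-related services and products, which include '
    'online advertising technologies, a search engine, cloud computing, '
    'software, and hardware.Google corporate headquarters located at '
    'Mountain View, California, United States.'
)

# Result copied from the output shown above.
result = {'answer': 'Mountain View, California,',
          'end': 291,
          'score': 0.4882817566394806,
          'start': 265}

# Slice the context with the reported offsets (end is exclusive).
span = context[result['start']:result['end']]
print(span)
print(span == result['answer'])
```

Knowing the offsets is handy when you want to highlight the answer inside the original paragraph instead of only displaying the answer text.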

Now, let’s check the inference speed of the QA ONNX pipeline,

%timeit nlp_qa(context='Google, LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware.Google corporate headquarters located at Mountain View, California, United States.', question='Where is Google based?')
1 loop, best of 3: 230 ms per loop

Now let’s try a different transformer model, mrm8488/bert-tiny-finetuned-squadv2, for QA,

nlp_qa = pipeline('question-answering', model="mrm8488/bert-tiny-finetuned-squadv2", onnx=True)

nlp_qa(context='Google, LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware.Google corporate headquarters located at Mountain View, California, United States.', 
       question='Where is Google based?')

This gives the answer below,

{'answer': 'Mountain View, California, United States.',
 'end': 305,
 'score': 0.017995649948716164,
 'start': 265}

Here is the QA inference performance of the BERT-tiny model,

%timeit nlp_qa(context='Google, LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware.Google corporate headquarters located at Mountain View, California, United States.', question='Where is Google based?')
10 loops, best of 3: 141 ms per loop
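To put the two QA timings side by side, here is a small worked calculation using only the numbers reported above (230 ms for the default QA model and 141 ms for the BERT-tiny model). The variable names are just for illustration.

```python
# Timings reported by %timeit above, in milliseconds per call.
default_ms = 230.0   # default question-answering model
tiny_ms = 141.0      # mrm8488/bert-tiny-finetuned-squadv2

# How many times faster the smaller model ran.
speedup = default_ms / tiny_ms

# Share of the per-call time that was saved.
saved_pct = (default_ms - tiny_ms) / default_ms * 100

print(f"speedup: {speedup:.2f}x, time saved: {saved_pct:.1f}%")
```

Note that in the outputs above the answer score also fell from about 0.49 to about 0.02 with the smaller model, so the speed gain should be weighed against answer confidence for your use case.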


Feature extraction:

The feature-extraction pipeline extracts the hidden states from the base transformer, which can be used as features in downstream tasks.

Define the feature-extraction pipeline,

nlp = pipeline("feature-extraction", onnx=True)

Let’s extract features for this text,

nlp('Google, LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware.Google corporate headquarters located at Mountain View, California, United States.')

It gives a tensor representation for the above sequence.
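One common way to turn per-token features into a single sentence vector is mean pooling, that is, averaging the token vectors. The sketch below assumes the features arrive as a nested list shaped [sequence][token][hidden]; that shape is an assumption about the pipeline output, and the tiny features list is made-up toy data rather than real model output. The helper name mean_pool is also hypothetical.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n_tokens = len(token_vectors)
    # zip(*...) walks the vectors dimension by dimension.
    return [sum(dim) / n_tokens for dim in zip(*token_vectors)]


# Toy stand-in for a feature-extraction result:
# 1 sequence, 3 tokens, hidden size 4 (real models use e.g. 768).
features = [[[1.0, 2.0, 3.0, 4.0],
             [3.0, 2.0, 1.0, 0.0],
             [2.0, 2.0, 2.0, 2.0]]]

sentence_vector = mean_pool(features[0])
print(sentence_vector)  # [2.0, 2.0, 2.0, 2.0]
```

The resulting fixed-length vector can then be fed to a downstream classifier or used for similarity search.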

Named entity recognition:

This pipeline extracts named entities for each word in the input sequence.

Define the NER pipeline,

nlp = pipeline("ner", onnx=True)

Extract named entities from the sequence below,

nlp('Google, LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware.Google corporate headquarters located at Mountain View, California, United States.')

Here are the entities,

[{'entity': 'I-ORG',
  'index': 1,
  'score': 0.9994143843650818,
  'word': 'Google'},
 {'entity': 'I-ORG', 'index': 2, 'score': 0.9844746589660645, 'word': ','},
 {'entity': 'I-ORG', 'index': 3, 'score': 0.998744547367096, 'word': 'LLC'},
 {'entity': 'I-MISC',
  'index': 6,
  'score': 0.9970664381980896,
  'word': 'American'},
 {'entity': 'I-MISC',
  'index': 13,
  'score': 0.9974018931388855,
  'word': 'Internet'},
 {'entity': 'I-ORG',
  'index': 38,
  'score': 0.9973472356796265,
  'word': 'Google'},
 {'entity': 'I-LOC',
  'index': 43,
  'score': 0.9949518442153931,
  'word': 'Mountain'},
 {'entity': 'I-LOC', 'index': 44, 'score': 0.9973859786987305, 'word': 'View'},
 {'entity': 'I-LOC',
  'index': 46,
  'score': 0.9987567067146301,
  'word': 'California'},
 {'entity': 'I-LOC',
  'index': 48,
  'score': 0.9979965686798096,
  'word': 'United'},
 {'entity': 'I-LOC',
  'index': 49,
  'score': 0.9937295317649841,
  'word': 'States'}]
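The output lists one entry per token, so "Mountain" and "View" appear separately. A minimal sketch of merging neighbouring tokens into entity spans follows, assuming that tokens with the same entity label and consecutive index values belong together. The group_entities helper is hypothetical, and the token list is copied from the output above with the scores left out for brevity.

```python
def group_entities(tokens):
    """Merge tokens with the same label and consecutive indices into spans."""
    groups = []
    for tok in tokens:
        last = groups[-1] if groups else None
        if (last is not None
                and last["entity"] == tok["entity"]
                and tok["index"] == last["end_index"] + 1):
            # Same label, directly adjacent: extend the current span.
            last["words"].append(tok["word"])
            last["end_index"] = tok["index"]
        else:
            # Otherwise start a new span.
            groups.append({"entity": tok["entity"],
                           "words": [tok["word"]],
                           "end_index": tok["index"]})
    # Naive join with spaces, so punctuation stays space-separated.
    return [(g["entity"], " ".join(g["words"])) for g in groups]


# Tokens copied from the NER output above (scores omitted).
tokens = [
    {"entity": "I-ORG", "index": 1, "word": "Google"},
    {"entity": "I-ORG", "index": 2, "word": ","},
    {"entity": "I-ORG", "index": 3, "word": "LLC"},
    {"entity": "I-MISC", "index": 6, "word": "American"},
    {"entity": "I-MISC", "index": 13, "word": "Internet"},
    {"entity": "I-ORG", "index": 38, "word": "Google"},
    {"entity": "I-LOC", "index": 43, "word": "Mountain"},
    {"entity": "I-LOC", "index": 44, "word": "View"},
    {"entity": "I-LOC", "index": 46, "word": "California"},
    {"entity": "I-LOC", "index": 48, "word": "United"},
    {"entity": "I-LOC", "index": 49, "word": "States"},
]

spans = group_entities(tokens)
for label, text in spans:
    print(label, "->", text)
```

With this rule, "Mountain View" and "United States" come out as single location spans, while "California" stays separate because the comma token between index 44 and 46 was not tagged.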


Zero-shot classification:

A zero-shot classification model classifies text into labels that it never saw during training.

Define the zero-shot classification pipeline and run it on a sample sequence,

classifier = pipeline("zero-shot-classification", onnx=True)
sequence = "For any budding cricketer, playing with or against MS Dhoni is a big deal"
candidate_labels = ["cricket", "football", "basketball"]
classifier(sequence, candidate_labels)

See how it correctly classifies the sequence as cricket,

{'labels': ['cricket', 'basketball', 'football'],
 'scores': [0.9873027801513672, 0.00657124537974596, 0.006125985644757748],
 'sequence': 'For any budding cricketer, playing with or against MS Dhoni is a big deal'}
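To use this result in code, you would typically pair each label with its score and pick the best one. A short sketch using the output above (the result dict is copied from that output, not produced by running the model here):

```python
# Result copied from the zero-shot output shown above.
result = {
    'labels': ['cricket', 'basketball', 'football'],
    'scores': [0.9873027801513672, 0.00657124537974596, 0.006125985644757748],
    'sequence': 'For any budding cricketer, playing with or against MS Dhoni is a big deal',
}

# Pair labels with scores and sort by score, highest first.
ranked = sorted(zip(result['labels'], result['scores']),
                key=lambda pair: pair[1], reverse=True)

top_label, top_score = ranked[0]
print(f"predicted: {top_label} ({top_score:.3f})")

# The three scores add up to roughly 1.
print(round(sum(result['scores']), 6))
```

Because the scores here add up to roughly 1, they appear to be normalized across the candidate labels, so adding or removing a candidate label would likely change the other scores as well.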

Here is the colab link,

