How to do semantic document similarity using BERT

To compute the semantic similarity between two documents, we first get an embedding for each document using BERT, then calculate the cosine similarity score between the embeddings.

Let's see how to do semantic document similarity using BERT.

What is cosine similarity?

Cosine similarity is a metric for measuring the similarity of two documents. Given the vectors (embeddings) of two documents, it is the cosine of the angle between those vectors.

The smaller the angle between the vectors, the higher the cosine similarity: the vectors are close and point in nearly the same direction.

As the angle grows, the vectors point in increasingly different directions and the similarity drops.
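The definition above can be sketched directly in NumPy; the two-dimensional vectors here are made-up toy examples, not real embeddings:

```python
import numpy as np

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])   # same direction   -> similarity 1.0
c = np.array([0.0, 1.0])   # orthogonal       -> similarity 0.0

print(cosine(a, b))  # 1.0
print(cosine(a, c))  # 0.0
```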

What is BERT?

Bidirectional Encoder Representations from Transformers (BERT) is a technique for natural language processing pre-training developed by Google.

What is semantic similarity?

The semantic similarity of two text documents measures how contextually similar the two documents are.

What is embedding?

Words or phrases of a document are mapped to vectors of real numbers called embeddings.

What is sentence-transformers?

This framework provides an easy method to compute dense vector representations for sentences and paragraphs.

Now, let’s see the similarity between news headlines

We have a few headlines, which are of the category “business”, “commodity prices”, “technology” and “sports”.

We need to find similar news for each of the news headlines.

Let’s proceed,

Install sentence-transformers,

!pip install -U sentence-transformers

Import necessary python packages,

from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

Here are our news headlines. Put them in a list:

documents = [
             "Vodafone Wins ₹ 20,000 Crore Tax Arbitration Case Against Government",
             "Voda Idea shares jump nearly 15% as Vodafone wins retro tax case in Hague",
             "Gold prices today fall for 4th time in 5 days, down ₹6500 from last month high",
             "Silver futures slip 0.36% to Rs 59,415 per kg, down over 12% this week",
             "Amazon unveils drone that films inside your home. What could go wrong?",
             "IPHONE 12 MINI PERFORMANCE MAY DISAPPOINT DUE TO THE APPLE B14 CHIP",
             "Delhi Capitals vs Chennai Super Kings: Prithvi Shaw shines as DC beat CSK to post second consecutive win in IPL",
             "French Open 2020: Rafael Nadal handed tough draw in bid for record-equaling 20th Grand Slam"
]

Define the model. We use a pre-trained BERT model that has been fine-tuned for similar kinds of tasks.

Here we will use the bert-base model fine-tuned on the NLI dataset.

model = SentenceTransformer('bert-base-nli-mean-tokens')

Now, create the embedding for the news headlines,

text_embeddings = model.encode(documents, batch_size = 8, show_progress_bar = True)

Let's check the shape of text_embeddings:

np.shape(text_embeddings)

This shows (8, 768): there are 8 documents in total, and each document vector is of length 768.

Now, let’s find the cosine similarity between text_embeddings.

similarities = cosine_similarity(text_embeddings)
print('pairwise dense output:\n {}\n'.format(similarities))

This gives the similarity matrix below:

pairwise dense output:
 [[0.9999999  0.73151094 0.6046201  0.61174655 0.28593335 0.28101337
  0.5809742  0.60881454]
 [0.73151094 0.9999997  0.5602956  0.60428786 0.23841025 0.30871877
  0.5798251  0.5946494 ]
 [0.6046201  0.5602956  1.         0.85126555 0.2170505  0.33635983
  0.44095552 0.42231077]
 [0.61174655 0.60428786 0.85126555 1.0000001  0.24565016 0.39271736
  0.44883895 0.46855572]
 [0.28593335 0.23841025 0.2170505  0.24565016 1.0000004  0.34194955
  0.22930798 0.28988248]
 [0.28101337 0.30871877 0.33635983 0.39271736 0.34194955 1.0000002
  0.30893183 0.27376795]
 [0.5809742  0.5798251  0.44095552 0.44883895 0.22930798 0.30893183
  0.9999999  0.646995  ]
 [0.60881454 0.5946494  0.42231077 0.46855572 0.28988248 0.27376795
  0.646995   1.0000002 ]]

Now, sort this 2D NumPy array using argsort():

similarities_sorted = similarities.argsort()

similarities_sorted contains the indices of the sorted similarity values.

array([[5, 4, 6, 2, 7, 3, 1, 0],
       [4, 5, 2, 6, 7, 3, 0, 1],
       [4, 5, 7, 6, 1, 0, 3, 2],
       [4, 5, 6, 7, 1, 0, 2, 3],
       [2, 6, 1, 3, 0, 7, 5, 4],
       [7, 0, 1, 6, 2, 4, 3, 5],
       [4, 5, 2, 3, 1, 0, 7, 6],
       [5, 4, 2, 3, 1, 0, 6, 7]])
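argsort() sorts each row in ascending order and returns indices, so the last index in each row is the document itself (similarity ≈ 1) and the second-to-last is its closest neighbour. A toy 3×3 illustration:

```python
import numpy as np

sims = np.array([[1.00, 0.73, 0.60],
                 [0.73, 1.00, 0.56],
                 [0.60, 0.56, 1.00]])

order = sims.argsort()   # ascending indices per row
print(order)             # row 0 -> [2 1 0]: 0.60 < 0.73 < 1.00

# Last column is the row's own index; second-to-last is
# the most similar *other* document.
nearest = order[:, -2]
print(nearest)           # [1 0 0]
```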

Now, get the pairwise similarity score for each document's nearest neighbour:

# For each document, the last index in its argsort row is the document itself,
# so the second-to-last index is its most similar other document.
id_1 = []
id_2 = []
score = []
for index, array in enumerate(similarities_sorted):
    id_1.append(index)
    id_2.append(array[-2])
    score.append(similarities[index][array[-2]])

index_df = pd.DataFrame({'id_1' : id_1,
                          'id_2' : id_2,
                          'score' : score})
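The same pairing can also be done without the explicit loop, using the argsort trick in a vectorized way; a sketch on a toy 3×3 similarity matrix (column names match the DataFrame above):

```python
import numpy as np
import pandas as pd

sims = np.array([[1.00, 0.73, 0.60],
                 [0.73, 1.00, 0.56],
                 [0.60, 0.56, 1.00]])

rows = np.arange(len(sims))
nearest = sims.argsort()[:, -2]   # most similar other doc per row

index_df = pd.DataFrame({
    'id_1': rows,
    'id_2': nearest,
    'score': sims[rows, nearest],  # fancy indexing picks each pair's score
})
print(index_df)
```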

Here is the index_df,

id_1  id_2  score
0     1     0.731511
1     0     0.731511
2     3     0.851266
3     2     0.851266
4     5     0.341950
5     3     0.392717
6     7     0.646995
7     6     0.646995

pair-wise document similarity df

As you can see, documents 0 and 1 are both about the Vodafone tax news, hence their high similarity score.

Documents 2 and 3 are about gold and silver prices in the commodities market, thus their 0.85 similarity score.

Documents 4 and 5 are both tech news, but their contexts are quite different, hence the low similarity score. Documents 6 and 7 are both sports headlines (cricket and tennis), giving them a moderate score of about 0.65.

Here is the colab link.

Happy learning 🙂

My other article about BERT,

https://theaidigest.in/zero-shot-classification-using-huggingface-transformers-pipeline/

https://theaidigest.in/summarize-text-document-using-transformers-and-bert/


By Satyanarayan Bhanja

Machine learning engineer
