Now it’s easy to cluster text documents using BERT and K-means. We can apply the K-means algorithm to the sentence embeddings to cluster documents: sentences with similar embeddings end up in the same cluster.
We will use the sentence-transformers package, which wraps the Hugging Face Transformers library and adds extra functionality like semantic similarity and clustering on top of BERT embeddings.
Let’s see the basics first. BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing pre-training technique developed by Google.
Words or phrases of a document are mapped to vectors of real numbers called embeddings.
This framework provides an easy method to compute dense vector representations for sentences and paragraphs.
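For instance, here is a minimal sketch (not from this tutorial’s code, though it uses the same pre-trained model we load below) of how sentences become vectors that can be compared, with scikit-learn’s cosine_similarity standing in for the package’s own similarity utilities:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Each sentence is mapped to one dense vector (one row per sentence).
embeddings = model.encode(["Gold prices fell today", "Silver futures slipped"])
# Cosine similarity close to 1 means the sentences are semantically similar.
print(cosine_similarity(embeddings)[0, 1])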
Clustering is the task of grouping a set of objects so that objects in the same group are more similar to each other than to objects in other groups.
The K-means clustering algorithm aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Note that the number of clusters k needs to be set up front.
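To make that concrete, here is a toy sketch of K-means on four made-up 2-D points (illustrative only; the tutorial below clusters high-dimensional sentence embeddings in exactly the same way):

import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups: points near (0, 0) and points near (10, 10).
points = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(points)
print(kmeans.labels_)           # cluster id for each point, e.g. [0 0 1 1]
print(kmeans.cluster_centers_)  # the mean of each cluster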
Now, let’s cluster some news headlines.
We have a few news headlines.
Some of them are about sports, some are about commodity prices, some are about technology, and others are business news. Our task is to group these headlines into separate clusters.
Install sentence-transformers in Colab,
!pip install -U sentence-transformers
If you want to install locally,
pip install -U sentence-transformers
Now, import SentenceTransformer from sentence_transformers and KMeans from sklearn,
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
Now create the SentenceTransformer object using a model pre-trained for the STS benchmark (STSb) task.
embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
Let me put these news headlines into a list,
corpus = [
    "Vodafone Wins ₹ 20,000 Crore Tax Arbitration Case Against Government",
    "Voda Idea shares jump nearly 15% as Vodafone wins retro tax case in Hague",
    "Gold prices today fall for 4th time in 5 days, down ₹6500 from last month high",
    "Silver futures slip 0.36% to Rs 59,415 per kg, down over 12% this week",
    "Amazon unveils drone that films inside your home. What could go wrong?",
    "IPHONE 12 MINI PERFORMANCE MAY DISAPPOINT DUE TO THE APPLE B14 CHIP",
    "Delhi Capitals vs Chennai Super Kings: Prithvi Shaw shines as DC beat CSK to post second consecutive win in IPL",
    "French Open 2020: Rafael Nadal handed tough draw in bid for record-equaling 20th Grand Slam"
]
Generate an embedding for each of the news headlines,
corpus_embeddings = embedder.encode(corpus)
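The encode call returns a NumPy array with one row per headline. As a quick sanity check (a small addition, not in the original code), you can print its shape; DistilBERT-based models like this one produce 768-dimensional embeddings:

# Expect (8, 768): 8 headlines, 768 dimensions per embedding.
print(corpus_embeddings.shape)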
Now let’s cluster the news headlines using their BERT embeddings. We perform K-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5

# Define the k-means model.
clustering_model = KMeans(n_clusters=num_clusters)

# Fit the embeddings with k-means clustering.
clustering_model.fit(corpus_embeddings)

# Get the cluster id assigned to each news headline.
cluster_assignment = clustering_model.labels_
Here I want 5 clusters, therefore num_clusters = 5.
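One caveat: K-means starts from randomly initialized centroids, so the cluster ids (and occasionally the groupings themselves) can change between runs. If you want reproducible assignments, you can fix the seed using scikit-learn’s standard random_state parameter, sketched below:

# Fixing random_state makes the clustering repeatable across runs.
clustering_model = KMeans(n_clusters=num_clusters, random_state=42)
clustering_model.fit(corpus_embeddings)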
If you want to see which cluster id each news headline is assigned to, run the line below,
print(cluster_assignment)
which gives the output below,
array([2, 2, 1, 1, 3, 4, 0, 0], dtype=int32)
This means the first headline is assigned to cluster id 2, the second headline is also assigned to cluster id 2, and so on. Note that the ids themselves are arbitrary labels produced by K-means; only the grouping matters.
Now, let’s see each cluster and its news headlines,
clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i + 1)
    print(cluster)
    print("")
This shows the clusters as below,
Cluster 1
['Delhi Capitals vs Chennai Super Kings: Prithvi Shaw shines as DC beat CSK to post second consecutive win in IPL', 'French Open 2020: Rafael Nadal handed tough draw in bid for record-equaling 20th Grand Slam']

Cluster 2
['Gold prices today fall for 4th time in 5 days, down ₹6500 from last month high', 'Silver futures slip 0.36% to Rs 59,415 per kg, down over 12% this week']

Cluster 3
['Vodafone Wins ₹ 20,000 Crore Tax Arbitration Case Against Government', 'Voda Idea shares jump nearly 15% as Vodafone wins retro tax case in Hague']

Cluster 4
['Amazon unveils drone that films inside your home. What could go wrong?']

Cluster 5
['IPHONE 12 MINI PERFORMANCE MAY DISAPPOINT DUE TO THE APPLE B14 CHIP']
What are these clusters about? Let’s see. Cluster 1 is about sports news like cricket and tennis, while Cluster 2 is about commodity prices like gold and silver. Cluster 3 is about business news involving Vodafone, and Clusters 4 and 5 are about technology/gadget news.
As you can see, we can get meaningful clusters using BERT embeddings.
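If you want a quantitative sanity check instead of just eyeballing the groups, one option (not part of the original post) is scikit-learn’s silhouette_score, which works directly on the embeddings and the cluster labels:

from sklearn.metrics import silhouette_score

# Scores range from -1 to 1; higher means tighter, better-separated clusters.
score = silhouette_score(corpus_embeddings, cluster_assignment)
print(score)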
Here is the Google Colab link.
Happy learning 🙂
My other articles about BERT,
How to do semantic document similarity using BERT
Zero-shot classification using Huggingface transformers
Summarize text document using transformers and BERT
Follow me on Twitter, Instagram, and Pinterest for new post notifications.